ukwa / ukwa-services

Deployment configuration for all UKWA services stacks.
Apache License 2.0

Twitter crawls not working properly. #99

Closed anjackson closed 1 year ago

anjackson commented 1 year ago

In recent weeks, we've been getting a LOT of odd 404s from Twitter.

Looking at e.g. Grafana, the Recent Screenshots are all 404 images!

It took a while to investigate, but it turned out that essentially all the web-rendered Twitter seeds were failing: they returned a 404 even when the profile in question was valid and returned a 200 when viewed with a web browser.

I initially assumed we were being blocked as a robot, but it was unclear how we were being identified. For example, I ran requests from curl using the same IP address and the same User Agent (even with precisely the same set of request headers), and got a 200 where the crawler was seeing a 404.

Eventually, I determined that calls going via warcprox were going wrong, whereas running the render service directly against the live web worked fine. The warcprox service had been unchanged for a while, and was built under python:3.7-slim, so I started experimenting with it to see if I could bring it up to date. Upon rebuilding it under python:3.10-slim, the problem disappeared!
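
For anyone wanting to repeat the direct-versus-proxied comparison described above, here is a minimal sketch using only the Python standard library. It is not the tooling used in this investigation. The function names, the proxy address and the `ca_file` parameter are all assumptions for illustration. Because a recording proxy re-signs HTTPS traffic, the proxied route will usually fail certificate verification unless the proxy's CA certificate is passed in.

```python
import ssl
import urllib.error
import urllib.request


def build_opener(proxy_url=None, ca_file=None):
    """Build an opener that goes direct (proxy_url=None) or via a proxy."""
    if proxy_url is None:
        # An empty mapping disables any proxy configured in the environment.
        proxy_handler = urllib.request.ProxyHandler({})
    else:
        proxy_handler = urllib.request.ProxyHandler(
            {"http": proxy_url, "https": proxy_url}
        )
    # ca_file=None keeps the system trust store; a proxy that re-signs
    # HTTPS traffic needs its own CA certificate supplied here (assumption).
    context = ssl.create_default_context(cafile=ca_file)
    https_handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(proxy_handler, https_handler)


def fetch_status(opener, url, user_agent):
    """Return the HTTP status code for url, including 4xx/5xx codes."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on error statuses; the code itself is what we want.
        return err.code


def compare_routes(url, proxy_url, user_agent, ca_file=None):
    """Fetch the same URL directly and via the proxy, and report both codes."""
    direct = fetch_status(build_opener(), url, user_agent)
    proxied = fetch_status(build_opener(proxy_url, ca_file), url, user_agent)
    return {"direct": direct, "proxied": proxied, "differs": direct != proxied}
```

Calling `compare_routes(url, "http://localhost:8000", user_agent, ca_file=...)` with a seed URL and the crawler's User Agent would return both status codes side by side. The proxy address here is a placeholder, not the deployed configuration. A result such as 200 direct and 404 proxied points at the proxy's outbound connection rather than at the request itself.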

It's not clear exactly what happened, but as we ruled out the HTTP request, headers and IP address, the only thing left is the TLS layer. Presumably, something in the older stack (lack of support for newer TLS? Old certs?) was making Twitter unhappy. Why this surfaced as a 404 remains a mystery.
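
One way to probe the TLS hypothesis is to print what the Python runtime in each image offers by default. This is a hedged sketch, not a diagnosis. `tls_summary` is a hypothetical helper, and any differences between the two images only show where the stacks diverge, not which difference Twitter objected to.

```python
import ssl


def tls_summary():
    """Summarise the default client-side TLS settings of this Python runtime."""
    context = ssl.create_default_context()
    return {
        # The OpenSSL build that Python was linked against.
        "openssl_version": ssl.OPENSSL_VERSION,
        # Whether this build supports TLS 1.3 at all.
        "has_tlsv1_3": ssl.HAS_TLSv1_3,
        # Protocol range the default context will negotiate.
        "minimum_version": context.minimum_version.name,
        "maximum_version": context.maximum_version.name,
        # Cipher suites offered by default, sorted for easy diffing.
        "cipher_names": sorted(c["name"] for c in context.get_ciphers()),
    }


def print_tls_summary():
    """Print the summary one field per line so two runs can be diffed."""
    for key, value in tls_summary().items():
        print(f"{key}: {value}")
```

Running `print_tls_summary()` inside both a python:3.7-slim and a python:3.10-slim container and diffing the output would show whether the OpenSSL version, protocol range or cipher list changed between the two builds.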

While working to try and fix this, I also ended up deploying an updated ukwa/webrender-puppeteer:2.3.3, along with the new ukwa/docker-warcprox:2.7.14.2. This now appears to be behaving much better.

-- p.s. I also considered a shift to Browsertrix Crawler for Twitter URLs. See here for some configuration needed to override WARC names.

While this may be needed in the future, it is a much larger change, so it will not be pursued at this time.

anjackson commented 1 year ago

The Recent Seeds look much healthier now.