mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

Further tweaking of User-Agent string? #83

Closed philbudne closed 9 months ago

philbudne commented 9 months ago

It seems that NPR's CDN doesn't like the new UA string @NullPxl came up with, if it comes from our IP (at UMass). It looks like it works if I add something like " Firefox/47.0" to the end. This increases our level of deception (and might only work temporarily). Should we consider doing this after consulting with researchers? Is it worth reaching out to NPR web folk?
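A minimal sketch of how the comparison above could be reproduced, assuming only the Python standard library. The base UA value is a placeholder (the actual new string is not quoted in this thread), and `status_for` and `compare_user_agents` are hypothetical helper names, not part of metadata-lib. Only the `" Firefox/47.0"` suffix comes from the comment above.

```python
import urllib.error
import urllib.request


def status_for(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # A block from the CDN (e.g. 403) arrives as an HTTPError; report its code.
        return err.code


def compare_user_agents(url, base_ua, suffix=" Firefox/47.0", fetch=status_for):
    """Fetch the same URL with and without the suffix and map each UA to its status."""
    candidates = [base_ua, base_ua + suffix]
    return {ua: fetch(url, ua) for ua in candidates}


if __name__ == "__main__":
    # Placeholder UA: substitute the string the fetcher really sends.
    BASE_UA = "Mozilla/5.0 (compatible; example-bot)"
    for ua, status in compare_user_agents("https://www.npr.org/", BASE_UA).items():
        print(status, repr(ua))
```

Results depend on the source IP as well as the UA, so a run from outside the UMass network may not show the same difference.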

NullPxl commented 9 months ago

NPR seems to use Akamai GHost (specifically bot manager), so my guess is that making small changes to the UA string will only be a temporary fix. Trying to continuously fight Akamai's detection negates the original goal of honesty and being good internet citizens, so in my opinion we should reach out to whoever handles NPR's network security to see if the UMass IP can be put on an allowlist.

rahulbot commented 9 months ago

FYI, IA got back to us: the fetcher they're using for the Media Cloud URL feed identifies itself as `Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot)`. They have a robust system for fetching from different IPs, but I don't remember the details.

rahulbot commented 9 months ago

I think @NullPxl makes a good case for keeping things as is for now (with this more standard-looking new UA). Let's hand off the idea of contacting NPR to the research team and see if they have the capacity/motivation to take it on. Closing for now. Please re-open if this UA behaves in any newly odd way.