mediacloud / web-search

Code that drives the public web-based tools for the Media Cloud Online News Archive and Directory.
https://search.mediacloud.org
Apache License 2.0

integrate new user-agent in web fetches #636

Open rahulbot opened 4 months ago

rahulbot commented 4 months ago

The new v0.12.0 of mcmetadata includes mcmetadata.webpages.MEDIA_CLOUD_USER_AGENT. We want to use it as the user-agent wherever this code requests content over HTTP (for instance, when rescraping a source for RSS feeds).
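For illustration only, a rough sketch of what "use it as the user-agent" could look like with the standard library. The `build_request` helper and the fallback placeholder string are hypothetical and not part of this repo; the only name taken from mcmetadata is the constant mentioned above. Actual call sites here may go through other libraries that need the header passed differently.

```python
import urllib.request

try:
    # Constant named in this issue; shipped in mcmetadata v0.12.0.
    from mcmetadata.webpages import MEDIA_CLOUD_USER_AGENT
except ImportError:
    # Hypothetical placeholder so the sketch still runs without mcmetadata.
    MEDIA_CLOUD_USER_AGENT = "example-user-agent/0.0 (placeholder)"


def build_request(url: str, user_agent: str = MEDIA_CLOUD_USER_AGENT) -> urllib.request.Request:
    """Return a Request carrying the given User-Agent header (no network I/O)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://example.com/feed.xml")
# urllib normalizes header names, so the key is stored as "User-agent".
print(req.get_header("User-agent"))
```

Passing the user-agent as a defaulted parameter keeps it overridable in tests while making the shared constant the default everywhere.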

pgulley commented 4 days ago

I don't think the web-search platform actually performs fetches anywhere. I think we've implemented this across the rest of the project. Is this closable? @rahulbot

philbudne commented 4 days ago

> I don't think the web-search platform actually performs fetches anywhere

(re)scraping?

Last time I looked, the code had been duplicated. Hopefully that's no longer the case?

rahulbot commented 4 days ago

This code uses feed-seeker (our package) to rescrape a source for new RSS feed URLs:

https://github.com/mediacloud/web-search/blob/5a232dfff4d1399ec5332cf0c35d277f6271a7f1/mcweb/backend/sources/tasks.py#L75C21-L75C39

Does feed-seeker accept an optional user-agent string?