openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Pass proper user-agent in `stream_file` when host is `upload.wikimedia.org` #214

Open benoit74 opened 1 week ago

benoit74 commented 1 week ago

For files hosted on upload.wikimedia.org, we must comply with their User-Agent policy at https://meta.wikimedia.org/wiki/User-Agent_policy

Doing so at the scraperlib level, in stream_file (the main method many scrapers use to download files / assets), would avoid having to do it in every scraper (and forgetting about it over and over).

benoit74 commented 1 week ago

That being said, I'm not sure this is really straightforward to implement.

Each scraper should pass its name and version to scraperlib so that we can set the header properly.

We also need a contact, which probably depends more on who runs the scraper than on the scraper itself.

Not sure this is so easy to implement in the end.
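
To make the discussion concrete, here is a minimal sketch of how the header could be chosen per host. It is not scraperlib's current API: `build_user_agent`, `headers_for`, and the `user_agent` parameter are hypothetical names, and the `name/version (contact)` layout is an assumption loosely based on the policy page. The idea is that the scraper supplies its name and version, whoever runs it supplies the contact, and the library only injects the header for upload.wikimedia.org when the caller has not set one.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Hosts for which a descriptive User-Agent is required
# (assumption: only the host named in this issue).
WIKIMEDIA_HOSTS = {"upload.wikimedia.org"}


def build_user_agent(scraper_name: str, scraper_version: str, contact: str) -> str:
    """Hypothetical helper: format a User-Agent as 'name/version (contact)'."""
    return f"{scraper_name}/{scraper_version} ({contact})"


def headers_for(
    url: str,
    headers: dict[str, str] | None = None,
    *,
    user_agent: str | None = None,
) -> dict[str, str]:
    """Hypothetical helper: return a copy of `headers` with a User-Agent
    added when `url` points at a Wikimedia upload host.

    A User-Agent already present in `headers` is left untouched.
    """
    merged = dict(headers or {})
    host = urlsplit(url).hostname  # lowercased by urlsplit, None if absent
    has_ua = any(key.lower() == "user-agent" for key in merged)
    if host in WIKIMEDIA_HOSTS and not has_ua:
        if user_agent is None:
            # Fail loudly instead of silently sending a generic User-Agent.
            raise ValueError(f"a User-Agent is required to download from {host}")
        merged["User-Agent"] = user_agent
    return merged
```

A function like stream_file could call `headers_for(url, headers, user_agent=...)` before issuing the request. Raising when no User-Agent is available is one possible design choice; logging a warning and proceeding would be another.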