learningequality / ricecooker

Python library for creating Kolibri channels and uploading to Studio
https://ricecooker.readthedocs.io/
MIT License

No explicit header set for the DOWNLOAD_SESSION #483

Open rtibbles opened 3 months ago

rtibbles commented 3 months ago

Observed behavior

The DOWNLOAD_SESSION that is used to download resources sets no explicit User-Agent header. This is a problem, for example, when downloading from Wikimedia sites, because of their User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy

Expected behavior

Ideally, we would send the kind of User-Agent that the Wikimedia policy spells out. We already retrieve, from Studio, the email of the user whose API token we are running with, so we should reuse it to set the header.

With that in place, we would then set the User-Agent to:

f"Ricecooker/{ricecooker.__version__} bot ({user_email})"
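A minimal sketch of how that header could be applied, assuming DOWNLOAD_SESSION behaves like a requests.Session (that is, it exposes a mutable `headers` mapping). The helper names `build_user_agent` and `set_user_agent`, the version string, and the email address are hypothetical placeholders, not existing ricecooker API:

```python
from types import SimpleNamespace


def build_user_agent(version: str, user_email: str) -> str:
    """Build the User-Agent string in the format proposed in this issue."""
    return f"Ricecooker/{version} bot ({user_email})"


def set_user_agent(session, version: str, user_email: str) -> str:
    """Set the User-Agent on a session-like object and return the value used.

    `session` only needs a dict-like `headers` attribute, as requests.Session has.
    """
    user_agent = build_user_agent(version, user_email)
    session.headers.update({"User-Agent": user_agent})
    return user_agent


# Use a real requests.Session when the library is installed; otherwise fall
# back to a stand-in object so the sketch still runs.
try:
    import requests

    session = requests.Session()
except ImportError:
    session = SimpleNamespace(headers={})

# Placeholder values: the real ones would come from ricecooker.__version__
# and from the email that Studio returns for the API token.
set_user_agent(session, "0.0.0", "user@example.com")
print(session.headers["User-Agent"])
```

In ricecooker itself this would presumably be called once on the shared session, after the user's email has been retrieved from Studio; where exactly that call belongs is left open here.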

User-facing consequences

Attempts to scrape without setting these headers may be treated as malicious.

Steps to reproduce

Attempt to download any file from a Wikimedia site.

Context

Ricecooker develop branch

nikkuAg commented 3 months ago

Hey, is this issue still open? I would like to work on this

rtibbles commented 3 months ago

Absolutely @nikkuAg - I will assign you, thanks for volunteering!

MisRob commented 3 weeks ago

Hi @nikkuAg, are you still planning to work on this?