mediacloud / metadata-lib

How Media Cloud approaches extracting metadata from online news stories
Apache License 2.0

Add routine to return configured requests.Session object #88

Closed by philbudne 1 month ago

philbudne commented 2 months ago

In my work to get the requests-based queue fetcher in story-indexer returning the same articles as the Scrapy-based batch fetcher, I found I needed to make several tweaks (to headers and to SSL/TLS settings). The code is currently in https://github.com/philbudne/story-indexer/blob/main/indexer/requests_arcana.py but I'd like to be able to use it in rss-fetcher (fetching RSS files) and in web-search (searching for sitemap candidates) without duplicating the code.

In moving to metadata-lib, it will need to take the User-Agent string to use as an argument (it could default to mcmetadata.urls.DEFAULT_USER_AGENT, but we'll always pass MEDIA_CLOUD_USER_AGENT).

The file already contains warnings that the code is fragile and unclean (bypassing mypy complaints when monkey-patching, etc.) and that it should NEVER be used when you actually care about privacy/security, but it's possible those issues (especially the last) need to be stated even more plainly (for example, by naming the function insecure_legacy_session?).

I'm otherwise agnostic about where it belongs, or exactly what it's called.
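
To make the proposed interface concrete, here is a rough sketch of what such a routine might look like. It is not the code from requests_arcana.py: the function name is just the one floated above, the DEFAULT_USER_AGENT value is a placeholder standing in for mcmetadata.urls.DEFAULT_USER_AGENT, and since the actual header and SSL/TLS tweaks aren't listed here, disabling certificate verification is used purely as a stand-in for them.

```python
import requests

# Placeholder standing in for mcmetadata.urls.DEFAULT_USER_AGENT (assumption:
# the real value lives in metadata-lib and is not reproduced here).
DEFAULT_USER_AGENT = "example-bot/0.1"


def insecure_legacy_session(user_agent: str = DEFAULT_USER_AGENT) -> requests.Session:
    """Return a requests.Session configured for fetching public news pages.

    WARNING: NEVER use this when you actually care about privacy/security.
    """
    session = requests.Session()

    # Header tweak: send the caller-supplied User-Agent on every request.
    session.headers["User-Agent"] = user_agent

    # Stand-in for the SSL/TLS tweaks (assumption, not the real ones):
    # skip certificate verification so misconfigured sites still respond.
    session.verify = False

    return session
```

Callers in rss-fetcher or web-search would then do something like `session = insecure_legacy_session(MEDIA_CLOUD_USER_AGENT)` and reuse that session for all fetches.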