In my work to get the requests-based queue fetcher in story-indexer returning the same articles as the Scrapy-based batch fetcher, I found I needed to make several tweaks (header and SSL/TLS related). The code is currently in https://github.com/philbudne/story-indexer/blob/main/indexer/requests_arcana.py but I'd like to be able to use it in rss-fetcher (fetching RSS files) and in web-search (searching for sitemap candidates) without duplicating the code.
In moving to metadata-lib, it will need to take the User-Agent string as an argument (it could default to mcmetadata.urls.DEFAULT_USER_AGENT, but we'll always pass MEDIA_CLOUD_USER_AGENT).
The file already contains warnings that the code is fragile and unclean (bypassing mypy complaints when monkey-patching, etc.), and that it should NEVER be used when you actually care about privacy/security, but it's possible those issues (especially the last) need to be stated even more plainly (perhaps by naming the function something like insecure_legacy_session?).
I'm otherwise agnostic about where it belongs, or exactly what it's called.
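
For concreteness, here is a minimal sketch of what the interface described above might look like. It is not the code in requests_arcana.py: the function name is only the one floated above, DEFAULT_USER_AGENT is a local stand-in for mcmetadata.urls.DEFAULT_USER_AGENT with a made-up value, and the actual header and SSL/TLS tweaks are deliberately left out.

```python
import requests

# Stand-in for mcmetadata.urls.DEFAULT_USER_AGENT; the value is a placeholder
# invented for this sketch, not the real default.
DEFAULT_USER_AGENT = "example-bot/0.0 (placeholder)"


def insecure_legacy_session(user_agent: str = DEFAULT_USER_AGENT) -> requests.Session:
    """Return a requests Session that sends the given User-Agent.

    NOT for use where privacy or security actually matters.
    """
    session = requests.Session()
    # Session-level headers are merged into every request made with the session.
    session.headers["User-Agent"] = user_agent
    # The header and SSL/TLS tweaks from requests_arcana.py would be applied
    # here; they are omitted from this sketch.
    return session
```

Under these assumptions, callers such as rss-fetcher or web-search would call insecure_legacy_session(MEDIA_CLOUD_USER_AGENT) rather than relying on the default.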