philbudne closed this issue 4 months ago
Right now the code uses `requests` to do the HTTP GET, and then passes the content returned to `feedparser`. That is so that we can record `FetchEvent`s. This could be re-architected, of course, but that is how the process works right now.
Regarding the "fetched file hash": the current production system calculates hashes on just the URLs extracted from the feed. Conditional fetches offer the possibility of eliminating the transfer entirely, making life better for both us and the source.
feedparser hands back the HTTP status, so the adjustments may be small (or not); only experimentation will tell for sure!
The comments on the PR make this sound completed. Closing.
This details an idea suggested in Issue #8.
HTTP provides two mechanisms to inhibit re-fetching unchanged pages: Last-Modified/If-Modified-Since and ETag/If-None-Match. In both cases, an HTTP GET response provides the first header (Last-Modified or ETag), which can be saved in the `feeds` row and presented on subsequent GET requests (via If-Modified-Since and If-None-Match).

At least 36K of 45K recently fetched feeds provide at least one of them, and the PyPI feedparser module makes it trivial to use either/both, IF you pass `feedparser.parse` a URL. However, `feedparser.parse` does not take a timeout. Two possibilities:

1) Pass `feedparser.parse` a URL (and have it perform all the HTTP interactions), and set a Unix `alarm` to deliver a signal whose handler raises a Python exception to abort a hung fetch.

2) Add the conditional-request goo to the `requests` call ourselves (or change to some other HTTP interface if that is not possible).