mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory

Use HTTP etag and last-modified headers to avoid fetching unchanged feed files #16

Closed: philbudne closed this 4 months ago

philbudne commented 1 year ago

This details an idea suggested in Issue #8

HTTP provides two mechanisms to inhibit re-fetching unchanged pages: Last-Modified/If-Modified-Since and ETag/If-None-Match. In both cases, an HTTP GET response provides the first header (Last-Modified or ETag), which can be saved in the feeds row and presented on subsequent GET requests (via If-Modified-Since and If-None-Match).
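For illustration, a minimal sketch of that round trip using the requests library directly (the feed_url name and the 30-second timeout are just placeholders):

import requests

# initial fetch: remember the validators the server returned
resp = requests.get(feed_url, timeout=30)
etag = resp.headers.get("ETag")
last_modified = resp.headers.get("Last-Modified")

# later fetch: present them back; a 304 means the body was not re-sent
conditional = {}
if etag:
    conditional["If-None-Match"] = etag
if last_modified:
    conditional["If-Modified-Since"] = last_modified
resp = requests.get(feed_url, headers=conditional, timeout=30)
feed_unchanged = (resp.status_code == 304)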

At least 36K of the 45K recently fetched feeds provide at least one of these headers, and the PyPI feedparser module makes it trivial to use either/both, IF you pass feedparser.parse a URL. However, feedparser.parse does not take a timeout. Two possibilities:

1) Pass feedparser.parse a URL (and have it perform all the HTTP interactions) and set a Unix alarm to deliver a signal whose handler raises a Python exception to abort the fetch:

import signal

import feedparser

class AlarmException(Exception):
    """
    class for alarm timeout exception
    """

def alarm_handler(signum, frame):
    """
    signal handler for SIGALRM (setitimer/alarm)
    """
    raise AlarmException()

signal.signal(signal.SIGALRM, alarm_handler)
try:
    signal.alarm(TIMEOUT)
    fpret = feedparser.parse(URL, etag=feeds.etag, modified=feeds.http_modified)
    signal.alarm(0)  # cancel the pending alarm
    if fpret.status == 304:
        # feed has not changed: count as success!
        ...
    elif fpret.status == 200:
        # save the validators for the next fetch
        feeds.etag = fpret.etag
        feeds.http_modified = fpret.modified
    else:
        # failed
        ...
except AlarmException:
    # request timed out
    ...

2) Add the conditional-request headers to the existing requests call (or change to some other HTTP interface if that is not possible).
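A rough sketch of what option 2 might look like, keeping requests in charge of the HTTP exchange and handing the body to feedparser (reuses the feeds.etag/feeds.http_modified columns, URL, and TIMEOUT from the sketch above; untested):

import feedparser
import requests

headers = {}
if feeds.etag:
    headers["If-None-Match"] = feeds.etag
if feeds.http_modified:
    headers["If-Modified-Since"] = feeds.http_modified

resp = requests.get(URL, headers=headers, timeout=TIMEOUT)
if resp.status_code == 304:
    # feed unchanged: no body was transferred, so skip hashing and parsing
    ...
elif resp.status_code == 200:
    # save the validators, then parse the content as before
    feeds.etag = resp.headers.get("ETag")
    feeds.http_modified = resp.headers.get("Last-Modified")
    fpret = feedparser.parse(resp.content)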

rahulbot commented 1 year ago

Right now the code uses requests to do the HTTP GET, and then passes the returned content to feedparser. That is so that we can (among other things) compute a hash of the fetched file.

This could be re-architected of course, but that is how the process is right now.

philbudne commented 1 year ago

Regarding the "fetched file hash": the current production system calculates hashes on just the URLs extracted from the feed. Conditional fetches offer the possibility of eliminating the transfer entirely, making life better for both us and the source.

feedparser hands back HTTP status, so the adjustments may be small (or not); only experimentation will show for sure!

rahulbot commented 4 months ago

The comments on the PR make this sound completed. Closing.