mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

Use feed <sy:updatePeriod> and <sy:updateFrequency> to set feeds.next_fetch_attempt #17

Open philbudne opened 1 year ago

philbudne commented 1 year ago

This details an idea suggested in Issue #8.

About 16K of the 45K recently fetched feeds provide these tags (columns: feed count, sy:updatePeriod, sy:updateFrequency):

15902 hourly 1
    171 hourly 30
     18 hourly None
     18 daily 1
      9 hourly 12
      8 hourly 60
      3 None None
      3 hourly 2
      2 hourly 3
      2 dayly 1
      1 hourly 6
      1 hourly 4
      1 hourly 0.1
      1 daily 2
      1 always 1

If we want to poll feeds more often than once a day, this provides an indication of how often each feed is likely to be updated; see https://web.resource.org/rss/1.0/modules/syndication/
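A rough sketch of how the two tag values might be turned into a poll interval. Per the syndication module, sy:updateFrequency is the number of updates per sy:updatePeriod, with defaults of "daily" and 1. The function name, the 30 minute floor, the one day ceiling and the approximate month/year lengths are all assumptions for illustration, not anything currently in rss-fetcher. Unrecognized values such as "dayly" and "always" are treated as unusable.

```python
from datetime import timedelta
from typing import Optional

# Period lengths in seconds; "monthly" and "yearly" are approximations.
PERIOD_SECONDS = {
    "hourly": 3600,
    "daily": 24 * 3600,
    "weekly": 7 * 24 * 3600,
    "monthly": 30 * 24 * 3600,
    "yearly": 365 * 24 * 3600,
}

# Hypothetical limits, so a feed claiming "hourly 60" is not polled every minute.
MIN_INTERVAL = timedelta(minutes=30)
MAX_INTERVAL = timedelta(days=1)


def update_interval(period: Optional[str],
                    frequency: Optional[str]) -> Optional[timedelta]:
    """Return a suggested poll interval, or None if the tags are unusable."""
    if period is None and frequency is None:
        return None  # feed supplied neither tag
    # Module defaults: period "daily", frequency 1.
    seconds = PERIOD_SECONDS.get((period or "daily").strip().lower())
    if seconds is None:
        return None  # unrecognized period, e.g. "dayly" or "always"
    try:
        freq = float(frequency) if frequency is not None else 1.0
    except (TypeError, ValueError):
        return None
    if not freq > 0:  # also rejects NaN
        return None
    # Frequency is "updates per period", so the interval is period / frequency.
    interval = timedelta(seconds=seconds / freq)
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
```

With these assumed limits, "hourly 1" gives one hour, "hourly 30" is clamped up to 30 minutes, and a None result means the caller falls back to whatever default interval it already uses.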

If used with https://github.com/mediacloud/backup-rss-fetcher/issues/16 we might not fetch the feed document each time we poll, so we would need to save some indication of the (next) poll interval in the feeds table.

Even on its own, a fetch_interval column in the feeds table would allow linear or exponential back-off in the case of fetch failure: use the current interval for the next attempt, then increase the stored value (by incrementing, multiplying by a constant, or squaring).
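A minimal sketch of that back-off, assuming fetch_interval is stored as a whole number of minutes. The function names, the default step and factor, and the one week cap are hypothetical. Note that squaring depends on the unit chosen (60 minutes squared is 3600 minutes), so a cap matters most for that strategy.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

MAX_INTERVAL_MINUTES = 7 * 24 * 60  # assumed cap: one week


def backoff(current: int, strategy: str = "multiply",
            step: int = 60, factor: float = 2.0) -> int:
    """Return the enlarged interval (minutes) to store after a failed fetch."""
    if strategy == "increment":    # linear back-off
        grown = current + step
    elif strategy == "multiply":   # exponential back-off
        grown = current * factor
    elif strategy == "square":     # grows much faster; unit dependent
        grown = current * current
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return min(int(grown), MAX_INTERVAL_MINUTES)


def schedule_after_failure(now: datetime,
                           fetch_interval: int) -> Tuple[datetime, int]:
    """Use the current interval for the next attempt, then grow the stored one.

    Returns (next_fetch_attempt, new fetch_interval).
    """
    next_fetch_attempt = now + timedelta(minutes=fetch_interval)
    return next_fetch_attempt, backoff(fetch_interval)


if __name__ == "__main__":
    when, interval = schedule_after_failure(datetime.now(timezone.utc), 60)
    print(when.isoformat(), interval)
```

On a successful fetch the stored value would presumably be reset to the feed's normal interval; that reset is not shown here.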