mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

be smarter about how frequently to update each feed #8

Closed: rahulbot closed this issue 5 months ago

rahulbot commented 2 years ago

It sounds like there are some update-frequency fields in RSS. Let's investigate those and see if they can inform a smarter polling rate.

philbudne commented 2 years ago

From https://web.resource.org/rss/1.0/modules/syndication/

updatePeriod Describes the period over which the channel format is updated. Acceptable values are: hourly, daily, weekly, monthly, yearly. If omitted, daily is assumed.

updateFrequency Used to describe the frequency of updates in relation to the update period. A positive integer indicates how many times in that period the channel is updated. For example, an updatePeriod of daily, and an updateFrequency of 2 indicates the channel format is updated twice daily. If omitted a value of 1 is assumed.

updateBase Defines a base date to be used in concert with updatePeriod and updateFrequency to calculate the publishing schedule. The date format takes the form: yyyy-mm-ddThh:mm
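
For concreteness, here's a minimal sketch (mine, not from the thread) of turning those fields into a polling interval, applying the spec defaults of daily and 1; the function name and the 30/365-day values for monthly/yearly are assumptions.

```python
from datetime import timedelta

# Rough length of each updatePeriod value; monthly/yearly are
# approximations, since the spec doesn't pin them down.
PERIODS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}

def implied_poll_interval(update_period=None, update_frequency=None):
    """Interval between updates implied by sy:updatePeriod and
    sy:updateFrequency, applying the spec defaults (daily, 1)."""
    period = PERIODS.get((update_period or "daily").lower(), timedelta(days=1))
    frequency = max(int(update_frequency or 1), 1)
    return period / frequency

# updatePeriod=daily + updateFrequency=2 -> an update every 12 hours
assert implied_poll_interval("daily", 2) == timedelta(hours=12)
```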

rahulbot commented 1 year ago

Suggestion on how to pursue this? Perhaps we should audit a large set of randomly selected feeds to see how widespread adoption is before continuing to design an architecture that supports this.

philbudne commented 1 year ago

As for determining how often the fields of interest are provided, I was thinking it would be nice to do a run where the fetcher saves the feed contents in a file named based on the feed id, to provide a corpus for answering any number of different questions. I'd also be interested in seeing the HTTP headers returned, for example to see how many supply "Last-Modified" and "ETag" (tho the former may not be an indication of whether the server honors conditional fetches).
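
As an aside on those headers, conditional fetching from the client side is straightforward; here's a minimal sketch using the requests library, with storing the etag/last_modified values per feed left to the caller:

```python
import requests

def conditional_fetch(url, etag=None, last_modified=None):
    """Fetch url, sending If-None-Match / If-Modified-Since when we have
    cached values; returns None if the server says nothing changed."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # unchanged since last fetch
    # caller should save resp.headers.get("ETag") and
    # resp.headers.get("Last-Modified") for the next conditional request
    return resp
```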

I was thinking the simplest way to allow both per-feed rates and human-triggered "fetch this feed soon" requests would be a next_fetch_time/deadline field in the table: that lets each feed reschedule its own next check, adjusting for any number of factors.

And when a human wants to trigger fetching, the deadline field can be set to zero (or the datetime epoch origin).

And have the query that SELECTs feeds filter on queued != 't' and use "ORDER BY deadline".

And to manage the queued flag:

  1. reset all table entries to queued = 'f' on (re)start
  2. set table entries to queued = 't' when queued to Celery
  3. set queued = 'f' when the fetch attempt finishes

A query for queued = 't' AND last_fetch_attempt < a_while_ago could be used to find "stuck" entries (and raise an alert so we can investigate why it happened).
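
Put together, a minimal sketch of that lifecycle as parameterized SQL (Postgres flavor); the table and column names here are illustrative, not the actual schema:

```python
# step 1: reset everything on (re)start
RESET_ON_RESTART = "UPDATE feeds SET queued = 'f'"

# step 2: mark a feed as handed to Celery
MARK_QUEUED = "UPDATE feeds SET queued = 't' WHERE id = %s"

# step 3: clear the flag when the fetch attempt finishes
MARK_DONE = """
UPDATE feeds SET queued = 'f', last_fetch_attempt = NOW()
WHERE id = %s
"""

# feeds ready to fetch, earliest deadline first
PICK_FEEDS = """
SELECT id FROM feeds
WHERE queued != 't' AND deadline <= NOW()
ORDER BY deadline
LIMIT %s
"""

# human-triggered "fetch this feed soon": zero out the deadline
FETCH_SOON = "UPDATE feeds SET deadline = 'epoch' WHERE id = %s"

# "stuck" entries: queued long ago but never finished
FIND_STUCK = """
SELECT id FROM feeds
WHERE queued = 't' AND last_fetch_attempt < NOW() - INTERVAL '2 hours'
"""
```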

rahulbot commented 1 year ago

A couple of ideas in there to consider:

  1. File Logging I think that's a good idea. It could be implemented by adding a new file-logging env-var boolean to activate a mode where the system saves the latest RSS text file to disk (named by mc_feed_id?), along with the header info; see the sketch after this list. Can you split this off into a new issue so we can decide which of us should jump on adding it?

  2. Feed-Level Fetch Frequency Control The current last_fetch_failures is clearly a kludge. The idea of saving something to indicate the next time the system should fetch the feed is a much more flexible design, for the reasons you describe. Making this a next_fetch_attempt_deadline is reasonably easy to understand, and we can easily default it at a system level to 1 day.

  3. Preventing Duplicate Fetch Collisions A good use of the proposed solution from #9.
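
Here's the sketch mentioned under point 1: a hypothetical file-logging mode, where the env-var names, directory, and helper are all invented for illustration:

```python
import json
import os

SAVE_DIR = os.environ.get("RSS_SAVE_DIR", "feed-dumps")  # hypothetical

def maybe_save_feed(mc_feed_id, body: bytes, headers: dict):
    """If the (hypothetical) RSS_SAVE_FILES env var is set, save the raw
    feed text and its HTTP headers to disk, named by feed id."""
    if os.environ.get("RSS_SAVE_FILES", "").lower() not in ("1", "true", "yes"):
        return
    os.makedirs(SAVE_DIR, exist_ok=True)
    with open(os.path.join(SAVE_DIR, f"{mc_feed_id}.rss"), "wb") as f:
        f.write(body)
    with open(os.path.join(SAVE_DIR, f"{mc_feed_id}.headers.json"), "w") as f:
        json.dump(dict(headers), f, indent=2)
```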

Overall this is smart, easy to understand, and should lend itself to reasonably quick database queries to list the next feeds to fetch. Would this be in lieu of the proposed fetch_events table we had discussed previously to track "jobs", or in addition to it?

philbudne commented 1 year ago
> 1. File Logging

Created https://github.com/mediacloud/backup-rss-fetcher/issues/14

> 2. Feed-Level Fetch Frequency Control The current last_fetch_failures is clearly a kludge. The idea of saving something to indicate the next time the system should fetch the feed is a much more flexible design, for the reasons you describe. Making this a next_fetch_attempt_deadline is reasonably easy to understand, and we can easily default it at a system level to 1 day.

The failure count is still useful, both for calculating the next deadline and for deciding when to give up trying. Otherwise we would have to query the activity log on each failure to see the past history.
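
For illustration, one hypothetical way the failure count could feed into both the next deadline and the give-up decision; the doubling, the cap, and the limit below are assumptions, not the project's actual policy:

```python
import datetime as dt

DEFAULT_POLL = dt.timedelta(days=1)  # the system-level default discussed above
MAX_BACKOFF = dt.timedelta(days=30)  # assumption: cap on backoff
GIVE_UP_AFTER = 10                   # assumption: consecutive-failure limit

def next_deadline(failures: int):
    """Double the default interval per consecutive failure, up to a cap;
    return None to signal that we should stop trying this feed."""
    if failures >= GIVE_UP_AFTER:
        return None
    backoff = min(DEFAULT_POLL * (2 ** failures), MAX_BACKOFF)
    return dt.datetime.now(dt.timezone.utc) + backoff
```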

> 3. Preventing Duplicate Fetch Collisions A good use of the proposed solution from #9.

> Overall this is smart, easy to understand, and should lend itself to reasonably quick database queries to list the next feeds to fetch. Would this be in lieu of the proposed fetch_events table we had discussed previously to track "jobs", or in addition to it?

I was thinking in addition to the history/activity log, which I guess I was thinking of as purely for human consumption. That said, having a "started" status that is modified into succeeded or failed when the job ends would keep the state information (at the cost of a hairier query).
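
To make that concrete, here's a hypothetical fetch_events shape with a status column, plus the "hairier" query that derives each feed's current state from its most recent row (Postgres DISTINCT ON; all names illustrative):

```python
CREATE_FETCH_EVENTS = """
CREATE TABLE IF NOT EXISTS fetch_events (
    id SERIAL PRIMARY KEY,
    feed_id INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'started',  -- 'started' | 'succeeded' | 'failed'
    started_at TIMESTAMP NOT NULL DEFAULT NOW(),
    finished_at TIMESTAMP
)
"""

# current state per feed = its latest event row
LATEST_STATUS = """
SELECT DISTINCT ON (feed_id) feed_id, status, started_at
FROM fetch_events
ORDER BY feed_id, started_at DESC
"""
```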

P.S. I was curious about the history of "hairy" for difficult. It turns out to have a long history: https://english.stackexchange.com/questions/121134/why-do-we-describe-a-problem-or-experience-as-hairy

philbudne commented 5 months ago

Implemented (and perhaps revised) long ago. Poll frequency is adaptive, based on duplicates seen: too few duplicates is taken to mean we're not polling often enough and are at risk of missing stories, while too many means we can poll less often. The logic is in tasks.py _check_auto_adjust.
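
For a flavor of that adjustment, a minimal sketch; the thresholds, step size, and bounds here are illustrative, and the real logic lives in tasks.py _check_auto_adjust:

```python
import datetime as dt

MIN_INTERVAL = dt.timedelta(minutes=30)  # assumption
MAX_INTERVAL = dt.timedelta(days=7)      # assumption
ADJUST_STEP = dt.timedelta(hours=1)      # assumption

def auto_adjust(interval, urls_seen, duplicates):
    """Nudge a feed's poll interval based on the fraction of URLs in the
    last fetch that we had already seen."""
    if urls_seen == 0:
        return interval
    dup_fraction = duplicates / urls_seen
    if dup_fraction < 0.1:
        interval -= ADJUST_STEP  # mostly new: we may be missing stories
    elif dup_fraction > 0.9:
        interval += ADJUST_STEP  # mostly duplicates: we can poll less often
    return max(MIN_INTERVAL, min(interval, MAX_INTERVAL))
```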