mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

Prevent re-queueing of a feed already queued #9

Closed: rahulbot closed this issue 5 months ago

rahulbot commented 2 years ago

We need some system to make sure this doesn't queue a feed multiple times. Perhaps this could just check the fetch_events to see if the latest item for a feed is a queued event?
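
For illustration, a minimal sketch of that check, assuming a SQLAlchemy session and a hypothetical FetchEvent model (the model, column, and event names here are illustrative, not the actual schema):

```python
from sqlalchemy import Column, DateTime, Integer, String, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class FetchEvent(Base):
    """Hypothetical stand-in for the real fetch_events model."""
    __tablename__ = "fetch_events"
    id = Column(Integer, primary_key=True)
    feed_id = Column(Integer, index=True)
    event = Column(String)        # e.g. "queued", "fetch_succeeded", ...
    created_at = Column(DateTime)

def is_feed_queued(session, feed_id: int) -> bool:
    """True if the newest fetch_events row for this feed is a 'queued' event."""
    latest = session.execute(
        select(FetchEvent.event)
        .where(FetchEvent.feed_id == feed_id)
        .order_by(FetchEvent.created_at.desc())
        .limit(1)
    ).scalar_one_or_none()
    return latest == "queued"
```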

philbudne commented 1 year ago

Current thoughts:

In the rss-fetcher's active feeds table, add a "queued" column (values t/f), and:

  1. reset all table entries to queued = 'f' on (re)start
  2. set table entries to queued = 't' when queued to celery
  3. set queued = 'f' when the fetch attempt finishes.

A query for queued = 't' AND last_fetch_attempt < a_while_ago could be used to find "stuck" entries (and alert someone to investigate why that happened).
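
A rough sketch of that lifecycle, assuming a SQLAlchemy session and a hypothetical Feed model with queued and last_fetch_attempt columns (illustrative names, not the actual schema):

```python
from datetime import datetime, timedelta
from sqlalchemy import Boolean, Column, DateTime, Integer, select, update
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Feed(Base):
    """Hypothetical stand-in for the real feeds model."""
    __tablename__ = "feeds"
    id = Column(Integer, primary_key=True)
    queued = Column(Boolean, default=False)
    last_fetch_attempt = Column(DateTime)

STUCK_AFTER = timedelta(hours=1)  # illustrative threshold for "a_while_ago"

def reset_queued_flags(session):
    # 1. on (re)start nothing can actually be on the queue, so clear every flag
    session.execute(update(Feed).values(queued=False))
    session.commit()

def set_queued(session, feed_id, queued: bool):
    # 2. queued=True when handed to celery; 3. queued=False when the fetch finishes
    session.execute(update(Feed).where(Feed.id == feed_id).values(queued=queued))
    session.commit()

def stuck_feeds(session):
    # queued = 't' AND last_fetch_attempt < a_while_ago  ->  "stuck" entries
    cutoff = datetime.utcnow() - STUCK_AFTER
    return session.execute(
        select(Feed).where(Feed.queued.is_(True),
                           Feed.last_fetch_attempt < cutoff)
    ).scalars().all()
```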

philbudne commented 1 year ago

And to eliminate duplicate feeds, we could:

  1. Have two different feeds tables, one kept in sync with UI backend feeds table, with one additional column "cfurl"
  2. "cfurl" is the feed URL w/ http[s]: removed, all lower case, and any query parameters in sorted order
  3. use cfurl as the primary key to a fetcher-local "active feeds" table, entries added/removed (by DB trigger?) when the count of active entries goes from zero to one, or one to zero.
  4. keep all fetcher-related data in the "active feeds" table; backup-rss-fetcher Feed table columns not needed there: name, mc_feeds_id, mc_media_id.
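
For step 2, a minimal sketch of the canonicalization (a hypothetical helper, not code from the repo):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

def cfurl(feed_url: str) -> str:
    """Canonical feed URL: http[s]: scheme dropped, everything lower-cased,
    query parameters re-serialized in sorted order."""
    parts = urlsplit(feed_url.strip().lower())
    query = urlencode(sorted(parse_qsl(parts.query)))
    return parts.netloc + parts.path + ("?" + query if query else "")

# Both of these canonicalize to "example.com/feed?a=1&b=2":
#   cfurl("https://Example.com/feed?b=2&a=1")
#   cfurl("http://example.com/feed?a=1&b=2")
```
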
rahulbot commented 1 year ago

The idea of adding a queued boolean for each feed table row to prevent duplicate fetch attempts is smart. However, I still feel funny about solutions that create state management in both a database and a queuing system. I don't have a good answer for ways around this, but it always makes me uneasy. But we can put that aside because this makes sense.

The proposed solution for dealing with duplicate feeds seems overly complicated to me for a problem that might not matter. Feed URLs come from two places*: (1) our system discovers them, or (2) a user enters them. The first is far more common and can easily ignore a feed URL that already exists. If a user enters an existing feed URL we can show an alert and prevent it, or allow them to override if that ever becomes useful (for some reason I can't think of). Why does the potential for a small number of duplicate RSS feeds merit that amount of engineering and maintenance effort?

Far more likely is the scenario of a source that has the same feed served at two different URLs, because news CMSs do that sort of stuff. But we can't really do anything about that, so I'm not putting it on our "problems needing solving" list.

* I understand that the media merge is another source for feeds, but that is a special case of 1 in my brain - because we can enforce feed uniqueness as part of the media merge

philbudne commented 1 year ago

> The idea of adding a queued boolean for each feed table row to prevent duplicate fetch attempts is smart. However, I still feel funny about solutions that create state management in both a database and a queuing system. I don't have a good answer for ways around this, but it always makes me uneasy. But we can put that aside because this makes sense.

I understand the unease. If nothing else it's more bookkeeping that needs to be done right in any number of places. More tightly coupled management of the workers (e.g. as local subprocesses that return status information used to update the database) might make this less error-prone.

> Why does the potential for a small number of duplicate RSS feeds merit that amount of engineering and maintenance effort?

Did my exposition of the current state of the database answer this?

You had been disinclined to model the source/feed relationship as many-to-many in the UI, and my fear is that, left unchecked, the same mess could redevelop even if the current situation were remedied.

My preliminary investigation while coding up the media merge program suggested that many feeds will still be duplicated across different sources even after the merge.

rahulbot commented 1 year ago

Yes, that data showing roughly 30% of "active" RSS feeds are duplicates is very revealing. We do want to avoid creating a similar mess again, so should we keep source-to-feed as one-to-many, but enforce it at the UI level rather than as a database constraint? Doing it in the UI allows future exceptions to be made as needed via an admin override.

philbudne commented 5 months ago

Old issue. rss-fetcher no longer uses celery or rabbitmq.

The main program, scripts/fetcher.py, runs fetchers in directly forked sub-processes (connected via pipes) using "fetcher.direct": it subclasses fetcher.direct.Worker into FetcherWorker, which calls "feed_worker" in the sub-process.

The term "direct" was chosen to connote "direct drive" sub-processes as opposed to "loosely coupled" processes connected via queues.

scripts/fetcher.py uses HeadHunter.refill to query for candidates to be fetched, ordered by the "next_fetch_attempt" (time) column.
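
Presumably something along these lines (an illustrative query, not the actual HeadHunter code; the Feed model and its "active" column are hypothetical stand-ins):

```python
from datetime import datetime
from sqlalchemy import Boolean, Column, DateTime, Integer, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Feed(Base):
    """Hypothetical stand-in for the real feeds model."""
    __tablename__ = "feeds"
    id = Column(Integer, primary_key=True)
    active = Column(Boolean, default=True)
    next_fetch_attempt = Column(DateTime)

def refill_candidates(session, limit: int = 100):
    """Illustrative refill query: active feeds whose next_fetch_attempt
    time has arrived, oldest first."""
    return session.execute(
        select(Feed)
        .where(Feed.active.is_(True),
               Feed.next_fetch_attempt <= datetime.utcnow())
        .order_by(Feed.next_fetch_attempt)
        .limit(limit)
    ).scalars().all()
```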

It looks like the "queued" column is checked in queries, but is no longer set to True! (strictly vestigial)