Closed: rahulbot closed this issue 5 months ago.
Current thoughts:
- In the rss-fetcher's active feeds table, have a "queued" column (values t/f), and
- A query for `queued = 't' AND last_fetch_attempt < a_while_ago` could be used to find "stuck" entries (and alert us to look into why this happened).
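A minimal sketch of that stuck-entry query, using an in-memory sqlite3 database; the table layout and the six-hour threshold here are illustrative assumptions, not the real schema:

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema mirroring the proposed feeds table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feeds (id INTEGER PRIMARY KEY, queued TEXT, last_fetch_attempt TEXT)"
)

now = datetime(2023, 1, 10, 12, 0, 0)
a_while_ago = now - timedelta(hours=6)  # arbitrary "too long" cutoff

conn.executemany(
    "INSERT INTO feeds VALUES (?, ?, ?)",
    [
        (1, "t", (now - timedelta(hours=12)).isoformat()),   # queued long ago -> stuck
        (2, "t", (now - timedelta(minutes=5)).isoformat()),  # recently queued -> fine
        (3, "f", (now - timedelta(hours=12)).isoformat()),   # not queued -> fine
    ],
)

# Find entries that have sat in the "queued" state for too long.
stuck = conn.execute(
    "SELECT id FROM feeds WHERE queued = 't' AND last_fetch_attempt < ?",
    (a_while_ago.isoformat(),),
).fetchall()
print(stuck)  # [(1,)]
```

ISO-8601 timestamps compare correctly as strings, which keeps the sketch simple; a real Postgres schema would use a timestamp column instead.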
And to eliminate duplicate feeds, we could:
The idea of adding a `queued` boolean for each feed table row to prevent duplicate fetch attempts is smart. However, I still feel funny about solutions that create state management in both a database and a queuing system. I don't have a good answer for ways around this, but it always makes me uneasy. But we can put that aside because this makes sense.
The proposed solution for dealing with duplicate feeds seems overly complicated to me for a problem that might not matter. Feed URLs come from two places*: (1) our system discovers them, or (2) a user enters them. The first is far more common, and we can easily ignore a discovered feed URL if it already exists. If a user enters an existing feed URL we can show an alert and prevent it, or allow them to override if that ever becomes useful (for some reason I can't think of). Why does the potential for a small number of duplicate RSS feeds merit that amount of engineering and maintenance effort?
Far more likely is the scenario of a source that serves the same feed at two different URLs, because news CMSs do that sort of thing. But we can't really do anything about that, so I'm not putting it on our "problems needing solving" list.
* I understand that the media merge is another source of feeds, but in my head that is a special case of (1), because we can enforce feed uniqueness as part of the media merge.
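The insert-time check argued for above can be very small. A sketch under stated assumptions: an in-memory stand-in for the feeds table and a hypothetical `add_feed` helper, neither of which exists in the real codebase:

```python
# Hypothetical in-memory stand-in for the existing feeds table.
existing_feed_urls = {"https://example.com/rss"}

def add_feed(url: str, user_entered: bool = False) -> bool:
    """Return True if the feed was added, False if it was a duplicate."""
    if url in existing_feed_urls:
        if user_entered:
            # Case (2): in a real UI this would surface an alert, not print.
            print(f"feed already exists: {url}")
        # Case (1): discovered duplicates are silently ignored.
        return False
    existing_feed_urls.add(url)
    return True

assert add_feed("https://example.com/rss") is False                      # discovered duplicate
assert add_feed("https://example.com/atom") is True                      # new feed
assert add_feed("https://example.com/rss", user_entered=True) is False   # user duplicate, alerted
```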
> The idea of adding a `queued` boolean for each feed table row to prevent duplicate fetch attempts is smart. However, I still feel funny about solutions that create state management in both a database and a queuing system. I don't have a good answer for ways around this, but it always makes me uneasy. But we can put that aside because this makes sense.
I understand the unease. If nothing else it's more bookkeeping that needs to be done right in any number of places. More tightly coupled management of the workers (e.g. as local subprocesses that return status information that is used to update the database) might make this less error-prone.
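Purely as an illustration of that tighter coupling (not the actual rss-fetcher code): a directly forked worker reports its status back over a pipe, so the parent process alone updates the database and the bookkeeping lives in one place. This is a Unix-only sketch with invented names:

```python
import json
import os

def run_one(feed_id):
    """Fork a worker; read its status back over a pipe (Unix-only sketch)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: pretend to fetch the feed, report status, and exit.
        os.close(r)
        status = {"feed_id": feed_id, "ok": True}
        os.write(w, json.dumps(status).encode())
        os.close(w)
        os._exit(0)
    # Parent: collect the child's status and reap it.
    os.close(w)
    data = os.read(r, 4096)
    os.close(r)
    os.waitpid(pid, 0)
    # The parent alone would now clear the feed's "queued" flag in the
    # database, so the two sources of state can't silently diverge.
    return json.loads(data)

print(run_one(42))  # {'feed_id': 42, 'ok': True}
```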
> Why does the potential for a small number of duplicate RSS feeds merit that amount of engineering and maintenance effort?
Did my exposition of the current state of the database answer this?
You had been disinclined to model the source/feed relationship as many-to-many in the UI, and my fear is that left unchecked the same mess could redevelop even if the current situation was remedied.
My preliminary investigation while coding up the media merge program was that many feeds will still be duplicated across different sources even after the merge.
Yes, that data about 30%ish of "active" RSS feeds being duplicates is very revealing. We do want to avoid creating a similar mess again, so should we keep media-to-source as a one-to-many, but enforced at the UI level rather than as a database constraint? Doing it in the UI allows for future exceptions to be made as needed via an admin override.
Old issue. rss-fetcher no longer uses Celery or RabbitMQ.
The main program, scripts/fetcher.py, runs fetchers in directly forked sub-processes (connected via pipes) using "fetcher.direct": fetcher.direct.Worker is subclassed into FetcherWorker, which calls "feed_worker" in the sub-process.
The term "direct" was chosen to connote "direct drive" sub-processes as opposed to "loosely coupled" processes connected via queues.
scripts/fetcher.py uses HeadHunter.refill to query for candidates to be fetched, ordered by "next_fetch_attempt" (time) column.
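A guess at the shape of that refill query. Only the `next_fetch_attempt` column name comes from the description above; the table layout, `active` flag, and batch size are invented for the sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feeds (id INTEGER PRIMARY KEY, active INTEGER, next_fetch_attempt TEXT)"
)
conn.executemany(
    "INSERT INTO feeds VALUES (?, ?, ?)",
    [
        (1, 1, "2023-01-01T02:00:00"),
        (2, 1, "2023-01-01T01:00:00"),  # due soonest
        (3, 0, "2023-01-01T00:00:00"),  # inactive: skipped
    ],
)

# Refill: candidates ordered by when they are next due, oldest first.
candidates = conn.execute(
    "SELECT id FROM feeds WHERE active = 1 ORDER BY next_fetch_attempt LIMIT 2"
).fetchall()
print(candidates)  # [(2,), (1,)]
```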
It looks like the "queued" column is checked in queries, but is no longer set to True! (strictly vestigial)
We need some system to make sure this doesn't queue a feed multiple times. Perhaps this could just check the `fetch_events` table to see if the latest item for a feed is a "queued" event?
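That check might look something like the following sketch; the `fetch_events` layout and the event names used here are guesses, not the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical event-log table: one row per fetch lifecycle event.
conn.execute(
    "CREATE TABLE fetch_events (id INTEGER PRIMARY KEY, feed_id INTEGER, event TEXT)"
)
conn.executemany(
    "INSERT INTO fetch_events (feed_id, event) VALUES (?, ?)",
    [
        (1, "queued"),
        (1, "fetch_succeeded"),  # feed 1's latest event is not "queued"
        (2, "queued"),           # feed 2 is still waiting to be fetched
    ],
)

def latest_event_is_queued(feed_id: int) -> bool:
    """True if the most recent fetch_events row for this feed is a 'queued' event."""
    row = conn.execute(
        "SELECT event FROM fetch_events WHERE feed_id = ? ORDER BY id DESC LIMIT 1",
        (feed_id,),
    ).fetchone()
    return row is not None and row[0] == "queued"

print(latest_event_is_queued(1), latest_event_is_queued(2))  # False True
```

Skipping any feed for which this returns True would prevent double-queueing without reviving the vestigial "queued" column.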