mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0

would generating files 2-3 times a day shorten time-to-search-availability for users? #33

Closed rahulbot closed 2 months ago

rahulbot commented 5 months ago

One of our goals is to produce "timely" data for researchers studying unfolding events. Right now stories become available for search around 1.5-2 days after publication. I think our primary bottlenecks are (a) rss-fetcher generating a synthetic file for the importer once a day and (b) story-indexer taking around 8-12 hours on average to process that file. (b) is already being worked on, but for (a) I'm wondering how complex it would be to have rss-fetcher generate multiple files a day to speed things up? We don't need a "realtime" system, but we also want to make any small changes we can to make the "unfolding events" use case work better. This is in the bucket of future work, but I wanted to capture the idea to ponder.

philbudne commented 5 months ago

I'd been thinking that rather than generating multiple files, the rss-fetcher could have a process (run via cron at whatever interval is desired) that wrote batches of stories as newline-separated JSON to a RabbitMQ queue. This would require a database "property" to keep track of the last story queued.

A story-indexer worker would wait on the queue, pick up the JSON records, and queue them as Story objects for input to the queue-based fetcher.
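
A minimal sketch of the producer side of this idea, not the actual rss-fetcher code. Everything named here is an assumption for illustration: the `stories` and `properties` table layouts, the `last_story_queued` property name, and the `publish` callable, which stands in for whatever would send a message body to RabbitMQ. SQLite is used only so the sketch runs on its own.

```python
import json
import sqlite3
from typing import Callable

BATCH_SIZE = 2  # hypothetical; kept tiny so the demo produces several batches


def queue_new_stories(conn: sqlite3.Connection,
                      publish: Callable[[str], None],
                      batch_size: int = BATCH_SIZE) -> int:
    """Publish every story newer than the stored marker; return the count queued."""
    row = conn.execute(
        "SELECT value FROM properties WHERE name = 'last_story_queued'"
    ).fetchone()
    last_id = int(row[0]) if row else 0  # no marker yet: start from the beginning
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, url, title FROM stories WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        # One message per batch: newline-separated JSON, one story per line.
        body = "\n".join(
            json.dumps({"id": r[0], "url": r[1], "title": r[2]}) for r in rows
        )
        publish(body)
        # Advance the marker only after the batch has been handed off.
        last_id = rows[-1][0]
        conn.execute(
            "INSERT OR REPLACE INTO properties (name, value) "
            "VALUES ('last_story_queued', ?)",
            (str(last_id),),
        )
        conn.commit()
        total += len(rows)
    return total


def demo():
    """Run the job twice against an in-memory database with five stories."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stories (id INTEGER PRIMARY KEY, url TEXT, title TEXT)")
    conn.execute("CREATE TABLE properties (name TEXT PRIMARY KEY, value TEXT)")
    conn.executemany(
        "INSERT INTO stories (url, title) VALUES (?, ?)",
        [(f"https://example.com/{i}", f"Story {i}") for i in range(1, 6)],
    )
    messages = []  # stands in for the queue
    first = queue_new_stories(conn, messages.append)
    second = queue_new_stories(conn, messages.append)
    return first, second, messages
```

Because the marker is stored after each batch is published, a second run with no new stories queues nothing. If the process died between publishing and storing the marker, that batch would be sent again on the next run, so the consumer would need to tolerate duplicates.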

rahulbot commented 5 months ago

Interesting. I guess that would more closely couple the story-indexer and rss-fetcher. So a job could potentially run every hour to push a batch of new story info to a queue, and then a task in story-indexer could monitor that queue to pull off data and turn it into new Story objects for further propagation into its pipeline. That feels fine with our current approach to integration, positioning rss-fetcher almost as "stage 0" of the story-indexer pipeline.

In that model we'd still want the synthetic daily RSS file to be generated and published for transparency, archival reasons, and to support other integrations (like the Wayback Machine).

rahulbot commented 4 months ago

Moved up in priority, since it is tied to some deliverables and descriptions in the new grant.

philbudne commented 4 months ago

My latest thinking is that if the rss-fetcher had an /api/stories endpoint that returned complete rows from the stories table, the story-indexer could have a periodic process that queried for batches of new stories using WHERE id > x (where x is the id of the last row returned), which would be quick and unambiguous.
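
A rough sketch of that `WHERE id > x` pattern, under stated assumptions: `stories_after` stands in for the query behind a hypothetical /api/stories handler, `poll_once` stands in for the periodic story-indexer process, and the two-column `stories` table is a simplification. SQLite is used only to keep the example self-contained.

```python
import sqlite3


def stories_after(conn: sqlite3.Connection, last_id: int, limit: int = 1000):
    """Return up to `limit` stories with id greater than last_id, in id order."""
    rows = conn.execute(
        "SELECT id, url FROM stories WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit),
    ).fetchall()
    return [{"id": r[0], "url": r[1]} for r in rows]


def poll_once(conn: sqlite3.Connection, last_id: int, limit: int = 1000):
    """Drain everything newer than last_id; return (stories, new_last_id)."""
    collected = []
    while True:
        batch = stories_after(conn, last_id, limit)
        if not batch:
            # Nothing newer: the caller persists last_id for the next run.
            return collected, last_id
        collected.extend(batch)
        last_id = batch[-1]["id"]  # x becomes the id of the last row returned
```

Since ids only increase and each request resumes strictly after the last id seen, no row is returned twice and none is skipped, provided rows are committed in id order.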

philbudne commented 2 months ago

Replaced by https://github.com/mediacloud/story-indexer/issues/274