Closed rahulbot closed 2 months ago
I'd been thinking that rather than generating multiple files, the rss-fetcher could have a process (run via cron at whatever interval is desired) that wrote batches of stories as newline-separated JSON to a RabbitMQ queue. This would require a database "property" to keep track of the last story queued.
A story-indexer worker would wait on the queue, pick up the JSON batches, and queue them as Story objects for input to the queue-based fetcher.
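The batching and checkpointing half of that producer could look something like the sketch below. This is illustrative only, not actual rss-fetcher code: the function and parameter names are assumptions, and `publish` stands in for a RabbitMQ publish call (e.g. pika's `basic_publish`), with the returned id meant to be persisted as the database "property" mentioned above.

```python
import json

def batch_stories_as_ndjson(stories, last_id, publish, batch_size=1000):
    """Group stories with id > last_id into newline-separated JSON batches,
    hand each batch to publish() (e.g. a RabbitMQ basic_publish wrapper),
    and return the id of the last story queued so it can be saved as the
    "last story queued" database property."""
    pending = [s for s in stories if s["id"] > last_id]
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        # one message per batch: one JSON-serialized story per line
        publish("\n".join(json.dumps(s) for s in batch))
        last_id = batch[-1]["id"]
    return last_id

# Usage with an in-memory stand-in for the queue:
sent = []
stories = [{"id": 1, "url": "a"}, {"id": 2, "url": "b"}, {"id": 3, "url": "c"}]
new_last = batch_stories_as_ndjson(stories, last_id=1,
                                   publish=sent.append, batch_size=2)
# one batch published, containing stories 2 and 3; new_last is 3
```

On restart the process would read the saved property back and resume from there, so no story is queued twice even if a cron run is skipped or fails mid-way.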
Interesting. I guess that would more closely couple the story-indexer and rss-fetcher. So a job could potentially run every hour to push a batch of new story info to a queue, and then a task in story-indexer could monitor that queue, pull off the data, and turn it into new Story objects for further propagation into its pipeline. That fits well with our current approach to integration, positioning rss-fetcher almost as "stage 0" of the story-indexer pipeline.
In that model we'd still want the synthetic daily RSS file to be generated and published for transparency, archival reasons, and to support other integrations (like wayback machine).
Moved up in priority, since it is tied to some deliverables and descriptions in the new grant.
My latest thinking is that if the rss-fetcher had an /api/stories endpoint that returned complete rows from the stories table, the story-indexer could run a periodic process that queried for batches of new stories using WHERE id > x (where x is the id of the last row returned). That would be quick and unambiguous.
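The polling loop on the story-indexer side might look like the sketch below. The `fetch_page` callable stands in for an HTTP GET against the proposed /api/stories endpoint; its parameter names and paging behavior are assumptions, not an existing API.

```python
def poll_new_stories(fetch_page, last_id, limit=100):
    """Pull batches of new rows until caught up.

    fetch_page(after_id, limit) stands in for a GET against the proposed
    /api/stories endpoint, returning up to `limit` rows from the stories
    table with id > after_id, ordered by id.
    """
    stories = []
    while True:
        page = fetch_page(last_id, limit)
        if not page:
            break  # caught up: no rows newer than last_id
        stories.extend(page)
        last_id = page[-1]["id"]  # checkpoint: ids are monotonically increasing
    return stories, last_id

# Usage with an in-memory stand-in for the HTTP call:
rows = [{"id": i, "url": f"https://example.com/{i}"} for i in range(1, 6)]
fake_fetch = lambda after_id, limit: [r for r in rows if r["id"] > after_id][:limit]
new_stories, checkpoint = poll_new_stories(fake_fetch, last_id=2, limit=2)
# new_stories holds ids 3, 4, 5; checkpoint is 5
```

Because ids only increase, persisting `checkpoint` between runs makes the poll idempotent: re-running after a crash re-fetches at most the rows that weren't checkpointed, and never skips any.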
One of our goals is to produce "timely" data for researchers studying unfolding events. Right now stories become available for search around 1.5-2 days after publication. I think our primary bottlenecks are (a) rss-fetcher generating the synthetic file for the importer once a day and (b) story-indexer taking around 8-12 hours on average to process that file. (b) is already being worked on, but for (a) I'm wondering how complex it would be to have rss-fetcher generate multiple files a day to speed things up? We don't need a "realtime" system, but we do want to make any small changes we can to help the "unfolding events" use case work better. This is in the bucket of future work, but I wanted to capture the idea to ponder.
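For (a), the simplest version might just be scheduling the existing generation step more often. A hypothetical crontab entry (the script path is illustrative, not the actual rss-fetcher command) producing a file every six hours instead of daily:

```shell
# Hypothetical crontab entry: generate a synthetic stories file every 6 hours
# (path is illustrative, not the real rss-fetcher entry point)
0 */6 * * * /path/to/rss-fetcher/generate-synthetic-file
```

Each run would only need to cover stories fetched since the previous run, which also keeps the individual files smaller for story-indexer to chew through.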