mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Generate & run RSS files for two July UMass power outages #323

Closed philbudne closed 3 months ago

philbudne commented 3 months ago

I ran rss-fetcher on AWS EC2 instances during two power outages, and I have a script to isolate just the stories not found when the lights came back on.

  1. generate two RSS files
  2. run them (on posey) when 2022 rss fetch is done
  3. add rss files to public bucket

philbudne commented 3 months ago

Generated an RSS file with 324868 unique lines from the input files (it may still contain duplicate URLs if a story URL came in from different feeds, or with a different published date in the RSS file), and queued it to the rss-indexer stack on posey (which is still chewing on the dregs from the May 2021 RSS files).

Unfortunately, I didn't think to randomize the entire file, which was sorted by URL, and the queuer only randomizes batches of 2500 stories, so the stream is very "chunky" with regard to domains. This exposes a weakness of tqfetcher:

2024-08-11 20:02:00,060 3a441b4e6402 fetcher[Main] INFO: 8 active, 8 sites, 0 ready, 120 delayed, 30 recent, lavg 1.38

The once-a-minute report shows that only eight threads are active (of 64 available), even though the fetcher has recently seen messages for 30 different domains. Most of the 128 messages it's allowed to keep on hand are "delayed", waiting to be issued (to keep space between requests), likely for a small number of domains... The processing rate seems to be about 5 stories/second, and the file would be a "small day" for production.
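
To illustrate why the URL-sorted input stays "chunky", here is a minimal sketch, not the actual queuer code. It assumes that "randomizes batches" means shuffling the stories inside each consecutive batch of 2500, and it uses synthetic data: 40 hypothetical `siteNN.example` domains with 1000 URLs each. The helper names (`shuffle_in_batches`, `distinct_domains_per_window`, `demo`) are made up for this example. It compares how many distinct domains land in each window of 128 URLs (the number of messages the fetcher keeps on hand) for per-batch shuffling versus shuffling the whole file.

```python
import random
from urllib.parse import urlsplit

BATCH_SIZE = 2500  # queuer batch size mentioned above
WINDOW = 128       # messages the fetcher is allowed to keep on hand


def domain(url: str) -> str:
    """Return the host part of a URL (empty string if there is none)."""
    return urlsplit(url).hostname or ""


def shuffle_in_batches(urls, batch_size, rng):
    """ASSUMED queuer behavior: shuffle each consecutive batch on its own."""
    out = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        rng.shuffle(batch)
        out.extend(batch)
    return out


def distinct_domains_per_window(urls, window):
    """Average number of distinct domains per consecutive window of URLs."""
    counts = [
        len({domain(u) for u in urls[i:i + window]})
        for i in range(0, len(urls), window)
    ]
    return sum(counts) / len(counts)


def demo(seed=0):
    rng = random.Random(seed)
    # Synthetic, hypothetical input: sorted by URL, so grouped by domain.
    urls = sorted(
        f"https://site{d:02d}.example/story/{n:04d}"
        for d in range(40)
        for n in range(1000)
    )
    batched = shuffle_in_batches(list(urls), BATCH_SIZE, rng)
    whole = list(urls)
    rng.shuffle(whole)  # shuffle the entire file before queuing
    return (
        distinct_domains_per_window(batched, WINDOW),
        distinct_domains_per_window(whole, WINDOW),
    )


if __name__ == "__main__":
    batched_avg, whole_avg = demo()
    print(f"per-batch shuffle: {batched_avg:.1f} domains per {WINDOW} URLs")
    print(f"whole-file shuffle: {whole_avg:.1f} domains per {WINDOW} URLs")
```

With sorted input, each batch covers only a few domains, so shuffling inside the batch cannot give the fetcher more sites to work on in parallel. Shuffling the whole file before queuing spreads every domain across the entire stream.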