Closed by philbudne 3 months ago
Generated an RSS file with 324868 unique lines from the input files (it may still contain duplicate URLs if the same story URL came in from different feeds, or with a different published date), and queued it to the rss-indexer stack on posey (which is still chewing on the dregs from the May 2021 RSS files).
Unfortunately, I didn't think to randomize the entire file, which was sorted by URL, and the queuer only randomizes within batches of 2500 stories, so the stream is very "chunky" with regard to domains. This exposes a weakness of tqfetcher:
2024-08-11 20:02:00,060 3a441b4e6402 fetcher[Main] INFO: 8 active, 8 sites, 0 ready, 120 delayed, 30 recent, lavg 1.38
The once-a-minute report shows only eight active threads (of 64 available), even though the fetcher has recently seen messages for 30 different domains. Most of the 128 messages it's allowed to keep on hand are "delayed", waiting to be issued (to keep space between requests), likely for a small number of domains... The processing rate seems to be about 5 stories/second, and the file would be a "small day" for production.
I ran rss-fetcher on AWS EC2 instances during two power outages, and I have a script to isolate just the stories not found when the lights came back on.