mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

fetcher issues from a power outage #319

Open philbudne opened 4 months ago

philbudne commented 4 months ago

To avoid queuing excessive work (and possibly exhausting available disk when running a stack on a system without a large disk array), message producers check the length of each possible (fanout) destination -in queue.
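As a rough illustration of that check (not the actual implementation), here is a minimal sketch assuming a RabbitMQ broker reached through pika; the threshold `MAX_QUEUE_LEN` and the helper names are made up for this example:

```python
import time
import pika

MAX_QUEUE_LEN = 100_000   # hypothetical back-pressure threshold


def queue_length(channel: pika.channel.Channel, queue: str) -> int:
    """Return the current message count for a queue.
    A passive declare only inspects the queue; it never creates it."""
    frame = channel.queue_declare(queue=queue, passive=True)
    return frame.method.message_count


def wait_for_room(channel: pika.channel.Channel, output_queues: list[str]) -> None:
    """Block until every (fanout) destination queue has room,
    roughly what a producer's output-queue check does."""
    while True:
        longest = max(queue_length(channel, q) for q in output_queues)
        if longest < MAX_QUEUE_LEN:
            return
        time.sleep(60)  # back off and re-check
```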

When just such a stack (processing old RSS files) came back up after a power outage, AND the Internet was not reachable, the entire contents of the fetcher-in queue quickly ended up in the -retry queue (for retry in an hour), and ANOTHER batch was loaded into the (now empty) -in queue. Had the Internet outage lasted longer, the queues could have grown to REALLY unreasonable lengths.

If Producer.check_output_queues considered the sum of the fetcher-in and fetcher-retry queue lengths, this could be avoided. The downside is that items in the -retry queue won't go back to the -in queue for an hour, so there could be an idle hour before work restarts.
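A minimal sketch of that idea, reusing the hypothetical `queue_length` helper above; the `-in`/`-retry` naming convention is assumed here for illustration:

```python
def effective_backlog(channel, in_queue: str) -> int:
    """Count work parked in the matching -retry queue as part of the
    backlog, so an Internet outage can't drain -in and trigger loading
    another batch."""
    retry_queue = in_queue.replace("-in", "-retry")  # queue naming assumed
    return queue_length(channel, in_queue) + queue_length(channel, retry_queue)
```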

This only REALLY applies when old page content (URLs from old CSV or RSS files) is being fetched from the Internet. For "current day" workloads, an Internet outage means there won't be new work from the rss-fetcher, BUT if, for some reason, the fetcher (or the entire indexer stack) is not running, the problem WILL occur.

Another take on this would be to have the fetcher detect that ALL fetch requests (to all domains) are failing, and slow down request processing. The "to all domains" part is the tricky bit: when the queue contains JUST requests that are being retried because their server can't be reached, that could look like "the Internet is down". I suppose a test for "is the Internet reachable" could include trying to fetch some well-known pages, or connecting to some well-known IPs (e.g. 8.8.8.8).
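A sketch of such a probe, with the probe addresses and timeout chosen purely for illustration: rather than ICMP ping (which usually needs extra privileges), it just tries to open a TCP connection to a couple of well-known anycast DNS resolvers.

```python
import socket

# Hypothetical reachability probe: if we cannot open a TCP connection to
# any of a few well-known anycast resolvers, assume the Internet (or our
# uplink) is down and slow request processing instead of recycling the
# whole -in queue into -retry.
PROBE_ADDRS = [("8.8.8.8", 53), ("1.1.1.1", 53)]


def internet_reachable(timeout: float = 3.0) -> bool:
    for host, port in PROBE_ADDRS:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            continue
    return False
```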