mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

threaded queue fetcher limitations #281

Open philbudne opened 6 months ago

philbudne commented 6 months ago

A brain dump of where I left off.

tqfetcher behaves reasonably (fetching 10-15 stories/second) given a well-mixed input queue. But if the workload is not well mixed (e.g., historical URLs from a single site dropped into the input queue all at once), thruput can drop to one story every 5 seconds.

A RabbitMQ consumer can specify the number of unacknowledged messages that may be delivered to it at once; once that many are outstanding, no more are delivered until one of the pending messages is acknowledged (and removed from the input queue). This number is called the prefetch count.
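For reference, a minimal sketch (not tqfetcher code) of how a Pika consumer sets the prefetch count; the count shown is illustrative:

```python
import pika

# Connect and open a channel (connection parameters are illustrative).
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Ask the broker to deliver at most 8 unacknowledged messages at a time;
# no more are delivered until one of the pending messages is acked.
channel.basic_qos(prefetch_count=8)

def on_message(ch, method, properties, body):
    # ... process the story ...
    ch.basic_ack(delivery_tag=method.delivery_tag)  # frees a prefetch slot

channel.basic_consume(queue="fetcher-input", on_message_callback=on_message)
channel.start_consuming()
```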

tqfetcher keeps a moving average of past request times, but does not schedule fetches based on the available threads; rather, it optimistically delays requests based on the averaged past request time and the time the last request to the site was sent. When no past estimate is available (the average is zero), requests are not delayed.

After the delay (or no delay), stories are placed into a work queue for distribution to worker threads. If the work queue accumulates ANY significant number of ready requests, the ability to control the delay between requests to a site is SEVERELY compromised. Because of this, the prefetch is kept very low (two messages per available worker thread).
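A minimal sketch of the optimistic per-site delay described above (the names, the averaging constant, and the structure are illustrative, not the actual tqfetcher code):

```python
ALPHA = 0.5         # moving-average weight (illustrative)
MIN_INTERVAL = 5.0  # current minimum delay between starting requests to one site

class SiteState:
    def __init__(self):
        self.avg_request_time = 0.0  # moving average of past request durations
        self.last_issue_time = 0.0   # when the last request to this site was started

    def delay_before_next_request(self, now):
        """Optimistic delay based on past request time and the time of the
        last request sent.  With no estimate (average is zero), don't delay."""
        if self.avg_request_time == 0.0:
            return 0.0
        interval = max(self.avg_request_time, MIN_INTERVAL)
        return max(0.0, (self.last_issue_time + interval) - now)

    def record_request(self, started, duration):
        self.last_issue_time = started
        if self.avg_request_time == 0.0:
            self.avg_request_time = duration
        else:
            self.avg_request_time = ALPHA * duration + (1 - ALPHA) * self.avg_request_time
```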

Thoughts for improvements:

Right now, tqfetcher's MINIMUM delay between starting requests to any one site is 5 seconds. THIS IS EXCEEDINGLY CONSERVATIVE! Scrapy may not have had ANY minimum delay!!! HOWEVER: very low delay numbers would allow VERY large numbers of requests to a site to be delayed (eating up the prefetch). A fix for this might be:

Limit the number of stories that can be delayed for a particular site to a "fair share" of the prefetch (perhaps prefetch / active_sites or prefetch / (active_sites + 1)), where active_sites is the number of sites successfully fetched from in the last minute (counted in the once-a-minute periodic task).

Enforce a delay between first requests to any previously uncontacted site? And/or check the work queue length before adding a request (more than one ready request per thread is excessive). A sketch of these ideas follows.
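A hypothetical sketch of the fair-share and work-queue-length checks (none of these names exist in tqfetcher; a request failing the check would presumably be shunted rather than delayed):

```python
def may_delay_request(site_delayed, active_sites, prefetch, work_queue_len, worker_threads):
    """Admission check run before delaying/queueing another request for a site.

    site_delayed:   requests already being delayed for this site
    active_sites:   sites successfully fetched from in the last minute
                    (counted in the once-a-minute periodic task)
    prefetch:       RabbitMQ prefetch count
    work_queue_len: ready requests currently sitting in the work queue
    worker_threads: number of available worker threads
    """
    # Fair share: one site may not tie up more than its share of the prefetch.
    fair_share = max(1, prefetch // (active_sites + 1))
    if site_delayed >= fair_share:
        return False

    # More than one ready request per thread is excessive; a backlog of ready
    # requests defeats per-site delay control.
    if work_queue_len >= worker_threads:
        return False

    return True
```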

The grandest idea/wish is to plan out all activity (within the next fast-delay-queue period) so that the work queue never gets large, and requests are delivered to the work queue at a measured rate that never exceeds available capacity. Initial requests to uncontacted sites should be treated as taking a long time (MAX_CONNECT + MAX_READ seconds?).

Consider whether it's possible to detect that ALL requests are failing due to an Internet outage. The worst case is that if the outage lasts more than 10 hours, stories get dumped into the fetcher-quar(antine) queue and have to be manually moved.

philbudne commented 6 months ago

(one of a no doubt continuing series of things I forgot):

Half-baked idea: if a large number of requests is seen for a particular site (i.e., a large number of requests couldn't be delayed in the last minute), consider lowering the minimum interval for that site (this requires "fair share" to keep it from hogging the prefetch).

This seems reasonable for sites that generate large numbers of URLs per day, but less reasonable if we want to backfill for a site and load the queue with old URLs.

There is code that attempts to back off when a 429 response is seen, but it's not well tested.
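That backoff code isn't reproduced here; the following is only a rough sketch of the general idea (requests-based fetch, hypothetical per-site interval table):

```python
import requests

BACKOFF_FACTOR = 2.0  # hypothetical: double the per-site interval on a 429
MAX_INTERVAL = 300.0  # hypothetical cap, in seconds

def fetch_with_backoff(url, site, intervals):
    """Fetch a URL; on HTTP 429, widen the per-site request interval."""
    resp = requests.get(url, timeout=(30, 30))  # (connect, read) timeouts, illustrative
    if resp.status_code == 429:
        current = intervals.get(site, 5.0)  # 5s: current minimum interval
        intervals[site] = min(current * BACKOFF_FACTOR, MAX_INTERVAL)
        # Honor Retry-After when the server supplies one (seconds form only).
        retry_after = resp.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            intervals[site] = max(intervals[site], float(retry_after))
        return None  # caller should requeue the story for a later retry
    return resp
```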

philbudne commented 5 months ago

Another issue/thought (which in theory applies to all queue workers, but in practice only really affects tqfetcher):

When tqfetcher sees that a site/domain is "fully booked" with fetches for the next two minutes, it shunts any additional input requests to the "fetcher-fast" queue, where they hang out for two minutes and are then appended to the regular "fetcher-input" queue.
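Whether -fast is implemented this way isn't shown here, but one common way to get "park for two minutes, then append to another queue" in RabbitMQ is a per-queue message TTL plus dead-lettering; a minimal Pika sketch:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Messages sit unconsumed in fetcher-fast for 2 minutes, then are dead-lettered
# through the default exchange with routing key "fetcher-input", i.e. appended
# to the end of the regular input queue.
channel.queue_declare(
    queue="fetcher-fast",
    durable=True,
    arguments={
        "x-message-ttl": 2 * 60 * 1000,        # 2 minutes, in milliseconds
        "x-dead-letter-exchange": "",          # default (direct) exchange
        "x-dead-letter-routing-key": "fetcher-input",
    },
)
```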

In normal operation (hourly batches from rss-queuer), the new requests are run thru in the first part of the hour, shunting unreachable and other "soft" errors to the -retry queue (after which they'll end up at the end of the -input queue).

Something I coded for, from the very start, is to allow any queue worker to take input from multiple queues (so that retries can be processed as soon as their delay period ends, rather than going to the back of the line). The sticky bit is that ISTR that a Pika/RabbitMQ channel can only get messages from one queue, so I made sure that worker objects and channels are not wired into the code as 1:1.
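For illustration only, one possible arrangement in Pika is a single channel with two consumers (whether that, or a separate channel per queue, fits the existing worker wiring is exactly the open question above; queue names and prefetch are illustrative):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_qos(prefetch_count=8)  # illustrative

def on_story(ch, method, properties, body):
    # ... fetch/process the story ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

# One worker drawing from both queues, so retries whose delay has expired
# compete with fresh input instead of going to the back of the line.
channel.basic_consume(queue="fetcher-input", on_message_callback=on_story)
channel.basic_consume(queue="fetcher-retry", on_message_callback=on_story)
channel.start_consuming()
```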

The place this MAY be visible right now: the fetches from old rss/csv files are taking over a day. My guess is that the first pass thru the queue takes less than a day, but that:

  1. running thru the retries (which may or may not be caused by the nature of old links) takes an hour or more
  2. the number of retries keeps the queue above 100K entries (the threshold below which another file of stories will be added to the queue; the limit exists to avoid treating RabbitMQ as a bottomless vessel, which is important on systems that don't have the queues persisted on a huge RAID volume)
  3. it takes a couple of hours for the queue (now containing "dregs") to go below 100K entries and get "fresh meat" added in
  4. I'm not sure lowering the threshold will help: it will allow more dregs to accumulate, more quickly (tho perhaps better mixed with the older retries).

The bottom line: consider handling -retry and -fast messages differently. Instead of having them appended to -input after their time in purgatory, put them into ANOTHER queue that's served with thruput/prefetch equal to (or higher than) the -input queue, so that they're retried "on time" rather than waiting for the new entries added to the -input queue (during their time in purgatory) to be drained before getting another chance...