mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Queue based fetcher #236

Closed philbudne closed 9 months ago

philbudne commented 9 months ago

Adds two new files, plus some minor infrastructure changes (logging and "fast requeue"), to put my crude queue-based fetcher "in tree" and available if/until needed.

The "theory of operation" is that once a story is queued, all state is maintained in the queue, and the fetcher runs continuously servicing the queue. If there is an outage, the queue will be loaded with all unfetched articles when the system is restarted.

All fetching work is done in threads (so regular blocking code/libraries can be used): the "scheduler" holds a single lock on all critical data, which is only required (see the sketch after this list):

  1. to "issue" (start) a story
  2. when the fetch attempt has completed
  3. for the main thread, once a minute to do cleanup & reporting.
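
A minimal sketch of that single-lock discipline (class and attribute names are hypothetical, and the real scheduler tracks more per-FQDN state, described below):

```python
import threading


class Scheduler:
    """Single lock guarding all critical data; held only briefly at the
    three points listed above (issue, completion, periodic cleanup)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.active = {}  # FQDN -> number of fetches currently in flight

    def issue(self, fqdn: str) -> bool:
        # 1. taken to "issue" (start) a story
        with self.lock:
            if self.active.get(fqdn, 0) >= 2:  # illustrative per-FQDN cap
                return False                   # caller requeues the story
            self.active[fqdn] = self.active.get(fqdn, 0) + 1
            return True

    def completed(self, fqdn: str) -> None:
        # 2. taken when the fetch attempt has completed
        with self.lock:
            self.active[fqdn] -= 1

    def periodic(self) -> None:
        # 3. taken by the main thread, once a minute, for cleanup & reporting
        with self.lock:
            for fqdn in [d for d, n in self.active.items() if n == 0]:
                del self.active[fqdn]
```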

Stats are reported on every story handled (connection error, HTTP error, etc.). If a fetch fails due to a "soft" error (DNS lookup error, connection failed, some HTTP 5xx codes), the story will be retried in an hour, up to 12 times (scrapy retries once, at the end of the run?).
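
Roughly, the soft/hard split and retry counting look like the sketch below (the function names, the exact set of retryable status codes, and the requeue hook are assumptions for illustration, not the in-tree code):

```python
import requests

RETRYABLE_STATUS = {500, 502, 503, 504}  # illustrative subset of 5xx codes
MAX_RETRIES = 12
RETRY_DELAY_SECONDS = 3600               # retry "soft" failures in an hour


def classify(url: str) -> str:
    """Classify a fetch attempt as ok / soft (retryable) / hard (give up)."""
    try:
        resp = requests.get(url, timeout=30)
    except requests.exceptions.ConnectionError:
        return "soft"        # DNS lookup failure, connection refused, ...
    except requests.exceptions.RequestException:
        return "hard"        # anything else: count it in stats and give up
    if resp.status_code in RETRYABLE_STATUS:
        return "soft"        # transient server-side error
    if resp.status_code >= 400:
        return "hard"        # other HTTP errors: not worth retrying
    return "ok"


def handle_failure(story: dict, kind: str, requeue_delayed) -> None:
    # retry soft failures up to MAX_RETRIES times, an hour apart
    if kind == "soft" and story.get("retries", 0) < MAX_RETRIES:
        story["retries"] = story.get("retries", 0) + 1
        requeue_delayed(story, RETRY_DELAY_SECONDS)
    # hard failures (or too many retries) are just reported in the stats
```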

The scheduler keeps track (by fully qualified domain name) of the following (a rough sketch follows the list):

  1. whether a connection has failed recently (if so, fetching is deferred, to avoid repeated 30s timeouts on connect attempts)
  2. number of active fetches for the FQDN
  3. average latency of successful fetches, which is used to try to keep a fixed number of requests active, like scrapy does (or, alternatively, to limit the connection interval and the number of concurrent connections).
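
A rough sketch of that per-FQDN bookkeeping (the record layout, the fixed target of two active fetches, and the back-off interval are illustrative assumptions):

```python
from dataclasses import dataclass

TARGET_ACTIVE = 2        # illustrative "keep N requests in flight" target
CONN_FAIL_DEFER = 300    # seconds to defer fetches after a connection failure


@dataclass
class DomainState:
    """Hypothetical per-FQDN record covering the three items listed above."""
    last_conn_failure: float = 0.0  # 1. defer fetching after a recent failure
    active: int = 0                 # 2. fetches currently in flight
    avg_latency: float = 1.0        # 3. smoothed latency of successful fetches

    def ok_to_issue(self, now: float) -> bool:
        if now - self.last_conn_failure < CONN_FAIL_DEFER:
            return False            # avoid another 30s connect timeout
        return self.active < TARGET_ACTIVE

    def note_success(self, elapsed: float) -> None:
        # exponentially weighted average of fetch latency; together with
        # TARGET_ACTIVE this bounds the per-domain request rate to roughly
        # TARGET_ACTIVE / avg_latency requests per second
        self.avg_latency = 0.8 * self.avg_latency + 0.2 * elapsed
```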

If a dequeued item can't be started ("issued") immediately, it's requeued to the "fetcher-fast" queue and reappears (after a short delay) at the end of the input queue. This puts load on RabbitMQ (when the queue contains nothing but URLs for a few larger domains), but avoids hairy/delicate bookkeeping in the fetcher: the current Worker class doesn't allow keeping a message "checked out" (in the library sense) of the queue, and even if it did, a worker could only keep a story unacknowledged for 30 minutes before RabbitMQ has a fit (and closes the connection).
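
The PR doesn't spell out how the "fetcher-fast" delay is implemented; one common way to get that behavior in RabbitMQ is a queue with a short per-message TTL that dead-letters expired messages back onto the input queue. A sketch of that wiring (queue names other than "fetcher-fast", the 60-second delay, and the broker address are assumptions):

```python
import pika

params = pika.ConnectionParameters("localhost")     # assumed broker address
conn = pika.BlockingConnection(params)
ch = conn.channel()

# input queue the fetcher consumes from ("fetcher-in" is a hypothetical name)
ch.queue_declare(queue="fetcher-in", durable=True)

# "fetcher-fast": messages sit here for the TTL, then are dead-lettered
# back onto the input queue, so the story reappears after a short delay
ch.queue_declare(
    queue="fetcher-fast",
    durable=True,
    arguments={
        "x-message-ttl": 60 * 1000,                 # hypothetical 60s delay
        "x-dead-letter-exchange": "",               # default exchange ...
        "x-dead-letter-routing-key": "fetcher-in",  # ... routes back to input
    },
)


def fast_requeue(body: bytes) -> None:
    """Called when a dequeued story can't be issued right now."""
    ch.basic_publish(exchange="", routing_key="fetcher-fast", body=body)
```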

Which fetcher (batch or queue) is used is controlled by deploy.sh: the default PIPELINE_TYPE is "batch-fetcher", and running deploy.sh with -T queue-fetcher will deploy a regularly named indexer stack (dev/staging or prod) that uses the queue-based fetcher instead of the batch-fetcher.