mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

indexer/workers/hist-fetcher.py: try to detect non-existent objects #332

Closed by philbudne 3 months ago

philbudne commented 3 months ago

Right now, when processing a "historic" CSV file where the HTML is to be found in an S3 bucket by legacy system downloads_id, if the object doesn't exist:

  1. Nothing is logged
  2. No counter is incremented
  3. The story is retried (in an hour, or more)
  4. Retries are repeated until the story is quarantined

This change tries to detect non-existent objects by having hist-fetcher examine the boto ClientError exception. If the error was "404", the story is logged and counted as "not-found" and then discarded instead of being retried.
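
A rough sketch of the idea, not the actual hist-fetcher code: the names `is_not_found`, `fetch_html`, `counters` and the `get_object` callable are hypothetical, and the exception is inspected by duck typing so the snippet runs without boto installed. It relies only on the `response["Error"]["Code"]` field that botocore's ClientError carries. Treating "NoSuchKey" as not-found alongside "404" is an assumption added here.

```python
from collections import Counter
import logging

logger = logging.getLogger("hist-fetcher-sketch")
counters = Counter()  # hypothetical stand-in for the pipeline's stats counters

# Error codes treated as "object does not exist". "404" is the code named in
# this PR; "NoSuchKey" is an assumption added here, since S3 GetObject
# reports missing keys that way.
NOT_FOUND_CODES = {"404", "NoSuchKey"}


def is_not_found(exc: Exception) -> bool:
    """True if exc looks like a botocore ClientError for a missing object."""
    response = getattr(exc, "response", None) or {}
    code = str(response.get("Error", {}).get("Code", ""))
    return code in NOT_FOUND_CODES


def fetch_html(get_object, downloads_id: str):
    """Return the object body, or None if the object is missing.

    get_object is any callable that fetches by key and raises a
    ClientError-like exception on failure (hypothetical interface).
    """
    try:
        return get_object(downloads_id)
    except Exception as exc:
        if is_not_found(exc):
            logger.info("not-found: %s", downloads_id)
            counters["not-found"] += 1
            return None  # caller discards the story instead of retrying
        raise  # any other error still follows the normal retry path
```

With a real client, `get_object` could wrap the S3 download call; only errors whose code is in `NOT_FOUND_CODES` are swallowed, so throttling or permission failures still reach the existing retry and quarantine handling.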