mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

would be useful to have a uniform, durable property store #203

Open philbudne opened 10 months ago

philbudne commented 10 months ago

All current paths for processing of stories, past and present:

  1. HTTP fetching whole days based on rss-fetcher generated RSS files (S3 mediacloud-public/backup-daily-rss/)
  2. Fetching articles from the legacy system (HTML in S3 bucket mediacloud-downloads-backup, CSVs in mediacloud-database*files buckets)
  3. Processing WARC archives (S3 bucket mediacloud-indexer-archive)

involve processing S3 objects (named files) containing batches of stories (or pointers to stories).

For ALL of the above, it would be useful to have a durable record of what has and has not been processed, so that processing can be automated, reliable, and free of human monitoring and intervention. That suggests (to me) having a uniform way (a single piece of code) to record what processing has been started and completed.
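The "uniform way" could be sketched as a small backend-agnostic interface. All names here (`Tracker`, `FileStatus`, `MemoryTracker`) are illustrative, not story-indexer's actual classes; the point is one API that every ingest path calls, with swappable durable backends (gdbm, SQLite3, S3 objects, an ES index, ...).

```python
# Hypothetical sketch of a uniform tracker interface, not story-indexer code.
from abc import ABC, abstractmethod
from enum import Enum


class FileStatus(Enum):
    NOT_STARTED = "not_started"
    STARTED = "started"
    FINISHED = "finished"


class Tracker(ABC):
    """Durable record of per-input-object processing state."""

    @abstractmethod
    def get_status(self, name: str) -> FileStatus:
        """Return the recorded status for an input object (e.g. an S3 key)."""

    @abstractmethod
    def set_status(self, name: str, status: FileStatus) -> None:
        """Durably record a status transition for an input object."""


class MemoryTracker(Tracker):
    """Non-durable reference implementation (for tests only)."""

    def __init__(self) -> None:
        self._status: dict[str, FileStatus] = {}

    def get_status(self, name: str) -> FileStatus:
        return self._status.get(name, FileStatus.NOT_STARTED)

    def set_status(self, name: str, status: FileStatus) -> None:
        self._status[name] = status
```

A worker would then do `tracker.set_status(key, FileStatus.STARTED)` before fetching a batch and `FINISHED` after, skipping any object whose status is already `FINISHED`.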

Possibilities:

  1. Since all paths above involve files (objects) on Amazon S3, to leave a record in S3 of processing. Small S3 objects (one per input object), either in the input bucket, or in a special bucket (for easy disposal) are likely to be inexpensive.
  2. Usage may be low enough that AWS "SimpleDB" would be very cheap, or even free.
  3. Since the record applies to the particular Elasticsearch cluster in which the articles are being (re)indexed, the ES cluster is redundant (multiple shard replicas distributed across storage nodes), and the data are small, perhaps an ES index that records what has been indexed would be a reasonable solution?
rahulbot commented 10 months ago

Clearly a good idea for avoiding duplicate processing and also for creating an audit trail we can use to validate. In terms of possible implementations, I'm in favor of whichever approach is simplest to implement (so it doesn't introduce its own set of bugs).

To prioritize, I think we have to decide if it is OK to start #198 and #168 without this. Do they require this to be in place in order to be reliable and validated?

philbudne commented 10 months ago

I coded a quick "tracker" using gdbm, which uses a local file, and does basic locking that prevents concurrent access, but it's a separate class, in a separate file, and can be replaced.

philbudne commented 9 months ago

A less cheesy implementation (one that could be more easily backed up) would replace gdbm with SQLite3 (using SQLAlchemy as an ORM, or SQL directly).
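The direct-SQL variant might look like the sketch below, using the stdlib `sqlite3` module; the SQLAlchemy/ORM route would wrap the same schema. The table and column names are illustrative, not story-indexer's actual schema:

```python
# Hypothetical SQLite3-backed tracker using direct SQL.
import sqlite3


class SqliteTracker:
    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS files"
            " (name TEXT PRIMARY KEY, status TEXT NOT NULL)"
        )
        self._conn.commit()

    def set_status(self, name: str, status: str) -> None:
        # UPSERT: insert a new row, or update the status of an existing one
        self._conn.execute(
            "INSERT INTO files (name, status) VALUES (?, ?)"
            " ON CONFLICT(name) DO UPDATE SET status = excluded.status",
            (name, status),
        )
        self._conn.commit()

    def get_status(self, name: str) -> str:
        row = self._conn.execute(
            "SELECT status FROM files WHERE name = ?", (name,)
        ).fetchone()
        return row[0] if row else "not_started"
```

A single `.db` file is trivially copied for backup, and SQLite's own file locking replaces the gdbm locking.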

rahulbot commented 6 months ago

@philbudne does this need more work based on your experience with 2023, or is it fine as is to use for 2022 (#271)?

philbudne commented 6 months ago

Currently using SQLite3... My latest thinking is to use an ES index!

Fine to use as-is for 2022.

philbudne commented 6 months ago

My current thoughts:

philbudne commented 5 months ago

Update on above: rss-puller running in production (since April 24?)