Open philbudne opened 10 months ago
Clearly a good idea for avoiding duplicate processing and also for creating an audit trail we can validate against. In terms of possible implementations, I'm in favor of whichever approach is simplest (so it doesn't introduce its own set of bugs).
To prioritize, I think we have to decide whether it is OK to start #198 and #168 without this. Do they require this to be in place in order to be reliable and validated?
I coded a quick "tracker" using gdbm, which uses a local file, and does basic locking that prevents concurrent access, but it's a separate class, in a separate file, and can be replaced.
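A minimal sketch of what such a gdbm-style tracker could look like (class and method names here are illustrative, not the actual project code; the generic `dbm` module is used so the sketch runs regardless of whether the gdbm backend is installed, and a separate lock file with `fcntl.flock` provides the basic mutual exclusion described):

```python
import dbm
import fcntl


class Tracker:
    """Records which S3 object names have been processed, in a local dbm file.

    Hypothetical sketch: the real implementation is a separate class in a
    separate file and may differ in naming and locking details.
    """

    def __init__(self, path: str):
        self.path = path
        # Separate lock file so concurrent processes serialize access.
        self._lockfile = open(path + ".lock", "w")

    def __enter__(self):
        # Blocks until an exclusive lock is held, preventing concurrent access.
        fcntl.flock(self._lockfile, fcntl.LOCK_EX)
        self._db = dbm.open(self.path, "c")  # "c": create file if missing
        return self

    def __exit__(self, *exc):
        self._db.close()
        fcntl.flock(self._lockfile, fcntl.LOCK_UN)

    def seen(self, name: str) -> bool:
        return name.encode() in self._db

    def mark(self, name: str, status: str = "done") -> None:
        self._db[name.encode()] = status.encode()
```

Usage would be `with Tracker("/path/to/trackerfile") as t: ...`, checking `t.seen(obj_name)` before processing and calling `t.mark(obj_name)` afterward.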
A less cheesy implementation (that could be more easily backed up) would be to replace gdbm with SQLite3 (using SQLAlchemy as an ORM, or directly via SQL).
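For the direct-SQL variant, a sketch along these lines would work (table and class names are hypothetical; SQLite handles its own file locking, and the single database file is easy to back up or inspect with the `sqlite3` CLI):

```python
import sqlite3


class SqliteTracker:
    """Hypothetical SQLite-backed tracker: one row per processed object."""

    def __init__(self, path: str):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS processed ("
            " name TEXT PRIMARY KEY,"
            " status TEXT NOT NULL,"
            " ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"
        )
        self._conn.commit()

    def seen(self, name: str) -> bool:
        row = self._conn.execute(
            "SELECT 1 FROM processed WHERE name = ?", (name,)
        ).fetchone()
        return row is not None

    def mark(self, name: str, status: str = "done") -> None:
        # Upsert: re-marking an object just updates its status.
        self._conn.execute(
            "INSERT INTO processed(name, status) VALUES(?, ?) "
            "ON CONFLICT(name) DO UPDATE SET status = excluded.status",
            (name, status),
        )
        self._conn.commit()
```

The timestamp column is what makes this usable as an audit trail, not just a duplicate filter.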
@philbudne does this need more work based on your experience with 2023, or is it fine as is to use for 2022 (#271)?
Currently using SQLite3... My latest thinking is to use an ES3 index!
Fine to use as-is for 2022.
My current thoughts:
Update on above: rss-puller running in production (since April 24?)
All current paths for processing of stories, past and present, involve processing S3 objects (named files) containing batches of stories (or pointers to stories).
For ALL of the above it would be useful to have a durable record of what has and has not been processed, so that processing can be automated, reliable, and free of human monitoring and intervention. That suggests (to me) having a uniform way (a single piece of code) to record what processing has been started and completed.
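The uniform "single piece of code" interface could be as small as a two-phase started/completed log; a sketch (names are hypothetical, with an in-memory dict standing in for whichever backing store — gdbm, SQLite, or an ES index — ends up chosen):

```python
from enum import Enum


class Status(Enum):
    STARTED = "started"
    DONE = "done"
    FAILED = "failed"


class ProcessLog:
    """Hypothetical uniform API: claim an object, then record the outcome.

    The dict here is a stand-in for a durable backing store.
    """

    def __init__(self):
        self._entries: dict = {}

    def begin(self, name: str) -> bool:
        """Claim an object for processing; False if already claimed or finished."""
        if name in self._entries:
            return False
        self._entries[name] = Status.STARTED
        return True

    def finish(self, name: str, ok: bool = True) -> None:
        self._entries[name] = Status.DONE if ok else Status.FAILED
```

Recording STARTED separately from DONE is what lets automation detect crashed or abandoned runs without human monitoring.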
Possibilities: