Issue scanner: be more careful with recently-loaded batches

jechols commented 6 months ago

NCA makes an effort to be nearly real-time with its knowledge of the batches and issues that are live, on disk, in the NCA workflow, etc. But batches that are loaded to production can take up to a week to be reindexed: sometimes it's nearly instant, sometimes it takes a day, sometimes a week. The root problem is that NCA can index a batch's data while that batch is still being loaded, so it gets a partial view of the batch data, which it then caches until the next full rebuild.

The good news is that NCA does a weekly rebuild of all cached data, so this kind of problem magically goes away. The bad news, of course, is that this is definitely unexpected (and therefore buggy) behavior.

NCA shouldn't cache batch data for batches that are in the process of loading. The problem is that the data isn't cached per batch in any straightforward way: the caching occurs at the HTTP level, when fetching JSON from ONI. We just say "scan batches.json, then scan every batch URL that JSON file has in it." Nothing ties a cached response directly to the batch being processed.

jechols commented 6 months ago

Note that #25 would likely improve or eliminate this; if this by itself is a big task, a full refactor of the issue watcher may need to be prioritized.

jechols commented 3 months ago

Possible solution: use the "ingested" value in batches.json to decide if we even want to read the data in a batch.

- src/chronam/json.go: the BatchMetadata struct needs a new field for the "ingested" value
- src/issuefinder/web.go: in FindWebBatches, when looping over the batch metadata, skip any batch whose ingest time is too recent (24 hours or something?)

Downsides?

Batches are used for dupe-checking. This would give a new window of time for dupes to get into the system.
- Counterpoint: the current system allows dupes. If a batch is half-read, the unread issues can be duplicated for up to a week.

uoregon-libraries / newspaper-curation-app

Issue scanner: be more careful with recently-loaded batches #310