uoregon-libraries / newspaper-curation-app

Suite of front- and back-end tools for the curation of digitized newspaper materials
Apache License 2.0

Issue scanner: be more careful with recently-loaded batches #310

Open jechols opened 4 months ago

jechols commented 4 months ago

NCA makes an effort to be nearly real-time with its knowledge of the batches and issues that are live, on disk, in the NCA workflow, etc. But batches that are loaded to production can take up to a week to be reindexed. Sometimes it's nearly instant, sometimes it takes a day, sometimes it takes a week. The root problem is that NCA tries to index data while a batch load is still in progress, and gets a partial view of the batch data, which it then caches until the next full rebuild.

The good news is that NCA does a weekly rebuild of all cached data, so this kind of problem magically goes away. The bad news, of course, is that this is definitely unexpected (and therefore buggy) behavior.

NCA shouldn't cache batch data for batches that are in the process of loading. The catch is that the cache isn't tied to batches in any straightforward way: the caching occurs at the HTTP level, when fetching JSON from ONI. We just say "scan batches.json, then scan every batch URL that JSON file has in it." There's no direct tie between a cached response and the batch being processed.

jechols commented 3 months ago

Note that #25 would likely improve or eliminate this; if this by itself turns out to be a big task, a full refactor of the issue watcher may need to be prioritized.

jechols commented 1 month ago

Possible solution: use the "ingested" value in batches.json to decide if we even want to read the data in a batch.

Downsides?