Closed jechols closed 5 months ago
Of course this raises the question: where do we put these hashes? Database? Filesystem?
Recently we hit an issue with this again; time to spend a little effort seeing if any quick wins can be found.
Thoughts:
Queue...
Some files are produced by the job functions that make derivatives (JobTypeMakeDerivatives), e.g., the METS file, which is built differently depending on processes. To avoid missing files, and to handle files generated only when a batch is created, we still need to compute SHAs at batch generation. We would have to adjust that step to first look for a cached value, and only compute a SHA if one is missing. This would have to be done carefully, though, to ensure we don't hold onto cached data too long.
This is going to require a change to the uoregon-libraries/gopkg/bagit package. We'll need to either make a bagit-specific Hasher that takes a path (right now it's just there to allow different hash implementations based on hash.Hash), or supply the bagit object with precomputed hashes. Or maybe add a lookup function to bagit.Bag for getting a hash, which falls back to the bagit.Hasher if the lookup fails, so that non-NCA uses of this package keep working without changes?
The plan is now firming up in my head based on some tweaks I've done to our gopkg project (uoregon-libraries/gopkg/fileutil/manifest). Simple!
Problem: it takes ages to build a batch's bagit data, especially over a network-mounted filesystem. This is almost entirely due to the amount of time it takes to run the SHA hashing of all the files.
If we compute each issue's hash data before batches are generated (when files are created or pulled into NCA for the first time? After metadata review happens?), the cost is spread out instead of being incurred all at once. This would also make a dramatic difference when we have to reject a single issue and then requeue a batch: waiting for thousands of SHAs that were already computed previously is just painful.