Closed jechols closed 5 months ago
Of course this raises the question: where do we put these hashes? Database? Filesystem?
Recently we hit an issue with this again; time to spend a little effort seeing if any quick wins can be found.
Thoughts:
Queue...
Some files are produced by the job functions that make derivatives (JobTypeMakeDerivatives), e.g., the METS file, which is built differently depending on processes. To avoid missing files, and to handle files generated only when a batch is created, we still need to compute SHAs at batch generation. We would have to adjust that step to first look for a cached value, and only compute a SHA if one is missing. This would have to be done carefully, though, to ensure we don't hold onto cached data too long.
This is going to require a change to the uoregon-libraries/gopkg/bagit package. We'll need to either make a bagit-specific Hasher that takes a path (right now it's just there to allow different hash implementations based on hash.Hash), or supply the bagit object with precomputed hashes. Or maybe add a lookup function to bagit.Bag for getting a hash, which falls back to the bagit.Hasher if the lookup fails, so that non-NCA uses of this package keep working without changes?
The plan is now firming up in my head based on some tweaks I've done to our gopkg project (uoregon-libraries/gopkg/fileutil/manifest). Simple!
Problem: it takes ages to build a batch's bagit data, especially over a network-mounted filesystem. This is almost entirely due to the amount of time it takes to run the SHA hashing of all the files.
If we compute each issue's hash data before batches are generated (when files are created or pulled into NCA for the first time? After metadata review happens?), the cost is spread out instead of being incurred all at once. This would also make a dramatic difference when we have to reject a single issue and then requeue a batch: waiting for thousands of SHAs that were already computed previously is just painful.