uoregon-libraries / newspaper-curation-app

Suite of front- and back-end tools for the curation of digitized newspaper materials
Apache License 2.0

For bagit, calculate SHA hashes in advance #134

Closed jechols closed 5 months ago

jechols commented 3 years ago

Problem: it takes ages to build a batch's bagit data, especially over a network-mounted filesystem. Almost all of that time goes to computing SHA hashes for every file.

If we compute each issue's hash data before batches are generated (when files are created or pulled into NCA for the first time? After metadata review happens?), the cost is spread out instead of being incurred all at once. This would also make a dramatic difference when we have to reject a single issue and then requeue a batch: waiting for thousands of SHAs to be recomputed when they were already computed earlier is just painful.

jechols commented 3 years ago

Of course this raises the question: where do we put these hashes? Database? Filesystem?

jechols commented 6 months ago

Recently we had an issue with this again; time to spend a little while seeing if any quick wins can be found.

jechols commented 6 months ago

Thoughts:

To avoid missing files, and to handle files generated only when a batch is created, we still need to compute SHAs on batch generation. We would have to adjust that step to look for a cached value first, and compute the SHA only if the cached value is missing. This would have to be done carefully, though, to ensure we don't hold onto cached data too long.
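
A rough sketch of the "cached value first, compute on miss" idea, with one possible guard against stale cache entries. Everything here is hypothetical and not taken from NCA: `SumCache`, `cachedSum`, and `writeTemp` are made-up names, the cache is an in-memory map (where the real values would live is still an open question), SHA-256 is assumed as the algorithm, and staleness is judged by file size and modification time only.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"time"
)

// cachedSum is a hypothetical cache record: the digest plus the file size and
// modification time seen when the digest was computed.
type cachedSum struct {
	Hex     string
	Size    int64
	ModTime time.Time
}

// SumCache is a hypothetical in-memory cache keyed by file path.
type SumCache struct {
	entries  map[string]cachedSum
	Computed int // how many times a file actually had to be hashed
}

// NewSumCache returns an empty cache.
func NewSumCache() *SumCache {
	return &SumCache{entries: make(map[string]cachedSum)}
}

// Sum returns the cached digest for path if the file's size and mtime still
// match the cached record; otherwise it hashes the file and stores the result.
func (c *SumCache) Sum(path string) (string, error) {
	info, err := os.Stat(path)
	if err != nil {
		return "", err
	}
	if e, ok := c.entries[path]; ok && e.Size == info.Size() && e.ModTime.Equal(info.ModTime()) {
		return e.Hex, nil
	}

	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	sum := hex.EncodeToString(h.Sum(nil))
	c.entries[path] = cachedSum{Hex: sum, Size: info.Size(), ModTime: info.ModTime()}
	c.Computed++
	return sum, nil
}

// writeTemp is a demo helper that writes content to a new temporary file.
func writeTemp(content string) (string, error) {
	f, err := os.CreateTemp("", "nca-sum-*")
	if err != nil {
		return "", err
	}
	defer f.Close()
	_, err = f.WriteString(content)
	return f.Name(), err
}

func main() {
	path, err := writeTemp("hello")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.Remove(path)

	c := NewSumCache()
	first, _ := c.Sum(path)  // hashes the file
	second, _ := c.Sum(path) // served from the cache
	fmt.Println(first == second, c.Computed)
}
```

Size plus mtime is a cheap check, but it can miss an in-place rewrite that keeps the same size within the filesystem's timestamp granularity, so a real version might need something stricter for files NCA itself regenerates.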

jechols commented 6 months ago

This is going to require a change to the uoregon-libraries/gopkg/bagit package. We'll need to either make a bagit-specific Hasher that takes a path (right now it's just there to allow different hash implementations based on hash.Hash) or supply the bagit object with precomputed hashes. Or maybe have a lookup function in bagit.Bag for getting the hash, which falls back to the bagit.Hasher if the lookup fails, so that non-NCA uses of this package work without changes?
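
For the third option, something along these lines is what I have in mind. This is only a sketch: the `Bag`, `Hasher`, and `LookupFunc` types below are stand-ins written for illustration and are not the actual gopkg/bagit definitions, and `Checksum` and `SHA256Hasher` are assumed names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash"
	"io"
	"os"
)

// Hasher is a stand-in for the package's hasher: a name plus a constructor
// for any hash.Hash implementation.
type Hasher struct {
	Name string
	New  func() hash.Hash
}

// SHA256Hasher returns a Hasher backed by crypto/sha256.
func SHA256Hasher() Hasher {
	return Hasher{Name: "sha256", New: sha256.New}
}

// LookupFunc is a hypothetical hook: it returns a precomputed hex digest for
// path and true, or false if the caller has nothing cached for that path.
type LookupFunc func(path string) (string, bool)

// Bag is a stand-in for the bag type. A nil Lookup means "always hash", so
// callers that never set it keep today's behavior.
type Bag struct {
	Hasher Hasher
	Lookup LookupFunc
}

// Checksum asks Lookup first and only reads the file when the lookup misses.
func (b *Bag) Checksum(path string) (string, error) {
	if b.Lookup != nil {
		if sum, ok := b.Lookup(path); ok {
			return sum, nil
		}
	}

	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := b.Hasher.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	// Pretend NCA already knows the digest for one file in the bag.
	known := map[string]string{"data/page1.tif": "precomputed-digest"}
	b := &Bag{
		Hasher: SHA256Hasher(),
		Lookup: func(path string) (string, bool) {
			sum, ok := known[path]
			return sum, ok
		},
	}

	sum, err := b.Checksum("data/page1.tif") // no disk read: lookup hit
	fmt.Println(sum, err)
}
```

The nil check is what keeps non-NCA callers unaffected: if nobody sets `Lookup`, every file goes through the hasher exactly as before.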

jechols commented 6 months ago

The plan is now firming up in my head based on some tweaks I've done to our gopkg project:

Simple!