iftechfoundation / ifarchive-admintool

Admin script for IF Archive work

Cache md5 checksums #33

Closed erkyrath closed 8 months ago

erkyrath commented 8 months ago

We spend a lot of time reading the md5 of a file -- that hash is used as a key for the uploads table.

(That's every time you view the upload info for a file, move a file, rename a file, or zip a file.) (If we want to display upload info in a directory listing, a la https://github.com/iftechfoundation/ifarchive-admintool/issues/18 , that's even more md5 hashing.)

This is significant for large files. Currently there's 3GB of "snapshot-20XX.tgz" sitting in Unprocessed. Hashing those files takes a total of about 8-9 seconds on the current setup.

Keying on (pathname, filesize, modtime) would be entirely sufficient. If we kept an (in-memory) cache mapping that triple to md5, we could reduce most hash checks to a stat() call, rather than reading the entire file.

(I want to keep the logic of the main app code reading an md5 hash for database lookups. This DB layout is also used in upload.py; it's generally useful, just sometimes slow.)

(Expire old entries on a time cycle of a week, say, just to avoid the risk of cache bloat.)
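
A rough sketch of what that cache could look like, in Python since that's what the admin tool is written in. The names here (`Md5Cache`, `compute_md5`, `CACHE_MAX_AGE`) are made up for illustration and are not taken from the actual code; the one-week expiry is just the figure suggested above.

```python
import hashlib
import os
import time

# Hypothetical constant: expire entries after one week (in seconds).
CACHE_MAX_AGE = 7 * 24 * 60 * 60


def compute_md5(pathname, chunk_size=1024 * 1024):
    """The slow path: read the whole file in chunks and return its md5 hex digest."""
    digest = hashlib.md5()
    with open(pathname, "rb") as fl:
        while True:
            chunk = fl.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


class Md5Cache:
    """In-memory map from (pathname, filesize, modtime) to an md5 hex digest."""

    def __init__(self, max_age=CACHE_MAX_AGE):
        self.max_age = max_age
        # key: (pathname, size, mtime in ns) -> value: (md5 hex, time stored)
        self.entries = {}

    def get_md5(self, pathname, now=None):
        if now is None:
            now = time.time()
        # One stat() call is enough to build the cache key.
        stat = os.stat(pathname)
        key = (pathname, stat.st_size, stat.st_mtime_ns)
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] <= self.max_age:
            return entry[0]
        # Cache miss (or stale entry): hash the file and remember the result.
        md5 = compute_md5(pathname)
        self.entries[key] = (md5, now)
        return md5

    def expire(self, now=None):
        """Drop entries older than max_age, to keep the cache from bloating."""
        if now is None:
            now = time.time()
        self.entries = {
            key: val
            for key, val in self.entries.items()
            if now - val[1] <= self.max_age
        }
```

The caller still gets an md5 string back from `get_md5()`, so the database lookup logic does not have to change. A file that is modified in place gets a new size or modtime, hence a new key; the old entry just sits there until `expire()` clears it out.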

erkyrath commented 8 months ago

Done. The cache is per-process, and Apache likes to launch eight server processes, so you're not guaranteed to never see a slow page twice in a row: each process has to hash the file once before its own cache is warm. You could see a slow page load eight times. But it's better than not caching.