Should calculating a checksum on a file be cached?

rpm-software-management / librepo

A library providing C and Python (libcURL like) API for downloading packages and linux repository metadata in rpm-md format

http://rpm-software-management.github.io/librepo/

GNU Lesser General Public License v2.1


Should calculating a checksum on a file be cached? #235

Open malmond77 opened 3 years ago

malmond77 commented 3 years ago

See https://github.com/rpm-software-management/librepo/pull/234/files/2743583e444d745526e9bb8fad249ce6ea08e0f9#r594141462 for context

@m-blaha suggests that all calculated checksums should be eligible for caching if caching == TRUE when calling lr_checksum_fd_compare. I tend to agree, but the cache key does not indicate which hash algorithm was used, so we need to resolve that first. If we don't, a cached md5 checksum could be returned when a sha256 is expected, for example, and it would never match.

m-blaha commented 3 years ago

The keys (xattr names) have changed recently (see https://github.com/rpm-software-management/librepo/pull/232/commits/7a7cd445b35f9b753caf6c8d38eed7e4c7cb14c3). Originally the cache key was something like user.Zif.MdChecksum[123456], where 123456 was the mtime of the file. This caused problems precisely because the hash algorithm was not stored. Now we use two kinds of keys instead: user.Librepo.checksum.mtime for the timestamp and a set of user.Librepo.checksum.HASH_TYPE keys for the actual checksums. HASH_TYPE is most often sha256, but can vary according to the hash used in a particular repository.
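
As a rough illustration of the key layout described above, here is a minimal Python sketch. It is not librepo's implementation: the function `cached_checksum` and the `attrs` dict are hypothetical, with the dict standing in for the file's extended attributes so the example runs on any filesystem (on Linux, `os.getxattr` and `os.setxattr` could back it instead). Only the two key names come from the comment above. Storing the mtime in nanoseconds is also an assumption.

```python
import hashlib
import os

# Key names as described in the thread.
MTIME_KEY = "user.Librepo.checksum.mtime"
CHECKSUM_KEY_PREFIX = "user.Librepo.checksum."


def cached_checksum(attrs, path, hash_type):
    """Return the hex digest of `path`, caching it per hash type.

    `attrs` is a plain dict used as a stand-in for the file's xattrs
    (an assumption made so the sketch is portable).
    """
    mtime = str(os.stat(path).st_mtime_ns)
    if attrs.get(MTIME_KEY) != mtime:
        # The file changed (or nothing is cached yet): every stored
        # checksum is stale, so drop them all and record the new mtime.
        for key in [k for k in attrs if k.startswith(CHECKSUM_KEY_PREFIX)]:
            del attrs[key]
        attrs[MTIME_KEY] = mtime

    key = CHECKSUM_KEY_PREFIX + hash_type
    if key not in attrs:
        # Cache miss for this particular algorithm only.
        with open(path, "rb") as f:
            attrs[key] = hashlib.new(hash_type, f.read()).hexdigest()
    return attrs[key]
```

Because the hash type is part of the key, a cached sha256 value and a cached sha512 value for the same file can coexist, and neither can be mistaken for the other.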

malmond77 commented 3 years ago

@m-blaha : awesome - I didn't pick up on the keys being fixed in this way before. Now that this is resolved, what is your opinion? Is this safe to implement?

m-blaha commented 3 years ago

I think it is. Once the right checksum of the file is calculated, it can be saved in xattr for future usage.
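
One possible reading of the proposal, sketched in Python under stated assumptions: `checksum_compare` is a hypothetical helper, not the signature of lr_checksum_fd_compare, and the `attrs` dict again stands in for xattrs. The mtime validation is left out to keep the focus on the `caching` flag.

```python
import hashlib

# Key prefix as described in the thread; HASH_TYPE is appended to it.
KEY_PREFIX = "user.Librepo.checksum."


def checksum_compare(attrs, data, hash_type, expected, caching):
    """Compare the `hash_type` digest of `data` against `expected`.

    Hypothetical helper. When `caching` is true, a previously stored
    digest is reused, and any newly calculated digest is stored,
    whether or not it matches `expected`.
    """
    key = KEY_PREFIX + hash_type
    digest = attrs.get(key) if caching else None
    if digest is None:
        digest = hashlib.new(hash_type, data).hexdigest()
        if caching:
            # The digest describes the file content itself, so it is
            # valid to cache regardless of the comparison result.
            attrs[key] = digest
    return digest == expected
```

With `caching` false, the sketch never reads or writes `attrs`, so callers that opt out are unaffected.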