psy0rz / zfs_autobackup

ZFS autobackup is used to periodically back up ZFS filesystems to other locations. Easy to use and very reliable.
https://github.com/psy0rz/zfs_autobackup
GNU General Public License v3.0

Fixes #225 zfs-check: efficient handling of sparse files #234

Open kyle0r opened 9 months ago

kyle0r commented 9 months ago

Per #225, this PR offers a solution - critique and changes very welcome.

The PR introduces two main changes:

  1. If the xxhash module is available, xxh3_64 becomes the default --hash algorithm; otherwise it falls back to sha1.
  2. Implementation of a new zfs-check --hash argument, which enables invocation with a specific hash algorithm.

This PR proposes to use xxHash, an extremely fast non-cryptographic hash algorithm, as the default hash algorithm for zfs-check. This algorithm is able to hash sparse chunks efficiently. The previous revision's hardcoded sha1 algorithm was inefficient at handling sparse chunks, as documented in #225.

One can get a quick idea of the performance of various hash algorithms here: https://github.com/Cyan4973/xxHash#benchmarks. It's clear that sha1 is near the bottom of the leaderboard.

💡 It might be desirable to add xxhash to requirements.txt, BUT the caveat is that the operating system needs the shared libxxhash installed. I've omitted this for now, deferring to the leaders of the project to make a decision on this.
To manually install xxhash in a Python env: python3 -m pip install --upgrade xxhash.

Note that the code in the PR gracefully handles the absence of the xxhash module and implements fallback logic.
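The fallback behaviour described above can be sketched roughly as follows (names are illustrative and not necessarily those used in the PR's BlockHasher.py): prefer xxh3_64 when the optional xxhash module imports cleanly, otherwise fall back to hashlib's sha1.

```python
import hashlib

# Optional dependency: the xxhash module may need the shared libxxhash
# installed at the OS level on some systems.
try:
    import xxhash
    DEFAULT_HASH = "xxh3_64"
except ImportError:
    xxhash = None
    DEFAULT_HASH = "sha1"  # safe, always-available fallback

def make_hasher(name=DEFAULT_HASH):
    """Return a fresh hasher object supporting .update()/.hexdigest()."""
    if name == "xxh3_64":
        if xxhash is None:
            raise ValueError("xxhash module not installed; try --hash sha1")
        return xxhash.xxh3_64()
    # hashlib.new() covers sha1 and the other stdlib algorithms
    return hashlib.new(name)
```

With this shape, an explicit `--hash sha1` still works everywhere, while nodes with xxhash installed get the fast default automatically.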

BlockHasher.py

ZfsCheck.py

psy0rz commented 9 months ago

very nice, however i think it's a bit tricky to make the default algorithm dynamic: if one node has xxhash and the other node hasn't, it will look to the user as if all the checksums mismatch.

so maybe keep the default a safe sha1? we can add information to the performance tips page about installing the xxhash and select it?

kyle0r commented 9 months ago

> very nice, however i think it's a bit tricky to make the default algorithm dynamic: if one node has xxhash and the other node hasn't, it will look to the user as if all the checksums mismatch.
>
> so maybe keep the default a safe sha1? we can add information to the performance tips page about installing the xxhash and select it?

That is fair. I had not considered it because I'm used to being cognizant of the technical details of the checksum etc.

Let me have a think. I agree it makes sense for the out-of-the-box / default behaviour to be very predictable. Best user experience etc.

Perhaps stderr output should contain checksum details and a hint to performance tips?

kyle0r commented 9 months ago

Additional thoughts. Perhaps the stdout should contain a predictable header defining the specification of the operation, i.e. how zfs-check was invoked: the options/arguments used, block size, hash etc., and this header would form part of the comparison? After all, two invocations with different arguments would likely produce different results.

This approach is also nice for the use case where someone stores the checksums long term, so they can be aware of the spec that was used to generate them. Perhaps offering a generated CLI invocation which could be copy + pasted for repeatability?

Hypothetically, in the --check use case, this header, if present, could be parsed by zfs-check to invoke a like-for-like comparison?
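As a hypothetical sketch of the idea (field names and function are my own assumptions, not part of the PR): the first output line could be a '#'-commented JSON object carrying the invocation spec, so both humans and zfs-check itself can read it back.

```python
import json
import sys
import time

def write_spec_header(out, hash_name, chunk_size, skip):
    """Write a commented JSON spec line describing this invocation.

    Illustrative only: field names are assumptions, not zfs-check's API.
    """
    spec = {
        "tool": "zfs-check",
        "hash": hash_name,        # e.g. "sha1" or "xxh3_64"
        "chunk_size": chunk_size, # bytes per hashed chunk
        "skip": skip,             # the skip option, if used
        "timestamp": int(time.time()),
    }
    # '#' prefix keeps the line distinguishable from checksum records
    out.write("# " + json.dumps(spec, sort_keys=True) + "\n")
```

Extra metadata (tool version, exact argv for copy-paste repeatability) could be added to the same object without breaking parsers.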

psy0rz commented 9 months ago

That sounds good, first line as a comment with the "hashing" config? (Things like the hash and chunksize and skip option?)

kyle0r commented 9 months ago

> That sounds good, first line as a comment with the "hashing" config? (Things like the hash and chunksize and skip option?)

Yes, along those lines. I was thinking that we could make it an object like JSON so it's easy to parse. I think using the argparse module we could output the exact invocation, and this could also be fed back into zfs-check --check to invoke a like-for-like comparison, and/or be copy-pasted by the user. In the JSON object additional metadata could be added, like invocation timestamp, version, stuff like that?

It looks like the BlockHasher.compare() method is reading the --check file handle line by line, so it should be fairly straightforward to check if the first line contains the checksum spec header?
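A minimal sketch of that first-line check, assuming the line-by-line reading described above (the helper and its behaviour are illustrative, not the PR's actual BlockHasher.compare() code):

```python
import itertools
import json

def read_spec_header(fh):
    """Peek at the first line of a --check file.

    Returns (spec_dict, line_iterator). If the first line is not a
    commented JSON header, spec is None and the line is pushed back
    onto the iterator so no data is lost.
    """
    first = fh.readline()
    if first.startswith("# {"):
        return json.loads(first[2:]), fh
    # No header present: re-chain the consumed line ahead of the rest
    return None, itertools.chain([first], fh)
```

Because the header is consumed before any hashing happens, the caller could use the parsed spec to configure the comparison (hash, chunk size, skip) ahead of invoking the low-level hashing code.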

You've probably seen that a new OpenZFS silent corruption bug has recently surfaced (silent for ~18 months?!). I'm currently doing analysis on the available info to see how I might be impacted and how to "check" for it 😉 See: https://github.com/openzfs/zfs/issues/15526 and https://redd.it/1826lgs. I pinged you a chat on reddit if you want to chat about this PR or the corruption.

psy0rz commented 9 months ago

oh no another silent corruption bug in zfs? :(

i think that the header should be read earlier. BlockHasher is a low level function, and by then it's too late to change any settings? but i have to look into it.

psy0rz commented 9 months ago

Coincidentally, the data corruption bug is also related to sparse handling of files, it seems. They just released a fix: https://github.com/openzfs/zfs/releases/tag/zfs-2.1.14