Open kyle0r opened 9 months ago
very nice, however i think its a bit tricky to make the default algorithm dynamic: if one node has xxhash and the other node hasn't , it will look to the user as if all the checksums mismatch.
so maybe keep the default a safe sha1? we can add information to the performance tips page about installing the xxhash and select it?
very nice, however i think its a bit tricky to make the default algorithm dynamic: if one node has xxhash and the other node hasn't , it will look to the user as if all the checksums mismatch.
so maybe keep the default a safe sha1? we can add information to the performance tips page about installing the xxhash and select it?
That is fair. I had not considered it because I'm used to being cognitive of the technical details of the checksum etc.
Let me have a think. I agree it makes sense for the out of box / default behaviour to be very predictable. Best user experience etc.
Perhaps stderr output should contain checksum details and a hint to performance tips?
Additional thoughts. Perhaps the stdout should contain a predictable header (defining the specification of the operation - how zfs-check
was invoked) which contains the spec of the invocation, the options/arguments used, block, hash etc and this would form part of the comparison? In the end two invocations with different arguments would likely produce different results.
This approach is also nice for the use case where someone stores the checksums long term so they can be aware of the the spec that was used to generate the checksums? Perhaps offering a generated cli invocation which could be copy + pasted for repeatability?
Hypothetically, in the --check
use case, this header if present, could be parsed by zfs-check
to invoke a like for like output?
That sounds good, first line as a comment with the "hashing" config? (Things like the hash and chunksize and skip option?)
That sounds good, first line as a comment with the "hashing" config? (Things like the hash and chunksize and skip option?)
Yes, along those lines. I was thinking that we could make it an object like json so its easy to parse. I think using the argparse
module we could output the exact invocation and this could also be feed back into zfs-check --check
to invoke a like-for-like invocation, and/or be copy-pasted by the user. In the json object additional metadata could be added like invocation timestamp, version, stuff like that?
It looks like the BlockHasher.compare()
method is reading the --check
file handle line by line, so it should be fairly straightforward to check if the first line contains the checksum spec header?
You've probably seen that a new OpenZFS silent corruption bug has recently surfaced (silent for ~18 months?!), I'm currently doing analysis on the available info to see how I might be impacted and how to "check" it 😉 see: https://github.com/openzfs/zfs/issues/15526 and https://redd.it/1826lgs. I pinged you a chat on reddit if you want to chat about this PR or the corruption.
oh no another silent corruption bug in zfs? :(
i think that the header should be read earlier. Blockhasher is a low level function, and by then its too late to change any settings? but i have to look into it.
Coincidently the datacorruption bug is also related to sparse handling of files it seems. They just released a fix: https://github.com/openzfs/zfs/releases/tag/zfs-2.1.14
Per #225, this PR offers a solution - critique and changes very welcome.
The PR introduces two main changes:
xxhash
module is available,xxh3_64
is becomes the default--hash
algorithm, else fallback tosha1
.zfs-check --hash
which enables invocation with a specific hash algorithm.This PR proposes to use xxHash - an extremely fast non-cryptographic hash algorithm, as the default hash algorithm for
zfs-check
. This algorithm is able to hash sparse chunks efficiently. This previous revisions hardcodedsha1
algorithm was inefficient in handling sparse chunks. This is documented in #225.One is able to get a quick idea of the performance of various hash algorithms here: https://github.com/Cyan4973/xxHash#benchmarks. Its clear to see
sha1
is near the bottom of the leader-board.💡 It might be desirable to add
xxhash
torequirements.txt
BUT the caveat is that operating system needs sharedlibxxhash
installed. I've omitted this for now, deferring to the leaders of the project to make a decision on this.To manually install
xxhash
in a python env:python3 -m pip install --upgrade xxhash
.Note that the code in the PR gracefully handles the absence of the
xxhash
module and implements fallback logic.BlockHasher.py
ZfsCheck.py
--hash
. The choices for the arg are generated based on what is detected in the python env.--hash
is is validated against the arg choices.