psy0rz / zfs_autobackup

ZFS autobackup is used to periodically back up ZFS filesystems to other locations. Easy to use and very reliable.
https://github.com/psy0rz/zfs_autobackup
GNU General Public License v3.0

zfs-check: efficient handling of sparse files #225

Open kyle0r opened 11 months ago

kyle0r commented 11 months ago

Greetings,

This is a feature/enhancement request for zfs-check, and a braindump on possible implementation.

There was prior discussion of this topic on reddit in June 2022. A screenshot is attached for quick reference, and here is a link to the thread:


Having recently restored some datasets from backups, I had reason to do some checksums on various parts of the data, and zfs-check came back to my mind as a nice tool for this.

I checked the latest code in the repo and didn't see any support for sparse file handling yet, so I thought I'd make this request and provide some details. If I get the time I will make a PR for it myself, but in the meantime here is the request/info.

To illustrate what I'm referring to, take for example a raw GPT partition file with XFS partition(s) stored on a ZFS dataset. Example creation: qemu-img create -f raw test.raw 4T, then the usual gdisk commands to create the GUID partition table, and mkfs.xfs -m crc=1,reflink=1 to create a filesystem on the partition.

This set of commands creates a sparse 4TiB file (in my case stored on a ZFS dataset), creates the partition table, and formats a partition with XFS. Such a raw disk would then be provisioned to a KVM guest, for example one running on Proxmox.
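
For what it's worth, the same kind of sparse file can be created from Python as well. A minimal sketch; the path and size are made up for illustration, and whether the blocks stay unallocated depends on the underlying filesystem:

import os

path = "test.raw"        # example path, not my real vm disk image
size = 4 * 1024**4       # 4 TiB logical size

# truncate() extends the file without writing data, so on a filesystem with
# sparse file support (ZFS included) no data blocks are allocated.
with open(path, "wb") as f:
    f.truncate(size)

st = os.stat(path)
print(st.st_size)          # 4398046511104 (logical size)
print(st.st_blocks * 512)  # allocated bytes: (close to) 0 for a fresh sparse file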

Today when zfs-check checks such a file, it chunks up and iterates over all the bytes in the sparse file. It does not treat the sparse blocks differently from normal blocks, so it actually reads the full 4TiB and computes the checksums for every chunk. This was covered in the reddit post from June 2022 (see above).

What is a sparse file? A sparsely allocated file is a file that contains one or more HOLE(s). A HOLE is a sequence of zeros that (normally) has not been allocated in the underlying file storage. Cite: https://linux.die.net/man/2/lseek
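
Python exposes the same constants via os.lseek on platforms that support them, so a quick "does this file contain a HOLE?" check is easy to sketch. This assumes Python >= 3.3 and a filesystem/kernel with SEEK_HOLE support:

import os

def is_sparse(path):
    """True if the file contains at least one HOLE (i.e. is sparsely allocated)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.lseek(fd, 0, os.SEEK_END)
        try:
            # SEEK_HOLE finds the first hole at or after offset 0. Every file
            # has an implicit hole at EOF, so a non-sparse file returns EOF here.
            first_hole = os.lseek(fd, 0, os.SEEK_HOLE)
        except OSError:
            return False  # SEEK_HOLE not supported (or empty file): treat as not sparse
        return first_hole < end
    finally:
        os.close(fd)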

As a sanity check of the June 2022 post, I measured how long it takes to "optimally read" the sparse store1 4TiB raw disk, which has ~2.6TiB normally allocated and the remainder sparsely allocated.

| 4TiB raw disk read | duration |
| --- | --- |
| Sparse optimised | 5-6 hours |
| zfs-check read | 8-9 hours |

So, by my calculation there is definitely room for optimisation in zfs-check in this regard.

As an example of common utilities handling sparse files efficiently, consider using dd to read a 4TiB sparse file in the following way, piping into xxh64sum to create the checksum:

root@viper:~# dd iflag=direct bs=1M if=/store5b/data/vm/raw/.zfs/snapshot/2023-07-24-blah/images/102/vm-102-disk-0.raw status=progress | xxh64sum
4397679509504 bytes (4.4 TB, 4.0 TiB) copied, 21016 s, 209 MB/s
4194304+0 records in
4194304+0 records out
4398046511104 bytes (4.4 TB, 4.0 TiB) copied, 21016.2 s, 209 MB/s
7e2042fc6c4c43d9  stdin

OK, nothing special at face value, BUT it gets interesting when we study the details during the operation using pv to monitor the relevant file descriptor. The output format is $(date) %{elapsed time} %{current data transfer rate} %{average data transfer rate} %{bytes transferred so far}:

root@viper:~# { while ps -p 3331237 >/dev/null 2>&1; do st=$(date); timeout 2m pv -d 3331237:0  -F "${st} %t %r %a %b"; done }
Wed Oct 11 02:21:27 AM UTC 2023 0:01:59 [ 102MiB/s] [ 100MiB/s]  222GiB
Wed Oct 11 02:23:27 AM UTC 2023 0:01:59 [ 104MiB/s] [ 102MiB/s]  234GiB
Wed Oct 11 02:25:27 AM UTC 2023 0:01:59 [93.6MiB/s] [ 103MiB/s]  246GiB
Wed Oct 11 02:27:27 AM UTC 2023 0:01:59 [ 105MiB/s] [ 100MiB/s]  258GiB
Wed Oct 11 02:29:27 AM UTC 2023 0:01:59 [ 888MiB/s] [ 255MiB/s]  288GiB
Wed Oct 11 02:31:27 AM UTC 2023 0:01:59 [ 144MiB/s] [ 222MiB/s]  314GiB
Wed Oct 11 02:33:27 AM UTC 2023 0:01:59 [ 143MiB/s] [ 128MiB/s]  329GiB
Wed Oct 11 02:35:27 AM UTC 2023 0:01:59 [ 121MiB/s] [ 130MiB/s]  345GiB
Wed Oct 11 02:37:27 AM UTC 2023 0:01:59 [ 128MiB/s] [ 151MiB/s]  362GiB
Wed Oct 11 02:39:27 AM UTC 2023 0:01:59 [1.86GiB/s] [ 932MiB/s]  471GiB
Wed Oct 11 02:41:27 AM UTC 2023 0:01:59 [1.90GiB/s] [1.94GiB/s]  704GiB
Wed Oct 11 02:43:27 AM UTC 2023 0:01:59 [1.93GiB/s] [1.92GiB/s]  934GiB
Wed Oct 11 02:45:27 AM UTC 2023 0:01:59 [ 189MiB/s] [ 867MiB/s] 1.01TiB
Wed Oct 11 02:47:28 AM UTC 2023 0:01:59 [ 112MiB/s] [ 146MiB/s] 1.03TiB
Wed Oct 11 02:49:28 AM UTC 2023 0:01:35 [ 164MiB/s] [ 135MiB/s] 1.04TiB

What you see here is a human readable log containing 2 minute samples of the progress and data rates for the dd operation.

Note the 3rd column, the %{average data transfer rate}. You'll notice how sometimes a sample is a few hundred MiB/s, which is determined by the physical read-speed limits of the underlying vdev(s) when reading normally allocated blocks, and sometimes you'll see ~2GiB/s, which is obviously an order of magnitude faster. The high-speed samples document where dd, its libraries, the kernel and OpenZFS have conspired together to efficiently process the sparse blocks and sections of the sparse 4TiB raw partition file.

NB: In my example I used xxHash by Yann Collet because, as per the name, it is an "Extremely fast non-cryptographic hash algorithm" (github link): it's modern, maintained, and can provide "near RAM speed" hashing. zfs-check defaults to sha1, which has limited throughput and is perhaps not best suited for this workload. Note the hash performance comparison table in the xxHash GitHub README.

The sparse allocation optimisation is also reflected in the 4th column, %{bytes transferred so far}: in the samples that cover the sparsely allocated sections, the amount of logical data read per sample is naturally a lot higher than in the samples for normally allocated blocks.

Outcome: progress in reading through the file is much quicker (than zfs-check) because the sparse allocations/sections are handled efficiently.

A quick note on the system in my examples. It's an entry-level enterprise storage node running Proxmox 7.x (Debian based) with the Ubuntu 6.2.x kernel. 12 physical Xeon CPU cores across two sockets and 128 GB RAM.

For my dd example, what I've not yet understood is which code (or combination of code) in the chain is responsible for the sparse file reading optimisation. From glancing at the dd code it doesn't seem that the optimisation happens at that level, and you can see from my dd example that I'm not giving it special arguments to handle sparse files. I think the optimisation is more likely located in the lower-level read() and seek() code paths, in combination with the filesystem's support for sparse file allocation. OpenZFS definitely supports efficient sparse file reads, as demonstrated in the dd example provided.

On Debian the coreutils package, which contains the dd source, might have some clues: https://sources.debian.org/src/coreutils/9.1-1/. I took a quick look but didn't find a smoking gun; in fact the dd code primarily uses lseek(fd, offset, SEEK_CUR), which means either I missed something or the optimisation actually happens in the kernel/filesystem code (for example fs/read_write.c here) or in OpenZFS code. NB: Proxmox uses the Ubuntu kernel.

For reference, and perhaps as part of the solution for zfs-check, there is a useful (though probably not exactly what is needed) Python answer here: https://stackoverflow.com/a/62312222. That approach seeks through the entire file first to determine the ranges of DATA and HOLE(s).
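
Along those lines, a rough sketch of walking a file's DATA/HOLE map with os.lseek. This is a hypothetical helper, not code from that answer or from zfs_autobackup:

import os

def extent_map(fd):
    """Yield (kind, offset, length) tuples, where kind is 'DATA' or 'HOLE'."""
    end = os.lseek(fd, 0, os.SEEK_END)
    pos = 0
    while pos < end:
        try:
            data = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError:
            data = end                           # no data left: rest of file is one HOLE
        if data > pos:
            yield ("HOLE", pos, data - pos)
        if data >= end:
            break
        hole = os.lseek(fd, data, os.SEEK_HOLE)  # EOF counts as a (virtual) hole
        yield ("DATA", data, hole - data)
        pos = hole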

From what I can tell zfs_autobackup/BlockHasher.py will need a logic update along the following lines (a rough code sketch follows the list):

  1. Ensure logic exists somewhere to skip the "sparse logic" for file descriptors, such as pipes, that don't support seeking backwards or hole detection.
  2. lseek(fd, offset, SEEK_HOLE) to determine whether the first HOLE offset is the end of the file:
    • IF the returned offset == EOF, the file contains no HOLE(s) and is not sparse - rewind the fd - the existing logic can be used.
    • IF the returned offset != EOF, the file contains at least one HOLE - use the sparse logic below.
  SPARSE LOGIC:
  3. Start with lseek(fd, offset, SEEK_DATA) and process DATA until a HOLE chunk is detected (a chunk that is a sequence of zeros).
  4. HOLE DETECTED: lseek(fd, offset, SEEK_HOLE), iterating forward n chunks at a time to find the end of the HOLE.
    • if the return == offset then we are still in a contiguous hole
      • keep iterating and seeking forwards, counting nr_contiguous_hole_chunks
    • if the return != offset and != EOF then there is more DATA to process
      • compute the checksum of the HOLE: zeros x nr_contiguous_hole_chunks x chunk_size
      • reposition the fd to resume processing DATA at the end offset of the last detected contiguous HOLE chunk
    • if the return == EOF (there are no more HOLEs) then reposition the fd at the end offset of the last detected contiguous HOLE chunk and iterate over the remaining DATA.
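
To make the braindump a bit more concrete, here is a rough sketch of the shape such a chunk hasher could take. To be clear: this is not the real BlockHasher code; the chunk size, the xxhash dependency and the pre-computed hole digest are all my assumptions, and short reads and partial trailing chunks are glossed over.

import os
import xxhash

CHUNK_SIZE = 128 * 1024 * 1024        # e.g. --count 128 x 1 MiB blocks

def _zero_digest(size):
    """Hash of `size` zero bytes, computed once and reused for every HOLE chunk."""
    h = xxhash.xxh64()
    zeros = b"\0" * (1024 * 1024)
    for _ in range(size // len(zeros)):
        h.update(zeros)
    h.update(b"\0" * (size % len(zeros)))
    return h.hexdigest()

HOLE_DIGEST = _zero_digest(CHUNK_SIZE)

def chunk_hashes(fd):
    """Yield (chunk_nr, hexdigest) for each CHUNK_SIZE chunk, skipping reads of holes."""
    end = os.lseek(fd, 0, os.SEEK_END)
    pos = 0
    chunk_nr = 0
    while pos < end:
        try:
            next_data = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError:
            next_data = end                    # trailing hole: no data until EOF
        if next_data >= pos + CHUNK_SIZE and pos + CHUNK_SIZE <= end:
            # The whole chunk lies inside a HOLE: emit the pre-computed digest.
            yield (chunk_nr, HOLE_DIGEST)
        else:
            # Chunk contains data (or is a short final chunk): read and hash it.
            os.lseek(fd, pos, os.SEEK_SET)
            data = os.read(fd, min(CHUNK_SIZE, end - pos))
            yield (chunk_nr, xxhash.xxh64(data).hexdigest())
        pos += CHUNK_SIZE
        chunk_nr += 1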

It would be interesting to find examples of other code that has implemented similar logic, to sanity-check the above and look for any inefficiencies or flaws.

💡 IDEA: Maybe it's enough to switch zfs_autobackup/BlockHasher.py from fh.seek(pos) to os.lseek(fd, pos, whence), and to use xxhash rather than the slower sha1, and the optimisation would be performed automatically at a lower level? This is as per the dd example, which uses lseek(fd, offset, SEEK_CUR) with no mention of SEEK_DATA or SEEK_HOLE. It might be worth a try before inventing new logic per my braindump herein.

python ref for os.lseek(): https://docs.python.org/3/library/os.html#os.lseek
python ref for fh.seek(): https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects

Thanks for reading, look forward to any comments or critique.

Cheers

Kyle

kyle0r commented 11 months ago

After studying the zfs_autobackup/BlockHasher.py code in a bit more detail, I couldn't help thinking that it "looks OK". In theory the sparse file optimisation should happen at a lower level, as in my dd example (assuming I'm right about that for now).

I thought I'd make a simple change first, so I changed the hash from sha1 to xxh64, and this seems to give a positive result.

So, I followed the Development wiki page and made the following change:

image
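
The screenshot isn't reproduced in text here, but the gist is that xxhash hasher objects expose the same update()/hexdigest() interface as hashlib objects, so swapping the hash factory is close to a one-line change. A sketch of the idea, not the literal diff:

import hashlib
import xxhash

# Both factories produce objects with the same update()/hexdigest() interface,
# so whichever one BlockHasher instantiates can be swapped with minimal churn.
for factory in (hashlib.sha1, xxhash.xxh64):
    h = factory()
    h.update(b"some chunk of data")
    print(h.hexdigest())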

You can see in the following quick sample that zfs-check was able to hit >2GiB/s during some of the samples, which usually correlates with the moments a program reading a sparse file encounters a sparsely allocated area, aka a HOLE. With my simple change, xxh64 was able to hash the zeros efficiently and boost the read performance of zfs-check.

Thu Oct 12 02:02:41 AM UTC 2023 0:01:59 [ 141MiB/s] [ 102MiB/s]  389GiB
Thu Oct 12 02:04:41 AM UTC 2023 0:01:59 [ 104MiB/s] [ 190MiB/s]  411GiB
Thu Oct 12 02:06:41 AM UTC 2023 0:01:59 [3.71GiB/s] [2.03GiB/s]  653GiB
Thu Oct 12 02:08:41 AM UTC 2023 0:01:59 [ 260MiB/s] [3.11GiB/s] 1.00TiB
Thu Oct 12 02:10:41 AM UTC 2023 0:01:59 [ 153MiB/s] [ 166MiB/s] 1.02TiB
Thu Oct 12 02:12:41 AM UTC 2023 0:01:59 [ 129MiB/s] [ 138MiB/s] 1.04TiB

As another quick sanity check of the performance difference between sha1 and xxh64, I performed these tests:

image

So it's clear that sha1 is a significant bottleneck.
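
The screenshot isn't reproduced in text; for anyone who wants to run a similar comparison, a quick-and-dirty throughput test along these lines should do (it assumes the xxhash package is installed):

import hashlib
import time
import xxhash

block = b"\0" * (1024 * 1024)          # 1 MiB of zeros per update
blocks = 4 * 1024                      # 4 GiB total per algorithm

for name, factory in (("sha1", hashlib.sha1), ("xxh64", xxhash.xxh64)):
    h = factory()
    start = time.monotonic()
    for _ in range(blocks):
        h.update(block)
    elapsed = time.monotonic() - start
    print(f"{name}: {blocks / 1024 / elapsed:.2f} GiB/s")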

My conclusion right now is that zfs-check is actually, indirectly, able to handle and read sparse files efficiently, assuming that the relevant library, kernel and filesystem code supports efficient sparse file IO. The read performance seems to have been limited by sha1.

Something else that can make a significant performance difference is the combination of the zfs-check options --block-size and --count.
In my testing above I used --block-size 1024000 --count 128, aka 128 1MiB blocks per chunk-hash, so zfs-check should read 1MiB blocks at a time in this example.
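
In other words, each chunk-hash entry covers block-size × count bytes; with the values above:

block_size = 1024000        # --block-size as used in my testing
count = 128                 # --count
chunk_bytes = block_size * count
print(chunk_bytes)                    # 131072000 bytes hashed per chunk-hash entry
print(4 * 2**40 // chunk_bytes + 1)   # ~33555 chunk-hash entries for the 4TiB file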

Comments and critique welcome.

psy0rz commented 11 months ago

sounds interesting. i'm all for changing the default hash algo to something like xxh64, if that's not too obscure a library. we're still in development/beta anyway, and zfs-check isn't widely used, so we can make big changes.

psy0rz commented 11 months ago

same goes for a different default block-size and count, i didn't do much testing to choose those (just guesstimated values).

psy0rz commented 11 months ago

it's very cool actually, this is very useful for testing huge volumes :)

kyle0r commented 11 months ago

Hey. Acknowledged. It might be useful to offer the user control of the hash class via an option. And perhaps default to a (currently) secure but faster hash like sha512.

For example, offering options like --hash=secure and --hash=fast, and also allowing a specific hash to be specified. Defaulting to --hash=secure?

Unrelated to sparse files: from a security perspective it might be desirable to ask the user for a hash "pepper", which would make the computed hash lists more secure and decoupled from the contents of the original blocks. And/or perhaps warn users about the potential risks of unsalted/unpeppered hashes.
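
To illustrate the pepper idea only; the pepper handling/storage is hand-waved, and xxh64 is not a keyed cryptographic hash, so this is a sketch of the concept rather than a recommendation:

import os
import xxhash

# A secret "pepper" mixed into every chunk hash, so the resulting chunk-hash
# list can no longer be matched against hashes computed by someone else over
# known plaintext chunks. The pepper would have to be stored/supplied again
# to re-verify the same data later.
# NB: a keyed hash (e.g. hashlib.blake2b(key=...)) would be the textbook way.
pepper = os.urandom(16)

def peppered_chunk_hash(chunk: bytes) -> str:
    h = xxhash.xxh64(pepper)   # prefix the pepper...
    h.update(chunk)            # ...then the chunk bytes
    return h.hexdigest()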

Thinking out loud, it would be possible to make further optimisations for reading sparse blocks, similar to the sparse logic I outlined in post 1. Hole chunks have identical checksums, so the computation over the zeros could be skipped and replaced by pre-computed values. This might be overly complicated, however, and a source of bugs, so it's perhaps a risk vs. reward decision.

psy0rz commented 11 months ago

i think you're confusing password hash security with block-hash security.

even with a chunksize of only 1MB, it would be impossible to guess the original file from a hash i think?

yes, it might be possible to create a hash collision if you use an insecure hash function, which is bad in the case of password security and authentication.

but a collision is no problem in our case.

psy0rz commented 11 months ago

i tried to calculate the odds of guessing a 1MB chunk, from a hash, but python gave up when the number had 4300 digits :P

image

psy0rz commented 11 months ago

(so i dont think we need to make it configurable yet)

kyle0r commented 11 months ago

> i think you're confusing password hash security with block-hash security.
>
> even with a chunksize of only 1MB, it would be impossible to guess the original file from a hash i think?
>
> yes, it might be possible to create a hash collision if you use an insecure hash function, which is bad in the case of password security and authentication.
>
> but a collision is no problem in our case.

My thinking certainly comes from the security of password hashes and content-token hashes, for example content tokens used to tokenise GDPR-sensitive data sets. From an InfoSec best-practice perspective, I thought it might be good to decouple the chunk-hashes from the actual chunk bytes. Yes, the probability of collisions is low for a given call to zfs-check. I was looking at it more from the perspective that if someone had a set of chunk-hashes for known chunks (for a given --block-size and --count, a factor that further reduces the likelihood for non-default values), those chunk-hashes could potentially be compared against someone else's chunk-hashes, and wherever they matched, that someone would know the contents behind the given chunk-hash. Adding a pepper or some other measure would further reduce that likelihood and separate the chunk-hashes from the actual chunk bytes. I hope I have explained this OK. It is certainly an InfoSec paranoia consideration, coming from my "secure by design" development background.


Back on to the main topic

I ran my tweaked zfs-check against my store5 and store5-backup pools for the following 4TiB sparse raw GPT partition file containing an XFS filesystem, stored on a ZFS dataset.

image

The jobs took approx. 6.08 and 5.51 hours respectively.

Here are some samples from the left job

...
Thu Oct 12 06:26:17 PM UTC 2023 0:01:59 [76.1MiB/s] [ 141MiB/s]  368GiB
Thu Oct 12 06:28:17 PM UTC 2023 0:01:59 [3.71GiB/s] [2.92GiB/s]  716GiB
Thu Oct 12 06:30:17 PM UTC 2023 0:01:59 [ 208MiB/s] [2.61GiB/s] 1.01TiB
Thu Oct 12 06:32:17 PM UTC 2023 0:01:59 [ 138MiB/s] [ 156MiB/s] 1.02TiB
...
Thu Oct 12 11:34:29 PM UTC 2023 0:01:59 [83.7MiB/s] [ 105MiB/s] 3.35TiB
Thu Oct 12 11:36:29 PM UTC 2023 0:01:59 [66.1MiB/s] [ 102MiB/s] 3.36TiB
Thu Oct 12 11:38:29 PM UTC 2023 0:01:59 [ 113MiB/s] [ 104MiB/s] 3.37TiB
Thu Oct 12 11:40:29 PM UTC 2023 0:01:59 [3.71GiB/s] [1.39GiB/s] 3.53TiB
Thu Oct 12 11:42:29 PM UTC 2023 0:01:59 [3.98GiB/s] [3.77GiB/s] 3.97TiB
Thu Oct 12 11:44:29 PM UTC 2023 0:00:05 [3.95GiB/s] [3.95GiB/s] 4.00TiB

Here are some samples from the right job

...
Thu Oct 12 06:27:35 PM UTC 2023 0:01:59 [ 110MiB/s] [ 133MiB/s]  365GiB
Thu Oct 12 06:29:35 PM UTC 2023 0:01:59 [3.61GiB/s] [2.04GiB/s]  607GiB
Thu Oct 12 06:31:35 PM UTC 2023 0:01:59 [ 147MiB/s] [3.47GiB/s] 1.00TiB
Thu Oct 12 06:33:35 PM UTC 2023 0:01:59 [ 141MiB/s] [ 182MiB/s] 1.02TiB
...
Thu Oct 12 11:01:45 PM UTC 2023 0:01:59 [92.3MiB/s] [88.1MiB/s] 3.35TiB
Thu Oct 12 11:03:45 PM UTC 2023 0:01:59 [83.7MiB/s] [88.1MiB/s] 3.36TiB
Thu Oct 12 11:05:45 PM UTC 2023 0:01:59 [76.0MiB/s] [89.0MiB/s] 3.37TiB
Thu Oct 12 11:07:45 PM UTC 2023 0:01:59 [3.63GiB/s] [1.51GiB/s] 3.55TiB
Thu Oct 12 11:09:45 PM UTC 2023 0:01:59 [3.97GiB/s] [3.74GiB/s] 3.99TiB
Thu Oct 12 11:11:46 PM UTC 2023 0:00:02 [3.83GiB/s] [3.83GiB/s] 4.00TiB

So it's clear that the existing zfs-check code plus a fast hash class is capable of relatively good sparse file chunk-hashing.

Good news (for my data at least 😉): both snapshots appear to be intact, which is nice given my recent events losing a disk etc. The pools and datasets have slightly different configurations, specifically around the compression and checksum settings.

root@viper:~# md5sum /store5/data/vm/raw/summer-checksums.txt /store5-backup/data/vm/raw/summer-checksums.txt
f7f6164de7edd861ea2e0eb55176dc39  /store5/data/vm/raw/summer-checksums.txt
f7f6164de7edd861ea2e0eb55176dc39  /store5-backup/data/vm/raw/summer-checksums.txt

A note on HOLE chunks, which in this zfs-check invocation always had the checksum 25a95bba19ee2310. There were 15234 HOLE chunks, which in theory could have been detected by checking whether the chunk was a HOLE before computing the hash. In theory the detection would be faster than the computation, i.e. skip the read(), call lseek(fd, offset, SEEK_DATA) until the next DATA chunk is found, and output the known HOLE checksum for the HOLE chunks rather than computing the hash of the zeros.

root@viper:~# pv /store5/data/vm/raw/summer-checksums.txt | awk '{ print $2 }' | sort | uniq -c | sort -k1nr,1 |head
 742KiB 0:00:00 [91.9MiB/s] [==================================================================>] 100%
  15234 25a95bba19ee2310
      1 0000cf2484c5a0fb
      1 0004d6fe7593acfe
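
For a sense of scale, those HOLE chunks add up (quick arithmetic using the chunk size from this run):

hole_chunks = 15234
chunk_bytes = 1024000 * 128                # --block-size x --count
print(hole_chunks * chunk_bytes / 2**40)   # ~1.82 TiB of zeros read and hashed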

Sanity check on the HOLE chunk checksum:

root@viper:~# dd if=/dev/zero bs=1024000 count=128 status=progress | xxh64sum
128+0 records in
128+0 records out
25a95bba19ee2310  stdin

Math bug: I used duckduckgo or google to convert 1 MiB to bytes, didn't verify the answer, and for some reason ended up with 1024000 bytes (rounding issue?). The correct result is 1MiB = 1048576 bytes (1024^2), so the value I got and used was missing 24 KiB per block ¯\(ツ)/¯. It's not a deal breaker for this testing and doesn't invalidate anything. One potential negative is sub-optimal unaligned block read I/O, but in theory the ZFS I/O scheduler should smooth most of that out anyway. netgraph showed the average I/O operation size was consistently ~1MiB:

image

Next steps?

So where do we go from here?

I'm not sure I have a nice environment to run the docker test stuff without having to create something. Would you prefer to take ownership and make the changes to offer a --hash option to the user?

I could offer to create an lxc to run docker in my lab, try to get the test stuff working and then make a PR? This might get delayed because tbh messing around with lxc and docker isn't my favourite pastime! It's been a PITA in the past.

How are you running docker and the tests in your development setup?

psy0rz commented 11 months ago

ah i understand where you're coming from in that case. for now, let's not add salts, to keep things simple.

just create a pull request that changes the default hash. if you really want a --hash option feel free to add it. don't worry about the tests too much. also: github will run the testsuite when you create a pull request.

kyle0r commented 11 months ago

Acknowledged. I'll put a PR together in the near future.

A quick note from further testing: it would appear that 4.21GiB/s is the upper CPU-bound limit on my system for a single zfs-check python thread.
Spec: 2x Xeon CPU E5-2620 v3 (2.4GHz+, Haswell Q3'14). 6 cores per CPU = 12 cores (24 threads) 15 MB Cache
128 GB DDR4 ECC RAM @1866MT/s.

Hypothetically, zfs_autobackup/BlockHasher.py could be adapted for multithreading: X chunks could be read serially and then checksummed in parallel, obviously paying attention to retain the output order.
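
A rough sketch of that shape, purely illustrative and not part of zfs_autobackup today. It relies on the hash releasing the GIL during update() (hashlib does this for large buffers; whether the xxhash bindings do would need checking), and it bounds how many chunks are held in memory:

import hashlib
from collections import deque
from concurrent.futures import ThreadPoolExecutor

CHUNK = 128 * 1024 * 1024      # example chunk size
WORKERS = 4                    # hashing threads; reading stays serial

def parallel_chunk_hashes(path):
    """Read chunks serially, hash them in parallel, yield digests in file order."""
    def digest(chunk):
        return hashlib.sha1(chunk).hexdigest()   # or an xxhash factory

    with ThreadPoolExecutor(max_workers=WORKERS) as pool, open(path, "rb") as fh:
        pending = deque()                        # FIFO of futures keeps output ordered
        while True:
            chunk = fh.read(CHUNK)
            if chunk:
                pending.append(pool.submit(digest, chunk))
            # Drain results once enough chunks are in flight (bounds memory),
            # and drain everything after EOF.
            while pending and (len(pending) > WORKERS or not chunk):
                yield pending.popleft().result()
            if not chunk:
                break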