kyle0r opened 11 months ago
After studying the `zfs_autobackup/BlockHasher.py` code in a bit more detail, I couldn't help thinking that it "looks OK". In theory, sparse file optimisation should be performed at a lower level, as with my `dd` example (assuming I'm right on that for now).
I thought to make a simple change first: I changed the hash from `sha1` to `xxh64`, and this seems to give a positive result. So, I followed the Development wiki page and made the following change:
You can see in the following quick sample that `zfs-check` was able to hit >2GiB/s during some of the samples, which usually correlates to when a program reading a sparse file encounters a sparsely allocated area, aka a HOLE. With my simple change, `xxh64` was able to efficiently hash the zeros and boost the read performance of `zfs-check`.
```
Thu Oct 12 02:02:41 AM UTC 2023 0:01:59 [ 141MiB/s] [ 102MiB/s] 389GiB
Thu Oct 12 02:04:41 AM UTC 2023 0:01:59 [ 104MiB/s] [ 190MiB/s] 411GiB
Thu Oct 12 02:06:41 AM UTC 2023 0:01:59 [3.71GiB/s] [2.03GiB/s] 653GiB
Thu Oct 12 02:08:41 AM UTC 2023 0:01:59 [ 260MiB/s] [3.11GiB/s] 1.00TiB
Thu Oct 12 02:10:41 AM UTC 2023 0:01:59 [ 153MiB/s] [ 166MiB/s] 1.02TiB
Thu Oct 12 02:12:41 AM UTC 2023 0:01:59 [ 129MiB/s] [ 138MiB/s] 1.04TiB
```
As another quick sanity check, to test the performance difference of `sha1` vs `xxh64` I performed these tests:
So it's clear to see that `sha1` is a significant bottleneck.
My conclusion right now is that `zfs-check` is actually indirectly able to handle and read sparse files efficiently, assuming that the relevant library, kernel and filesystem code supports efficient sparse file IO. The read performance seems to have been limited by `sha1`.
Something else that can make a significant performance difference is the `zfs-check` options `--block-size` and `--count`. In my testing above I used `--block-size 1024000 --count 128`, aka 128 ~1MiB blocks per chunk-hash; `zfs-check` should read ~1MiB blocks at a time in this example.
Comments and critique welcome.
sounds interesting. i'm all for changing the default hash algo to something like xxh64 if that's not a too obscure library. we're still in development/beta anyway, and zfs-check isn't widely used, so we can make big changes.
same goes for a different default block-size and count, i didn't do much testing to choose those. (just guesstimated values)
it's very cool actually, this is very useful to test huge volumes :)
Hey. Acknowledged. It might be useful to offer the user control of the hash class via an option, and perhaps default to a (currently) secure but faster hash like sha512. For example, offering good options like `--hash=secure` and `--hash=fast`, and also allowing a specific hash to be specified. Defaulting to `--hash=secure`?
Unrelated to sparse files: from a security perspective it might be desirable to ask the user for a hash "pepper", which would make the computed hash lists more secure and disconnected from the contents of the original blocks. And/or perhaps warn users about the potential risks of non-salted or non-peppered hashes.
Thinking out loud, it would be possible to make further optimisations for reading sparse blocks, similar to the sparse logic I outlined in post 1. HOLE chunks have identical checksums, therefore computation of the zeros can be skipped and replaced by pre-computed values. This might be overly complicated, however, and be a source of bugs, so it's perhaps a risk vs. reward decision.
i think you're confusing password hash security with block-hash security.
even with a chunksize of only 1MB, it would be impossible to guess the original file from a hash i think?
yes, it might be possible to create a hash collision if you use an insecure hash function, which is bad in the case of password security and authentication.
but a collision is no problem in our case.
i tried to calculate the odds of guessing a 1MB chunk from a hash, but python gave up when the number had 4300 digits :P
(so i don't think we need to make it configurable yet)
> i think you're confusing password hash security with block-hash security.
> even with a chunksize of only 1MB, it would be impossible to guess the original file from a hash i think?
> yes, it might be possible to create a hash collision if you use an insecure hash function, which is bad in the case of password security and authentication.
> but a collision is no problem in our case.
My thinking certainly comes from the security of password hashes and content token hashes, for example content tokens used to tokenise GDPR-sensitive data sets. From an InfoSec best-practice perspective, I thought it might be good to increase the improbability and decouple the chunk hashes from the actual chunk bytes. Yes, the probability of collisions is low for a given call to `zfs-check`. I was looking at it more from the perspective that if someone had a bunch of chunk-hashes for known chunks (for a given `--block-size` and `--count`, a factor that further increases the improbability for non-default values), those chunk-hashes could potentially be checked against someone else's chunk-hashes, and if there were collisions, someone would then know the contents of the given chunk-hash. Adding a pepper or some other measure would further increase the improbability of seeing collisions and separate the chunk-hashes from the actual chunk bytes. I hope I have explained this OK. It is certainly an InfoSec paranoia consideration; it comes from my "secure by design" development background.
Back on to the main topic.
I ran my tweaked `zfs-check` against my `store5` and `store5-backup` pools for the following 4TiB sparse raw GPT partition file containing an XFS filesystem, stored on a ZFS dataset. The jobs took approx. 6.08 and 5.51 hours respectively.
Here are some samples from the left job:
```
...
Thu Oct 12 06:26:17 PM UTC 2023 0:01:59 [76.1MiB/s] [ 141MiB/s] 368GiB
Thu Oct 12 06:28:17 PM UTC 2023 0:01:59 [3.71GiB/s] [2.92GiB/s] 716GiB
Thu Oct 12 06:30:17 PM UTC 2023 0:01:59 [ 208MiB/s] [2.61GiB/s] 1.01TiB
Thu Oct 12 06:32:17 PM UTC 2023 0:01:59 [ 138MiB/s] [ 156MiB/s] 1.02TiB
...
Thu Oct 12 11:34:29 PM UTC 2023 0:01:59 [83.7MiB/s] [ 105MiB/s] 3.35TiB
Thu Oct 12 11:36:29 PM UTC 2023 0:01:59 [66.1MiB/s] [ 102MiB/s] 3.36TiB
Thu Oct 12 11:38:29 PM UTC 2023 0:01:59 [ 113MiB/s] [ 104MiB/s] 3.37TiB
Thu Oct 12 11:40:29 PM UTC 2023 0:01:59 [3.71GiB/s] [1.39GiB/s] 3.53TiB
Thu Oct 12 11:42:29 PM UTC 2023 0:01:59 [3.98GiB/s] [3.77GiB/s] 3.97TiB
Thu Oct 12 11:44:29 PM UTC 2023 0:00:05 [3.95GiB/s] [3.95GiB/s] 4.00TiB
```
Here are some samples from the right job:
```
...
Thu Oct 12 06:27:35 PM UTC 2023 0:01:59 [ 110MiB/s] [ 133MiB/s] 365GiB
Thu Oct 12 06:29:35 PM UTC 2023 0:01:59 [3.61GiB/s] [2.04GiB/s] 607GiB
Thu Oct 12 06:31:35 PM UTC 2023 0:01:59 [ 147MiB/s] [3.47GiB/s] 1.00TiB
Thu Oct 12 06:33:35 PM UTC 2023 0:01:59 [ 141MiB/s] [ 182MiB/s] 1.02TiB
...
Thu Oct 12 11:01:45 PM UTC 2023 0:01:59 [92.3MiB/s] [88.1MiB/s] 3.35TiB
Thu Oct 12 11:03:45 PM UTC 2023 0:01:59 [83.7MiB/s] [88.1MiB/s] 3.36TiB
Thu Oct 12 11:05:45 PM UTC 2023 0:01:59 [76.0MiB/s] [89.0MiB/s] 3.37TiB
Thu Oct 12 11:07:45 PM UTC 2023 0:01:59 [3.63GiB/s] [1.51GiB/s] 3.55TiB
Thu Oct 12 11:09:45 PM UTC 2023 0:01:59 [3.97GiB/s] [3.74GiB/s] 3.99TiB
Thu Oct 12 11:11:46 PM UTC 2023 0:00:02 [3.83GiB/s] [3.83GiB/s] 4.00TiB
```
So it's clear to see that the existing `zfs-check` code plus a fast hash class is capable of relatively good sparse file chunk-hashing.
Good news (for my data at least 😉): both snapshots appear to be integral, which is nice given my recent events losing a disk etc. The pools and datasets have slightly different configurations, specifically around the `compression` and `checksum` settings.
```
root@viper:~# md5sum /store5/data/vm/raw/summer-checksums.txt /store5-backup/data/vm/raw/summer-checksums.txt
f7f6164de7edd861ea2e0eb55176dc39  /store5/data/vm/raw/summer-checksums.txt
f7f6164de7edd861ea2e0eb55176dc39  /store5-backup/data/vm/raw/summer-checksums.txt
```
A note on HOLE chunks, which in this `zfs-check` invocation always had the checksum `25a95bba19ee2310`. There were 15234 HOLE chunks, which in theory could have been determined by checking if the chunk was a HOLE before computing the hash. In theory the determination would be faster than the computation, i.e. skipping the `read()` and calling `lseek(fd, offset, SEEK_HOLE)` until the next DATA chunk is detected, and outputting the known HOLE checksum for the HOLE chunks rather than computing the hash of the zeros.
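In Python that pre-check could look roughly like this (a sketch only; it assumes `SEEK_DATA` support in the kernel and filesystem, and a raw file descriptor rather than BlockHasher's current file-handle API):

```python
import errno
import os

def chunk_is_hole(fd: int, offset: int, chunk_size: int) -> bool:
    """Return True if [offset, offset + chunk_size) lies entirely inside a
    HOLE, in which case the read() and hash computation can be skipped and
    the known HOLE checksum emitted instead."""
    try:
        next_data = os.lseek(fd, offset, os.SEEK_DATA)
    except OSError as e:
        if e.errno != errno.ENXIO:
            raise  # e.g. EINVAL: filesystem doesn't support SEEK_DATA
        return True  # no data at or after offset: trailing hole
    return next_data >= offset + chunk_size
```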
```
root@viper:~# pv /store5/data/vm/raw/summer-checksums.txt | awk '{ print $2 }' | sort | uniq -c | sort -k1nr,1 | head
 742KiB 0:00:00 [91.9MiB/s] [==================================================================>] 100%
  15234 25a95bba19ee2310
      1 0000cf2484c5a0fb
      1 0004d6fe7593acfe
```
Sanity check on the HOLE chunk checksum:
```
root@viper:~# dd if=/dev/zero bs=1024000 count=128 status=progress | xxh64sum
128+0 records in
128+0 records out
25a95bba19ee2310  stdin
```
Math bug: I used duckduckgo or google to convert 1 MiB to bytes and didn't verify the answer was correct, and for some reason ended up with 1024000 bytes (rounding issue?), but the correct result is 1MiB = 1048576 bytes (1024^2); the answer I got and used was missing 24 KiB ¯\\\_(ツ)\_/¯. It's not a deal breaker for this testing and doesn't invalidate anything. One potential negative is sub-optimal unaligned block read I/O, but in theory the ZFS I/O scheduler should smooth most of that out anyway. netgraph showed the average I/O operation size was consistently ~1MiB:
So where do we go from here?
I'm not sure I have a nice environment to run the docker test stuff without having to create something. Would you prefer to take ownership and make the changes to offer a `--hash` option to the user?
I could offer to create an lxc to run docker on my lab, try to get the test stuff working, and then make a PR? This might get delayed because, tbh, messing around with lxc and docker isn't my favourite pastime! It's been a PITA in the past.
How are you running docker and the tests in your development setup?
ah i understand where you're coming from in that case. for now, let's not add salts to keep things simple.
just create a pull request that changes the default hash. if you really want a --hash option feel free to add it. don't worry about the tests too much. also: github will run the testsuite when you create a pull request.
Acknowledged. I'll put a PR together in the near future.
A quick note from further testing: 4.21GiB/s would appear to be the upper CPU boundary on my system for a single `zfs-check` python thread.
Spec: 2x Xeon CPU E5-2620 v3 (2.4GHz+, Haswell Q3'14), 6 cores per CPU = 12 cores (24 threads), 15 MB cache, 128 GB DDR4 ECC RAM @ 1866 MT/s.
Hypothetically, `zfs_autobackup/BlockHasher.py` could be adapted for multithreading: X chunks could be read serially and then X chunks checksummed in parallel, obviously paying attention to retain the output order.
Greetings,
This is a feature/enhancement request for `zfs-check`, and a braindump on a possible implementation. There has been discourse on this topic in the past, 2022-June on reddit. Screenshot for quick reference, and here is a link to the thread:
Having recently restored some datasets from backups, I had reason to do some checksums on various parts of the data, and `zfs-check` came back to my mind as a nice tool for this. I checked the latest code in the repo and didn't see any support for sparse file handling yet, so I thought I'd make this request and provide some details. If I get the time I will make a PR for it myself, but in the meantime here is the request/info.
To illustrate what I'm referring to, take for example a raw GPT partition file with XFS partition(s) stored on a ZFS dataset. Example creation command: `qemu-img create -f raw test.raw 4T`, then the usual `gdisk` commands to create the GUID partition table, and `mkfs.xfs -m crc=1,reflink=1` to create a filesystem partition. This set of commands creates a sparse 4TiB file (in my case stored on a zfs dataset), creates the partition table, and formats a partition with XFS; this raw disk would be provisioned on a kvm, for example running on proxmox.

Today, when `zfs-check` checks such a file it will chunk up and iterate over all the bytes in the sparse file. It will not treat the sparse blocks differently to normal blocks. Therefore it will actually read 4TiB of bytes and compute the checksums for the chunks etc. This was covered in the reddit post from 2022-June (see above).

What is a sparse file? A sparsely allocated file is a file that contains one or more HOLE(s). A HOLE is a sequence of zeros that (normally) has not been allocated in the underlying file storage. Cite: https://linux.die.net/man/2/lseek
As a sanity check for the post in 2022-June, I checked how long it takes to "optimally read" the sparse `store1` 4TiB raw disk, with its normal allocation of ~2.6TiB and the remainder sparsely allocated, vs. the `zfs-check` read. So, by my calculation there is definitely room for optimisation in `zfs-check` in this regard.

An example of common utilities handling sparse files efficiently: consider using `dd` to read a 4TiB sparse file in the following way, and `xxh64sum` to create the checksum:

OK, nothing special on face value, BUT when we study the details during the operation using `pv` to monitor the relevant file descriptor (the output format is `$(date) %{Current data transfer rate} %{Average data transfer rate} %{Bytes transferred so far}`):

What you see here is a human-readable log containing 2 minute samples of the progress and data rates for the `dd` operation.

Note the 3rd column, the `%{Average data transfer rate}`. You'll notice how sometimes a sample is a few hundred MiB/s, which is determined by the physical limits of the underlying vdev(s) read speed when reading normally allocated blocks, and sometimes you'll see ~2GiB/s, which is obviously an order of magnitude faster. The high speed samples document where `dd`, its libraries, the kernel and OpenZFS have conspired together to efficiently process the sparse blocks and sections of the sparse 4TiB raw partition file.

NB: In my example I used `xxHash` by Yann Collet because, as per the name, it is an "Extremely fast non-cryptographic hash algorithm" (github link); it's modern, maintained, and can provide "near RAM speed" hashing. `zfs-check` defaults to `sha1`, which has limited performance and is perhaps not best suited for sparse file handling. Note the hash performance comparison table on the xxHash GitHub README.

The sparse allocation optimisation is also reflected in the 4th column, `%{Bytes transferred so far}`: in the samples that illustrate reading of the sparsely allocated blocks, the amount of logical data read per sample is naturally a lot more than in the samples for normally allocated blocks.

Outcome: progress in reading through the file is much quicker (than `zfs-check`) because the sparse allocations/sections are handled efficiently.

A quick note on the system in my examples: it's an entry-level enterprise storage node running proxmox 7.x, which is Debian bookworm based and running the Ubuntu 6.2.x kernel. 12 physical Xeon CPU cores over two sockets and 128 GB RAM.
For my `dd` example, what I've not yet understood is which code (or code combo) in the chain is responsible for the sparse file reading optimisation. From glancing at the `dd` code it doesn't seem that the optimisation is happening at that level, and you can see from my `dd` example that I'm not giving it any special arguments to handle sparse files. I think the optimisation is more likely to be located in lower-level `read()` and `seek()` functions, in combination with the filesystem's support for sparse file allocation. OpenZFS definitely supports efficient sparse file reads, as demonstrated in the `dd` example provided.

On Debian the `coreutils` package, which contains the `dd` source, might have some clues: https://sources.debian.org/src/coreutils/9.1-1/. I took a quick look but didn't find a smoking gun; in fact the `dd` code is primarily using `lseek(fd, offset, SEEK_CUR)`, which means I missed something, or the optimisation is actually happening in the kernel/filesystem code (for example fs/read_write.c here) or in OpenZFS code. NB: Proxmox uses the Ubuntu kernel.

For reference, and perhaps part of the solution for `zfs-check`: for python there is a useful, but probably not exactly what is needed, answer here: https://stackoverflow.com/a/62312222. This approach seeks the entire file first to determine the ranges of DATA and HOLE(s).

From what I can tell, `zfs_autobackup/BlockHasher.py` will need a logic update, something like the following:

- use `lseek(fd, offset, SEEK_HOLE)` to determine if the offset is the end of the file
- while in DATA, read and hash chunks from `fd` - the existing logic can be used
- use `lseek(fd, offset, SEEK_DATA)` until a HOLE chunk is detected (a chunk that is a sequence of zeros)
- on a HOLE, use `lseek(fd, offset, SEEK_HOLE)` to iterate forward, seeking n chunks to find the end of the HOLE
- if the returned `offset` is still in a contiguous hole, increment `nr_contiguous_hole_chunks`
- if the returned `offset` is past the hole and != EOF, then there is some more DATA to process
- for the HOLE chunks, output the pre-computed checksum of `zeros` x `nr_contiguous_hole_chunks` x `chunk_size` rather than computing it
- `lseek` the `fd` to resume processing DATA at the end offset of the last detected contiguous HOLE chunk; the logic then iterates over the remaining DATA

It would be interesting to find examples of other code that has implemented similar logic, to sanity check this logic and check for any inefficiencies or flaws.
💡 IDEA: Maybe it's enough to switch `zfs_autobackup/BlockHasher.py` from using `fh.seek(pos)` to using `os.lseek(fd, pos, whence)`, and to use `xxhash` rather than the slower `sha1`, and the optimisation would be performed automatically at a lower level? As per the `dd` example, which is using `lseek(fd, offset, SEEK_CUR)` with no mention of `SEEK_DATA` or `SEEK_HOLE`. This might be worth a try before trying to invent new logic per my brain dump herein.

python ref for `os.lseek()`: https://docs.python.org/3/library/os.html#os.lseek
python ref for `fd.seek()`: https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects

Thanks for reading, I look forward to any comments or critique.
Cheers
Kyle