clarfonthey opened 4 years ago
So this is a request for `hexyl` to be faster?
Closing for now, as it's unclear to me.
Sorry I never got to this. And, well, yes.
All the major filesystems expose the concept of sparse files, where the file is fragmented on disk and entire sections are marked as being zeroes without actually being stored.
For example, take an 8GiB disk image that is freshly formatted from a zeroed-out disk, with one or two files written to it that are both a few KiB. The actual image will only take up a few KiB on disk because the OS simply marks the file as being all zero except for a few sections.
Reading the file naively, hexyl would spend several minutes on the zero bytes only to squeeze those sections out of the output anyway. Reading these files would become feasible if hexyl used the OS facilities to query the file for its zero ranges, then used this information to skip reading them unnecessarily.
I should also add: on filesystems that don't support sparse files, the system calls that query for zeroed-out sections will simply report no holes, so an implementation doesn't have to special-case them.
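On Linux, the zero ranges described above can be queried with `lseek(2)` using the `SEEK_DATA` and `SEEK_HOLE` whence values. Below is a minimal, Linux-only sketch (not hexyl's actual code; `data_extents` is a hypothetical helper) that enumerates a file's data extents; on filesystems without hole support, `lseek` falls back to treating the whole file as one data extent, which matches the "implementations don't have to worry" point:

```rust
// Sketch: enumerate the data extents of a file on Linux via lseek(2)
// with SEEK_DATA/SEEK_HOLE. Holes read as zeros but occupy no disk
// space, so a hex dumper could skip them entirely.
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Linux-specific `whence` values (see lseek(2)); declared here to
// avoid pulling in the libc crate.
const SEEK_DATA: i32 = 3;
const SEEK_HOLE: i32 = 4;

extern "C" {
    fn lseek(fd: i32, offset: i64, whence: i32) -> i64;
}

/// Returns the (start, end) byte ranges of the file that contain data.
fn data_extents(file: &File) -> io::Result<Vec<(u64, u64)>> {
    let fd = file.as_raw_fd();
    let len = file.metadata()?.len() as i64;
    let mut extents = Vec::new();
    let mut pos = 0i64;
    while pos < len {
        // Find the next byte that is part of a data region.
        let data = unsafe { lseek(fd, pos, SEEK_DATA) };
        if data < 0 {
            break; // ENXIO: no data past `pos` (trailing hole)
        }
        // Find where that data region ends (the next hole, or EOF).
        let hole = unsafe { lseek(fd, data, SEEK_HOLE) };
        if hole < 0 {
            break;
        }
        extents.push((data as u64, hole as u64));
        pos = hole;
    }
    Ok(extents)
}
```

If the filesystem doesn't support `SEEK_DATA`, the loop simply yields a single extent covering the whole file, so callers degrade gracefully to a full scan.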
Unfortunately, due to the type signature of the `print_all` function, this isn't really possible. To access sparse-file queries on Windows and Linux (I haven't checked macOS), the data type must be a file. However, `print_all` accepts any type that implements `Read`, which makes it impossible to check whether it's reading a file or not. One way to change this would be to make `print_all` accept the `Input` enum, but that would limit any generic usage of `hexyl` to reading only a `File` or `Stdin`.
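One conceivable middle ground between "any `Read`" and "only the `Input` enum" would be a custom trait with a no-op default, so generic readers still work while `File`-backed inputs can opt into hole-skipping. This is purely a hypothetical sketch; `SkipHoles`, `next_data`, and `print_all_sketch` are illustrative names, not hexyl's API:

```rust
// Hypothetical sketch: keep the dump function generic, but over a
// trait whose default implementation does nothing, and which
// File-backed inputs can override with a real lseek-based query.
use std::fs::File;
use std::io::{self, Read, Stdin};

/// A readable input that may additionally know how to skip holes.
trait SkipHoles: Read {
    /// Given the current offset, return the offset of the next data
    /// byte, or `None` if hole-skipping is unsupported for this source.
    fn next_data(&mut self, _offset: u64) -> Option<u64> {
        None // generic readers (pipes, stdin, ...) cannot seek over holes
    }
}

// Plain readers get the no-op default.
impl SkipHoles for Stdin {}

impl SkipHoles for File {
    fn next_data(&mut self, offset: u64) -> Option<u64> {
        // A real implementation would call lseek(fd, offset, SEEK_DATA)
        // here; returning the offset unchanged is a placeholder.
        Some(offset)
    }
}

/// Stand-in for print_all: dumps bytes, consulting next_data() while
/// squeezing. Here it just counts bytes to keep the sketch short.
fn print_all_sketch<R: SkipHoles>(input: &mut R) -> io::Result<u64> {
    let mut buf = Vec::new();
    input.read_to_end(&mut buf)?;
    Ok(buf.len() as u64)
}
```

The design trade-off: downstream users of the library would have to add one empty `impl SkipHoles for TheirReader {}` line, but no existing use case is ruled out.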
> Reading the file naively, hexyl would spend several minutes on the zero bytes only to squeeze those sections out of the output anyway. Reading these files would become feasible if hexyl used the OS facilities to query the file for its zero ranges, then used this information to skip reading them unnecessarily.
@clarfonthey Is this really the case? Can you provide any benchmark/timing results? Ideally, in a reproducible setup.
I mean, it's pretty easy to just run `truncate -s 2G test` and then `hexyl test` to see that while the file takes up no space on disk, it will loop for quite a while.
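For completeness, the same setup can be reproduced from Rust: `File::set_len` extends a file without allocating blocks on sparse-aware filesystems, so the apparent size and the allocated size diverge. A small sketch (`apparent_and_allocated` is a made-up helper name; exact block counts depend on the filesystem):

```rust
// Like `truncate -s <len> <path>`: create a file of the given apparent
// size that occupies (almost) no space on disk, then report both the
// apparent size and the allocated size.
use std::fs::File;
use std::os::unix::fs::MetadataExt; // for blocks()

fn apparent_and_allocated(path: &str, len: u64) -> std::io::Result<(u64, u64)> {
    let f = File::create(path)?;
    f.set_len(len)?; // extends with a hole on sparse-aware filesystems
    let meta = f.metadata()?;
    // st_blocks is counted in 512-byte units regardless of the
    // filesystem's actual block size.
    Ok((meta.len(), meta.blocks() * 512))
}
```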
Sure. I was questioning if it is really taking several minutes or not. In fact, for an 8G sparse file, `hexyl` takes 5.329 s ± 0.042 s (benchmark below).
But okay, the problem is real. If not for 8G, then for 800G. And I trust you that this is a valid use case.
So the next question is: how could we potentially implement this (ignoring for now what @sharifhsn talked about above). Would the idea be to do something like the following?
If we are in squeezing mode (i.e. we already detected a "full line" of zeros), and haven't read any data for X bytes: switch to a special "fast-forward" mode which would call `lseek(2)` with the current position as the `offset` parameter and `whence = SEEK_DATA` to jump over the hole?
(since that call could 'fail' silently and simply return the current position, we would have to make sure to only switch into fast-forward mode once until we continue to find non-zero bytes)
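The guard described above could be sketched roughly as follows (hypothetical code, not hexyl's; `fast_forward` is a made-up helper). The key point is that a `None` result tells the caller to fall back to ordinary reads and not retry until non-zero bytes have been seen again:

```rust
// Sketch of the proposed "fast-forward" step: once squeezing has
// started, try a single lseek(2) with SEEK_DATA. If the call fails or
// does not advance the position (e.g. the filesystem does not support
// holes), report that so the caller stays in normal read mode.
use std::fs::File;
use std::os::unix::io::AsRawFd;

const SEEK_DATA: i32 = 3; // Linux-specific whence value, see lseek(2)

extern "C" {
    fn lseek(fd: i32, offset: i64, whence: i32) -> i64;
}

/// Try to jump from `pos` to the next data region. Returns
/// `Some(new_pos)` only if the position actually advanced; `None`
/// means "keep reading normally".
fn fast_forward(file: &File, pos: u64) -> Option<u64> {
    let new_pos = unsafe { lseek(file.as_raw_fd(), pos as i64, SEEK_DATA) };
    if new_pos > pos as i64 {
        Some(new_pos as u64)
    } else {
        None // error, trailing hole, or no movement: don't retry
    }
}
```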
@clarfonthey Are you aware of any tools that do implement something similar?
Benchmark:
▶ truncate -s 3G test; echo "needle" >> test; truncate -s 8G test
▶ hyperfine --warmup=1 --runs=5 'hexyl test' 'hexdump test'
Benchmark 1: hexyl test
Time (mean ± σ): 5.329 s ± 0.042 s [User: 4.099 s, System: 1.220 s]
Range (min … max): 5.287 s … 5.388 s 5 runs
Benchmark 2: hexdump test
Time (mean ± σ): 16.443 s ± 0.071 s [User: 14.715 s, System: 1.713 s]
Range (min … max): 16.346 s … 16.544 s 5 runs
Summary: 'hexyl test' ran 3.09 ± 0.03 times faster than 'hexdump test'
(xxd had to be excluded since it does not seem to have a squeezing mode... and is therefore extremely slow)
xxd actually has a squeezing mode; it calls it autoskip and exposes it with `-a` and `-autoskip`.
Thank you. I did not know that. `xxd -a test` takes almost 3 minutes (2:57 min), i.e. it's still an order of magnitude slower than both hexdump and hexyl.
Because hexyl already truncates repeating sections in its output, it would be nice if hexyl could quickly skip over these sections instead of scanning them byte by byte.