clarfonthey opened 4 years ago
So this is a request for `hexyl` to be faster?
Closing for now, as it's unclear to me.
Sorry I never got to this. And, well, yes.
All the major filesystems expose the concept of sparse files, where the file is fragmented on disk and entire sections are marked as being zeroes without actually being stored.
For example, take an 8GiB disk image that is freshly formatted from a zeroed-out disk, with one or two files written to it that are both a few KiB. The actual image will only take up a few KiB on disk because the OS simply marks the file as being all zero except for a few sections.
Reading the file naively, hexyl would spend several minutes on the zero bytes only to squeeze those sections out of the output anyway. Reading these files would become feasible if hexyl used the OS facilities to query the file for its zero ranges, then used this information to skip reading them unnecessarily.
I should also add: on filesystems that don't support sparse files, the system calls that query for zeroed-out sections will simply report no holes, so an implementation doesn't have to special-case them.
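On Linux, the zero ranges described above can be queried with `lseek(2)` using the `SEEK_DATA` and `SEEK_HOLE` whence values. Below is a minimal, Linux-only sketch (not hexyl's actual code; `data_extents` is a hypothetical helper) that enumerates a file's data extents; on filesystems without hole support, `lseek` falls back to treating the whole file as one data extent, which matches the "implementations don't have to worry" point:

```rust
// Sketch: enumerate the data extents of a file on Linux via lseek(2)
// with SEEK_DATA/SEEK_HOLE. Holes read as zeros but occupy no disk
// space, so a hex dumper could skip them entirely.
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

// Linux-specific `whence` values (see lseek(2)); declared here to
// avoid pulling in the libc crate.
const SEEK_DATA: i32 = 3;
const SEEK_HOLE: i32 = 4;

extern "C" {
    fn lseek(fd: i32, offset: i64, whence: i32) -> i64;
}

/// Returns the (start, end) byte ranges of the file that contain data.
fn data_extents(file: &File) -> io::Result<Vec<(u64, u64)>> {
    let fd = file.as_raw_fd();
    let len = file.metadata()?.len() as i64;
    let mut extents = Vec::new();
    let mut pos = 0i64;
    while pos < len {
        // Find the next byte that is part of a data region.
        let data = unsafe { lseek(fd, pos, SEEK_DATA) };
        if data < 0 {
            break; // ENXIO: no data past `pos` (trailing hole)
        }
        // Find where that data region ends (the next hole, or EOF).
        let hole = unsafe { lseek(fd, data, SEEK_HOLE) };
        if hole < 0 {
            break;
        }
        extents.push((data as u64, hole as u64));
        pos = hole;
    }
    Ok(extents)
}
```

If the filesystem doesn't support `SEEK_DATA`, the loop simply yields a single extent covering the whole file, so callers degrade gracefully to a full scan.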
Unfortunately, due to the type signature of the `print_all` function, this isn't really possible. To access sparse-file queries on Windows and Linux (I haven't checked macOS), the data type must be a file. However, `print_all` accepts any type that implements `Read`, which makes it impossible to check whether it's reading a file or not. One way to change this would be to make `print_all` accept the `Input` enum, but that would limit any generic usage of `hexyl` to reading only a `File` or `Stdin`.
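One conceivable middle ground between "any `Read`" and "only the `Input` enum" would be a custom trait with a no-op default, so generic readers still work while `File`-backed inputs can opt into hole-skipping. This is purely a hypothetical sketch; `SkipHoles`, `next_data`, and `print_all_sketch` are illustrative names, not hexyl's API:

```rust
// Hypothetical sketch: keep the dump function generic, but over a
// trait whose default implementation does nothing, and which
// File-backed inputs can override with a real lseek-based query.
use std::fs::File;
use std::io::{self, Read, Stdin};

/// A readable input that may additionally know how to skip holes.
trait SkipHoles: Read {
    /// Given the current offset, return the offset of the next data
    /// byte, or `None` if hole-skipping is unsupported for this source.
    fn next_data(&mut self, _offset: u64) -> Option<u64> {
        None // generic readers (pipes, stdin, ...) cannot seek over holes
    }
}

// Plain readers get the no-op default.
impl SkipHoles for Stdin {}

impl SkipHoles for File {
    fn next_data(&mut self, offset: u64) -> Option<u64> {
        // A real implementation would call lseek(fd, offset, SEEK_DATA)
        // here; returning the offset unchanged is a placeholder.
        Some(offset)
    }
}

/// Stand-in for print_all: dumps bytes, consulting next_data() while
/// squeezing. Here it just counts bytes to keep the sketch short.
fn print_all_sketch<R: SkipHoles>(input: &mut R) -> io::Result<u64> {
    let mut buf = Vec::new();
    input.read_to_end(&mut buf)?;
    Ok(buf.len() as u64)
}
```

The design trade-off: downstream users of the library would have to add one empty `impl SkipHoles for TheirReader {}` line, but no existing use case is ruled out.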
> Reading the file naively, hexyl would spend several minutes on the zero bytes only to squeeze those sections out of the output anyway. Reading these files would become feasible if hexyl used the OS facilities to query the file for its zero ranges, then used this information to skip reading them unnecessarily.
@clarfonthey Is this really the case? Can you provide any benchmark/timing results? Ideally, in a reproducible setup.
I mean, it's pretty easy to just run `truncate -s 2G test` and then `hexyl test` to see that while the file takes up no space on disk, it will loop for quite a while.
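For completeness, the same setup can be reproduced from Rust: `File::set_len` extends a file without allocating blocks on sparse-aware filesystems, so the apparent size and the allocated size diverge. A small sketch (`apparent_and_allocated` is a made-up helper name; exact block counts depend on the filesystem):

```rust
// Like `truncate -s <len> <path>`: create a file of the given apparent
// size that occupies (almost) no space on disk, then report both the
// apparent size and the allocated size.
use std::fs::File;
use std::os::unix::fs::MetadataExt; // for blocks()

fn apparent_and_allocated(path: &str, len: u64) -> std::io::Result<(u64, u64)> {
    let f = File::create(path)?;
    f.set_len(len)?; // extends with a hole on sparse-aware filesystems
    let meta = f.metadata()?;
    // st_blocks is counted in 512-byte units regardless of the
    // filesystem's actual block size.
    Ok((meta.len(), meta.blocks() * 512))
}
```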
Sure. I was questioning if it is really taking several minutes or not. In fact, for an 8G sparse file, `hexyl` takes 5.329 s ± 0.042 s (benchmark below).
But okay, the problem is real. If not for 8G, then for 800G. And I trust you that this is a valid use case.
So the next question is: how could we potentially implement this (ignoring for now what @sharifhsn talked about above). Would the idea be to do something like the following?
If we are in squeezing mode (i.e. we already detected a "full line" of zeros), and haven't read any data for X bytes: switch to a special "fast-forward" mode which would call `lseek(2)` with the current position as the `offset` parameter and `whence = SEEK_DATA` to jump over the hole?
(since that call could 'fail' silently and simply return the current position, we would have to make sure to only switch into fast-forward mode once until we continue to find non-zero bytes)
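The guard described above could be sketched roughly as follows (hypothetical code, not hexyl's; `fast_forward` is a made-up helper). The key point is that a `None` result tells the caller to fall back to ordinary reads and not retry until non-zero bytes have been seen again:

```rust
// Sketch of the proposed "fast-forward" step: once squeezing has
// started, try a single lseek(2) with SEEK_DATA. If the call fails or
// does not advance the position (e.g. the filesystem does not support
// holes), report that so the caller stays in normal read mode.
use std::fs::File;
use std::os::unix::io::AsRawFd;

const SEEK_DATA: i32 = 3; // Linux-specific whence value, see lseek(2)

extern "C" {
    fn lseek(fd: i32, offset: i64, whence: i32) -> i64;
}

/// Try to jump from `pos` to the next data region. Returns
/// `Some(new_pos)` only if the position actually advanced; `None`
/// means "keep reading normally".
fn fast_forward(file: &File, pos: u64) -> Option<u64> {
    let new_pos = unsafe { lseek(file.as_raw_fd(), pos as i64, SEEK_DATA) };
    if new_pos > pos as i64 {
        Some(new_pos as u64)
    } else {
        None // error, trailing hole, or no movement: don't retry
    }
}
```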
@clarfonthey Are you aware of any tools that do implement something similar?
Benchmark:
▶ truncate -s 3G test; echo "needle" >> test; truncate -s 8G test
▶ hyperfine --warmup=1 --runs=5 'hexyl test' 'hexdump test'
Benchmark 1: hexyl test
Time (mean ± σ): 5.329 s ± 0.042 s [User: 4.099 s, System: 1.220 s]
Range (min … max): 5.287 s … 5.388 s 5 runs
Benchmark 2: hexdump test
Time (mean ± σ): 16.443 s ± 0.071 s [User: 14.715 s, System: 1.713 s]
Range (min … max): 16.346 s … 16.544 s 5 runs
Summary: 'hexyl test' ran 3.09 ± 0.03 times faster than 'hexdump test'
(xxd had to be excluded since it does not seem to have a squeezing mode... and is therefore extremely slow)
xxd actually has a squeezing mode; it calls it autoskip and exposes it with `-a` and `-autoskip`.
Thank you. I did not know that. `xxd -a test` takes almost 3 minutes (2:57 min), i.e. it's still an order of magnitude slower than both hexdump and hexyl.
Because hexyl already truncates repeating sections in its output, it would be nice if hexyl could quickly skip over these sections instead of scanning them byte by byte.