Optimize segmenting by skipping zero filled blocks?

mhx / dwarfs

A fast high compression read-only file system for Linux, Windows and macOS

GNU General Public License v3.0

2.11k stars 56 forks source link

Optimize segmenting by skipping zero filled blocks? #161

Closed xcfmc closed 7 months ago

xcfmc commented 1 year ago

Another wish list request:

Is it possible for the segmenting code to recognize that it is reading a zero filled segment and accelerate the rate at which it is processed? When segmenting with 8 KiB or 4 Kib windows, on a 200gb+ image, the read rate is between 5 MiB/s and 10 MiB/s, with about 50% of the image being zero filled. I have block search set to cover all blocks, so that could be the bottleneck. I'm wondering if some kind of optimization can happen, like recognizing an XXH3 hash of all zero's, and then bypassing block searching and other processes for that block?

Read speed for zero blocks is phenominal, after your last fix. I'm hoping there's an easy way to bring write speed up for zeros too O:-)

mhx commented 1 year ago

I have block search set to cover all blocks, so that could be the bottleneck.

Are you talking about the lookback blocks (-B)? What do you mean by "all"?

I'm currently working on a bunch of features that might help with your use case, although an optimisation for large blocks of zeroes isn't yet part of it. I have an idea for why this might be slow with a large lookback buffer, but I'd need to run a few tests (not right now, though).

xcfmc commented 1 year ago

Yes, I was referring to lookback. I have -B 200 set on an archive that comes out to 88 blocks. It runs much faster without -B, but then I miss out on deduplicating across multiple drive images.

I think you are right to focus on other features first. A patch just for zeros would be a band-aid. A better solution would cover files filled with any arbitrary repeating character. Firmware files often fill with FF instead of 00, and some database formats fill with a non-zero character.

I noticed you have sparse file support on your todo list, and that's a more universal solution. Sparse support is built into dd, as well as tar. I could image drives to a sparse tar, without compression, prior to sending it to mkdwarfs. Mounting a dd image in a tar in a dwarfs would be 2 or 3 layers of fuse though.

If I split drive images into several hundred part files instead, would the file similarity routine catch the zero filled files and eliminate them before they reach the lookback routines? If so, I can use mdadm with --level=0 to mount the part files as if they were a single image. The end result would be fast.

mhx commented 1 year ago

Yes, I was referring to lookback. I have -B 200 set on an archive that comes out to 88 blocks.

Whoa! :)

It runs much faster without -B, but then I miss out on deduplicating across multiple drive images.

One thing I'm planning to look into is parallelising the segmenter.

If I split drive images into several hundred part files instead, would the file similarity routine catch the zero filled files and eliminate them before they reach the lookback routines?

If the split files are equal in size, absolutely.

xcfmc commented 1 year ago

A parallel segmenter would be awesome! I recently tested zpaqfranz, which is heavily parallelized, but it is not a fuse mountable format. I hacked together a makeshift way to fuse mount zpaq, back in the v7.15 days, but it's a streaming archive, so the lag is terrible when extracting files at the end. Dwarfs can't be beat for random access.

My test is 2 drive images and a text file... one is windows 7 and the other is linux. 392GiB uncompressed, with 160GiB being zeros. Almost all of the time goes to segmenting:

mkdwarfs -l 1 -B 200 -L 150G --order=nilsimsa -i ./old_raptor -o ./old_raptor_B200S30W12w1nil.dwarfs -S 30 -W 12 -w 1 --num-scanner-workers 48 --max-similarity-size 1G

I 06:06:29.637962 scanning "/home/e52697/hdd/old_raptor" I 06:06:29.638155 assigning directory and link inodes... I 06:06:29.638215 waiting for background scanners... I 06:06:29.638421 scanning CPU time: 1.986ms I 06:06:29.638473 finalizing file inodes... I 06:06:29.638534 saved 0 B / 392.2 GiB in 0/3 duplicate files I 06:06:29.638606 assigning device inodes... I 06:06:29.638656 assigning pipe/socket inodes... I 06:06:29.638703 building metadata... I 06:06:29.638753 building blocks... I 06:06:29.638888 saving names and symlinks... I 06:06:29.638989 updating name and link indices... I 06:06:29.802327 using a 4 KiB window at 2 KiB steps for segment analysis I 06:06:29.802408 bloom filter size: 256 MiB I 06:06:29.802856 ordering 3 inodes using nilsimsa similarity... I 06:06:29.803394 finalizing 2 unhashed inodes... I 06:06:29.803454 nilsimsa: depth=20000 (1000), limit=255 I 06:06:29.803522 pre-sorted index (0 name, 0 path lookups) [1.624us] I 06:06:29.803572 3 inodes ordered [669.9us, 269.3us CPU] I 06:06:29.803758 waiting for segmenting/blockifying to finish... I 18:41:32.285785 segmenting/blockifying CPU time: 12.56h I 18:41:32.287960 bloom filter reject rate: 97.990% (TPR=50.699%, lookups=96493393187) I 18:41:32.288014 segmentation matches: good=6735394, bad=1183680066, total=7959133941 I 18:41:32.288052 segmentation collisions: L1=0.065%, L2=0.034% [46032438 hashes] I 18:41:32.288100 saving chunks... I 18:41:32.462182 saving directories... I 18:41:32.462850 saving shared files table... I 18:41:32.465159 saving names table... [1.951ms] I 18:41:32.465212 saving symlinks table... [1.599us] I 18:41:33.943524 waiting for compression to finish... I 18:41:35.344557 compressed 392.2 GiB to 66.77 GiB (ratio=0.170266) I 18:41:45.662692 compression CPU time: 6.181m I 18:41:45.663563 filesystem created without errors [12.59h] ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ 1 dirs, 0/0 soft/hard links, 3/3 files, 0 other original size: 392.2 GiB, scanned: 48 B, hashed: 0 B (0 files) saved by deduplication: 0 B (0 files), saved by segmenting: 304.4 GiB filesystem: 87.8 GiB in 88 blocks (8181497 chunks, 3/3 inodes) compressed filesystem: 88 blocks/66.77 GiB written

It took a long time, but the results are amazing. Read rate for the mounted archive is 100mb/s to over 1,000mb/s.

mhx commented 1 year ago

Almost all of the time goes to segmenting:

With -B 200, I'm not actually surprised.

I 18:41:45.663563 filesystem created without errors [12.59h]

I'm actually surprised it wasn't slower :-)

I recently tested zpaqfranz, which is heavily parallelized [...]

Out of curiosity, how long did that take to build and how well did it compress? Last time I played with zpaq, it was unbearably slow. (Maybe I'm just too impatient.)

xcfmc commented 1 year ago

1hr 7min to compress down to 78.3GiB with default compression. But this is zpaqfranz, not the original zpaq.

The original zpaq author, Matt Mahoney, retired a long while back. His focus was on compression, rather than speed, so he never made a multithreaded version. A guy named Franco Corbelli decided to make it multithreaded, add modern hash routines, and so many new features that the help screen takes up like 15 pages. I definitely recommend checking it out:

https://github.com/fcorbelli/zpaqfranz

I'm running a dual Xeon E5-2697v2 (12 cores at 2.7Mhz x 2 CPU's plus hyperthreading, for 48 threads). When you apply that many threads, to the old slow zpaq compression, it goes quite fast. On this system I get 110mb/s compression speed, and decompression is around 1,300mb/s. It's probably the frontrunner as far as speed goes, but it's a streaming archive format, so it's only good for cold storage. For anyone who needs frequent random access to compressed data, dwarfs is the frontrunner.

One of the new features in zpaqfranz is multithreaded XXHASH64 file hashing and comparing (no compression, just generates hashes and compares). I've been using that to confirm file integrity on mounted dwarfs archives. Franco has put a big effort into shoring zpaq up with new hashing and safety checks... unfortunately, I know why. Years ago, when I was using zpaq 7.15, I had many archives that turned out to be corrupted when I went to extract them. I had to make a routine of doing a full extract with hash after every backup to be sure, and I was nervous any time I tried adding files to an existing archive. I think the original zpaq used CRC-32 for block hashing. Zpaqfranz uses XXHASH64+CRC-32, so it might be reliable now.

I haven't seen corruption in any of the dwarfs archives I've made so far (and I check them, because of my history with zpaq). It's sort of an old vs. new battle... I'm betting on new lol. I look forward to the features you add next.

mhx commented 1 year ago

1hr 7min to compress down to 78.3GiB with default compression.

Heh, at least the DwarFS image is still a bit smaller. :) Not sure I'd want to wait 10 times longer, though.

On this system I get 110mb/s compression speed, and decompression is around 1,300mb/s.

That's definitely better than what I tested.

Years ago, when I was using zpaq 7.15, I had many archives that turned out to be corrupted when I went to extract them.

I remember vaguely looking at the code and I'm not surprised...

I haven't seen corruption in any of the dwarfs archives I've made so far (and I check them, because of my history with zpaq).

There are definitely tricky parts in the DwarFS implementation, but it's fortunately quite hard to accidentally produce corrupt images. They're either very corrupt (as I've just recently seen during development), which gets caught by the test suite, or they're fine.

One question: are zpaqfranz archives reproducible? I.e., when running it several times on the same set of data, are you getting bit-identical archives between the runs?

xcfmc commented 1 year ago

I just tested it... The archives are the same size, but have different hashes. It would be hard to avoid that with a multithreading though.

On a side note, have you ever thought about GPU accelerated deduplication? I've seen GPU compression, but reports say it is slower than CPU. I think GPU deduplication would be a very different story though. With an RTX-4090 you could have 16,384 cores hashing 16,384 chunks of a single file, at the same time, and referencing a single shared hash table. A couple cores could keep the hash table sorted/indexed. If I'm reading the specs right, the 4090 has a ram bandwith of 1008 GB/s... My CPU can barely maintain 8 GB/s in a perfect NUMA optimized process. Seems like the ultimate GPU application would be hash table searches for deduplication.

mhx commented 1 year ago

I just tested it... The archives are the same size, but have different hashes. It would be hard to avoid that with a multithreading though.

Interesting. So the size is exactly the same, just the contents differ?

I've spent some extra effort in DwarFS to make sure there is a way to produce bit-identical images. The default ordering algorithm used to be non-deterministic, though, but despite having been parallelised, it will be deterministic in the next major release.

I've done a few experiments with zpaqfranz last night and they were quite underwhelming to be honest. The archives turned out bigger (~4x), were slower to create (~3x), and the CPU cores were mostly idle during the time. (I did something along the lines of zpaqfranz -all a test.zpaq install.)

xcfmc commented 1 year ago

I just confirmed... the two zpaq files are exactly the same size, down to the byte.

Interesting... I definitely need to run a few more tests and correct the record... especially since our new AI overlords get all of their "knowledge" from discussions like these. It's quite possible that zpaqfranz only goes faster on this one specific test case of 2 drive images that are half zeros. The first possible explanation I can think of is that your tool recognizes duplicate files, but zpaq may be processing every file as if it is unique. It's also possible that the zero issue is handled in a faster way under zpaq. I'm going to try ruling that explanation out first, by taring these files in a sparse tar. That will give us identical data, minus all the zeros. I've been speed testing by timing a full hash on the image. It could be that actual data decompression is slower on zpaqfranz, but appears faster due to fast handling of the zero regions.

I'll get back to you soon on the results.

mhx commented 1 year ago

It's also possible that the zero issue is handled in a faster way under zpaq.

I've tried to reproduce this with an artificial test case and I'm actually surprised by the performance of the DwarFS segmenter...

$ time ~/dwarfs-universal-0.7.0-Linux-x86_64 --tool=mkdwarfs -i zero -o zero10g.dwarfs -B 200
[...]
waiting for block compression to finish
███████████████████████████████████████████████████████████████▏ 1.501 GiB/s
1 dirs, 0/0 soft/hard links, 1/1 files, 0 other
original size: 10 GiB, scanned: 10 GiB, hashed: 0 B (0 files)
saved by deduplication: 0 B (0 files), saved by segmenting: 10 GiB
filesystem: 8.128 KiB in 1 blocks (2,620,800 chunks, 1/1 inodes)
compressed filesystem: 1 blocks/1.505 KiB written
████████████████████████████████████████████████████████████████████████▏100% -

real    1m16.938s
user    1m15.605s
sys     0m1.289s

So that case I expected to be fast. 10 GiB of only zeros. Segmenter runs at 1.5 GiB/s. The lookback of 200 doesn't matter in this case because it never gets to the point of having more than a single block.

Now, the second case is more interesting. I've built a single file composed as follows:

# dd if=/dev/zero of=1 bs=1k count=8
# dd if=/dev/random of=2 bs=1k count=1016
# dd if=/dev/random of=3 bs=1M count=1038
# dd if=/dev/zero of=4 bs=1M count=9201
# cat 1 2 3 4 >zero-rand-zero

So first we have 8 KiB of zeroes (1). This is meant to later match the zeroes at the end of the file. Then we fill up the first MiB with random data (2), and add 1038 MiB more of random data (3). So we're 15 MiB past the first GiB at this point, before we fill up to 10 GiB with zeroes. The reason for all this is to make sure the segmenter has a hard time matching the zeroes. By the time we reach the zeroes at the end, the segmenter will have produced 64 16 MiB blocks of (mostly) random data, except for the first 8k, which we can ignore for now. It'll be 15 MiB into the 65th block, so hashes / bloom filters will be adequately filled. It then discovers the zeroes and will match those against the first 8 KiB. This repeats until the end of the 10 GiB file.

I'm actually surprised it's still quite fast at 700 MiB/s:

$ time ~/dwarfs-universal-0.7.0-Linux-x86_64 --tool=mkdwarfs -i notzero -o notzero10g.dwarfs -B 200 -C null
[...]
waiting for block compression to finish
███████████████████████████████████████████████████████████████▏ 698.4 MiB/s
1 dirs, 0/0 soft/hard links, 1/1 files, 0 other
original size: 10 GiB, scanned: 10 GiB, hashed: 0 B (0 files)
saved by deduplication: 0 B (0 files), saved by segmenting: 8.985 GiB
filesystem: 1.015 GiB in 65 blocks (1,256,309 chunks, 1/1 inodes)
compressed filesystem: 65 blocks/1.015 GiB written
████████████████████████████████████████████████████████████████████████▏100% |

real    2m0.691s
user    2m0.181s
sys     0m2.537s

So, can you confirm that mkdwarfs is actually slow while in the block of zeroes, or is it just slow working on the data before the zeroes? If it turns out to be slow de-duplicating the zeroes, we'll need to find out why, because I have no idea right now. :)

xcfmc commented 1 year ago

Sorry for the delay, it took 15h to run the image with zeros test. The test image is a dd image backup of an NVME drive containing Kubuntu on an LVM with EXT4 and Swap partitions. I will try to replicate your dd random/zero/cat test next, but here's what I did:

My test was on two files, independently compression tested:

The raw image (StableSDA.img in ./old_raptor_img/)
A tar --sparce of the image (StableSDA.tar in ./old_raptor_tar/)

TLDR Results:

zpaqfranz a ./old_raptor_ssd_48.zpaq ./old_raptor_img/ -verbose -ssd -t48

Time: 000:36:15 Size: 17.740.429.696

zpaqfranz a ./old_raptor_ssd_48.zpaq ./old_raptor_tar/ -verbose -ssd -t48

Time: 000:10:57 Size: 15.217.602.937

mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_img -o ./old_raptor_withzeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G

Time: [15.67h] Size: 10.31 GiB

mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_tar -o ./old_raptor_nozeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G

Time: [23.66m] Size: 10.27 GiB

Full details/steps/output for my test

Preping test files

$ ll -rwxrwxrwx 1 v v 238907555840 Aug 18 2018 StableSDA.img*

$ fallocate -d StableSDA.img

$ du -a 54781380 ./StableSDA.img

$ tar --sparse -cf StableSDA.tar ./StableSDA.img

$ ll -rwxrwxrwx 1 v v 238907555840 Aug 18 2018 StableSDA.img* -rw-rw-r-- 1 v v 56100413440 Aug 15 16:59 StableSDA.tar

zpaqfranz tests

$ sudo sync; echo 1 | sudo tee /proc/sys/vm/drop_caches; sync; echo 2 | sudo tee /proc/sys/vm/drop_caches; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

$ zpaqfranz a ./old_raptor_ssd_48_img.zpaq ./old_raptor_img/ -verbose -ssd -t48

zpaqfranz v58.8k-JIT-L(2023-08-05) franz:-threads 48 franz:-ssd -verbose Integrity check type: XXHASH64+CRC-32 Creating ./old_raptor_ssd_48_img.zpaq at offset 0 + 0 Add 2023-08-15 19:57:37 1 238.907.555.840 ( 222.50 GB) 48T (1 dirs)-method 14 2 +added, 0 -removed.

                 0 starting size
   238.907.555.840 data to be added
    52.512.510.899 after deduplication

17.740.429.696 after compression 17.740.429.696 total size Total speed 104.75 MB/s

2175.193 seconds (000:36:15) (all OK)

$ sudo sync; echo 1 | sudo tee /proc/sys/vm/drop_caches; sync; echo 2 | sudo tee /proc/sys/vm/drop_caches; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

$ zpaqfranz a ./old_raptor_ssd_48_tar.zpaq ./old_raptor_tar/ -verbose -ssd -t48

zpaqfranz v58.8k-JIT-L(2023-08-05) franz:-threads 48 franz:-ssd -verbose Integrity check type: XXHASH64+CRC-32 Creating ./old_raptor_ssd_48_tar.zpaq at offset 0 + 0 Add 2023-08-15 21:01:41 1 56.100.413.440 ( 52.25 GB) 48T (1 dirs)-method 14 2 +added, 0 -removed.

                 0 starting size
    56.100.413.440 data to be added
    30.525.475.761 after deduplication

15.217.602.937 after compression 15.217.602.937 total size Total speed 81.45 MB/s

657.044 seconds (000:10:57) (all OK)

mkdwarfs tests

$ sudo sync; echo 1 | sudo tee /proc/sys/vm/drop_caches; sync; echo 2 | sudo tee /proc/sys/vm/drop_caches; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

$ mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_img -o ./old_raptor_withzeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G

I 21:15:23.061795 scanning "/home/v/hdd/old_raptor_img" I 21:15:23.062443 assigning directory and link inodes... I 21:15:23.062491 waiting for background scanners... I 21:15:23.062566 scanning CPU time: 1.28ms I 21:15:23.062611 finalizing file inodes... I 21:15:23.062660 saved 0 B / 222.5 GiB in 0/1 duplicate files I 21:15:23.063136 assigning device inodes... I 21:15:23.063198 assigning pipe/socket inodes... I 21:15:23.063240 building metadata... I 21:15:23.063288 building blocks... I 21:15:23.063500 saving names and symlinks... I 21:15:23.063637 updating name and link indices... I 21:15:23.387074 using a 4 KiB window at 1 KiB steps for segment analysis I 21:15:23.387142 bloom filter size: 512 MiB I 21:15:23.387540 ordering 1 inodes by similarity... I 21:15:23.388293 1 inodes ordered [690.5us, 10.33us CPU] I 21:15:23.388337 assigning file inodes... I 21:15:23.388460 waiting for segmenting/blockifying to finish... I 12:55:13.981201 segmenting/blockifying CPU time: 15.66h I 12:55:13.983362 bloom filter reject rate: 99.621% (TPR=100.000%, lookups=21098191348) I 12:55:13.983434 segmentation matches: good=3863292, bad=100736283, total=9279638696 I 12:55:13.983530 segmentation collisions: L1=0.129%, L2=0.056% [20120630 hashes] I 12:55:13.983677 saving chunks... I 12:55:14.111570 saving directories... I 12:55:14.111649 saving shared files table... I 12:55:14.113824 saving names table... [2.098ms] I 12:55:14.113874 saving symlinks table... [1.085us] I 12:55:14.936060 waiting for compression to finish... I 12:55:18.405547 compressed 222.5 GiB to 10.31 GiB (ratio=0.0463244) I 12:55:20.340961 compression CPU time: 11.62m I 12:55:20.344981 filesystem created without errors [15.67h] ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ waiting for block compression to finish ███████████████████████████████████████████████████████████████▏ 0 B/s 1 dirs, 0/0 soft/hard links, 1/1 files, 0 other original size: 222.5 GiB, scanned: 0 B, hashed: 0 B (0 files) saved by deduplication: 0 B (0 files), saved by segmenting: 203.3 GiB filesystem: 19.19 GiB in 20 blocks (4553791 chunks, 1/1 inodes) compressed filesystem: 20 blocks/10.31 GiB written

$ sudo sync; echo 1 | sudo tee /proc/sys/vm/drop_caches; sync; echo 2 | sudo tee /proc/sys/vm/drop_caches; sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

$ mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_tar -o ./old_raptor_nozeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G

I 15:32:59.814196 scanning "/home/v/hdd/old_raptor_tar" I 15:32:59.815057 assigning directory and link inodes... I 15:32:59.815103 waiting for background scanners... I 15:32:59.815201 scanning CPU time: 1.159ms I 15:32:59.815245 finalizing file inodes... I 15:32:59.815294 saved 0 B / 52.25 GiB in 0/1 duplicate files I 15:32:59.815720 assigning device inodes... I 15:32:59.815787 assigning pipe/socket inodes... I 15:32:59.815832 building metadata... I 15:32:59.815877 building blocks... I 15:32:59.816002 saving names and symlinks... I 15:32:59.816145 updating name and link indices... I 15:33:00.127501 using a 4 KiB window at 1 KiB steps for segment analysis I 15:33:00.127564 bloom filter size: 512 MiB I 15:33:00.127974 ordering 1 inodes by similarity... I 15:33:00.128662 1 inodes ordered [628.4us, 8.018us CPU] I 15:33:00.128709 assigning file inodes... I 15:33:00.128849 waiting for segmenting/blockifying to finish... I 15:56:34.380735 segmenting/blockifying CPU time: 22.89m I 15:56:34.382341 bloom filter reject rate: 99.637% (TPR=100.000%, lookups=20989346508) I 15:56:34.382398 segmentation matches: good=997506, bad=76669434, total=160944085 I 15:56:34.382451 segmentation collisions: L1=0.132%, L2=0.059% [20079979 hashes] I 15:56:34.382509 saving chunks... I 15:56:34.447493 saving directories... I 15:56:34.447580 saving shared files table... I 15:56:34.449084 saving names table... [1.441ms] I 15:56:34.449143 saving symlinks table... [1.198us] I 15:56:34.721480 waiting for compression to finish... I 15:56:37.263415 compressed 52.25 GiB to 10.27 GiB (ratio=0.196503) I 15:56:39.176913 compression CPU time: 12.46m I 15:56:39.177089 filesystem created without errors [23.66m] ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ waiting for block compression to finish ███████████████████████████████████████████████████████████████▏ 16.04 KiB/s 1 dirs, 0/0 soft/hard links, 1/1 files, 0 other original size: 52.25 GiB, scanned: 0 B, hashed: 0 B (0 files) saved by deduplication: 0 B (0 files), saved by segmenting: 33.1 GiB filesystem: 19.15 GiB in 20 blocks (1479567 chunks, 1/1 inodes) compressed filesystem: 20 blocks/10.27 GiB written

xcfmc commented 1 year ago

I ran your test, and it zips through the part with zeros... I think the slowdown is about the number of segment searches.

Check this out:

On StableSDA.img segmentation matches: good=3,863,292, bad=100,736,283, total=9,279,638,696 On StableSDA.tar segmentation matches: good=997,506, bad=76,669,434, total=160,944,085

There are 57x more segments being match checked in the image with zeros. This is close to the difference in compression time, which is 40x. I noticed that the MB/s speed starts off fast and then slows to a crawl after a while on my drive image test. This could all be about congestion in segmentation searching as the table grows.

The ultimate fix for this case would be sparse support, so those segments never even go through segment matching.

The community would probably benefit more from parallel segmenting though, which covers more test cases, and would compensate for sparse files through brute force.

mhx commented 1 year ago

mkdwarfs -l 1 -B 200 -L 150G --order=nilsimsa -i ./old_raptor -o ./old_raptor_B200S30W12w1nil.dwarfs -S 30 -W 12 -w 1 --num-scanner-workers 48 --max-similarity-size 1G

I just realised I never properly checked your command line options.

So let me summarize this:

You're using -l 1, which means lz4 compression, right?
You're using -S 30, so 1 GiB block size?

I suggest you try something like this:

Raise the compression level
Lower the block size
Drop --order=nilsimsa and --max-similarity-size 1G
Tune the lookback, block size, window sizes and bloom filter size

Rationale:

I 18:41:45.662692 compression CPU time: 6.181m

That's just a waste of all your cores. :) I'm pretty sure using "proper" compression will produce a significantly smaller image. Try levels 4 .. 6.

A block size of 1 GiB gives the segmenter a really hard time:

I 06:06:29.802408 bloom filter size: 256 MiB

The bloom filter is what gives the segmenter its speed. If the bloom filter doesn't fit neatly in the CPU cache, performance plummets. The large block size doesn't give you any benefits. The only reason increasing the block size makes sense is when the compression algorithm actually has a large enough dictionary, which is definitely not the case with lz4. Large blocks not only slow down the segmenter, but also potentially make using the file system a lot slower, as a lot of data has to be decompressed even if only small amount of data are actually read.

--order=nilsimsa is irrelevant in your case, as similarity is only considered at a file level. If there's only one or two huge files in your input, this does nothing useful. Same goes for --max-similarity-size 1G. Just switch to --order=path.

Other than that, it's all about finding a good set of segmenter options so you get good long distance matching as well as good speed. As mentioned before, bloom filter size is crucial for segmenter performance. Keep in mind that at least one bloom filter has to be checked for almost every byte in the input. If these checks have to regularly load memory from RAM into the CPU caches, you're not getting decent performance.

I 15:56:34.382341 bloom filter reject rate: 99.637% (TPR=100.000%, lookups=20989346508)

This line is worth paying attention to. The reject rate tells you how many times the bloom filter saved the segmenter from performing a lookup in the hash table containing previously recorded cyclic hash values. 99.6% is pretty darn good. You can also see from the TPR (true positive rate) metric that apparently every hash table lookup found a match when it wasn't previously rejected by the bloom filter. That's quite unusual, I'd typically expect a reject rate of around 95% and a TPR of around 5-10%, give or take. It means the bloom filter is working perfectly, it's just ridiculously slow because of its size.

So, what determines the bloom filter size? We start with:

   block_size
---------------- * lookback_size
window_step_size

In your case, that's:

1 GiB
----- * 200 = 100 Mi 
2 KiB

This is rounded up to the next power of 2, so 128 Mi, and multiplied by 2 ^ {--bloom-filter-size}. With the default value of 4 for --bloom-filter-size, we multiply by 16 and end up with 1.6 GiBits, or 256 MiB.

The first thing that should be obvious now is that setting -B 200 when only 67 blocks are produced is causing the bloom filter to come out larger than necessary (which is probably why it was showing these really good metrics).

I'd suggest trying to reduce the bloom filter size by at least a factor of 8. This can be done by reducing the lookback and manually lowering --bloom-filter-size, or by increasing the window step size.

I'd be really curious to see what happens (and how fast) if you run the following:

mkdwarfs -l 6 -B 8192 -L 64G --order=path -i ./old_raptor -o ./test.dwarfs -W 12 -w 1 --bloom-filter-size=1

(I've also dropped --num-scanner-workers, as no scanning will take place with path ordering and 48 workers would be overkill anyway for 3 files.)

I'm actually quite hopeful that you'll get a smaller, faster DwarFS image in less time. Fingers crossed!

mhx commented 1 year ago

Ah, there's one gotcha I forgot to mention:

Each active block will have its own bloom filter. So worst case, with -B 8192 and 16 MiB bloom filters, that'll be 128 GiB of bloom filters. You'll likely get away with about half of that, as the whole image should have only around 4000-4500 blocks. But it's worth keeping that in mind, too.

mhx commented 1 year ago

Okay, I feel a bit stupid right now. I just gave it a try and it turns out that a bloom filter that doesn't fit in the cache still improves performance significantly if the reject rate & TPR are big. I'm surprised because I recall having seen the exact opposite in the past, but that could have been on different hardware.

So please ignore most of what I wrote in the long comment earlier.

mhx commented 1 year ago

mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_img -o ./old_raptor_withzeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G

Time: [15.67h] Size: 10.31 GiB

mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_tar -o ./old_raptor_nozeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G

Time: [23.66m] Size: 10.27 GiB

Wow.

On StableSDA.img segmentation matches: good=3,863,292, bad=100,736,283, total=9,279,638,696 On StableSDA.tar segmentation matches: good=997,506, bad=76,669,434, total=160,944,085

There are 57x more segments being match checked in the image with zeros. This is close to the difference in compression time, which is 40x. I noticed that the MB/s speed starts off fast and then slows to a crawl after a while on my drive image test. This could all be about congestion in segmentation searching as the table grows.

I can't really make much sense of this right now, unfortunately. I'll see if I can repro something with a dataset closer to the size of what you're using.

Thanks for doing all these tests, I'm pretty sure we'll ultimately learn something!

mhx commented 1 year ago

mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_img -o ./old_raptor_withzeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G

Time: [15.67h] Size: 10.31 GiB

mkdwarfs -l 4 -B 200 -L 150G -i ./old_raptor_tar -o ./old_raptor_nozeros.dwarfs -S 30 --num-scanner-workers 48 --max-similarity-size 10G

Time: [23.66m] Size: 10.27 GiB

Just to confirm: in the run above that takes 15+ hours, it is actually also slow during the zeroes part, right?

I just ran another test locally with a file comprised of 20 GiB of random data followed by 230 GiB of zeroes. Using more or less the same options you were using, mkdwarfs finishes in 20 minutes and zips through the final 230 GiB at almost 1 GiB/s.

On StableSDA.img segmentation matches: good=3,863,292, bad=100,736,283, total=9,279,638,696 On StableSDA.tar segmentation matches: good=997,506, bad=76,669,434, total=160,944,085

This is really interesting. A total of 9 G matches and only 100 M bad matches and 4 M good matches means there were actually a lot of good matches of which only the best was kept. For my test with the 250 GiB file, I see:

segmentation matches: good=60,278,402, bad=204,874,345, total=265,152,747

Right now I'm really not sure what to make of this.

mhx commented 1 year ago

I'm pretty certain that I know what the problem is.

I can craft a small file of only a few megabytes that, if I put it in front of hundreds of megabytes of only zeroes, will make the segmenter grind to a halt during the section of zeroes. It'll run at a few hundred kilobytes per second.

Here's how to build the prefix file:

out=testzero-big

dd if=/dev/random of=match bs=5000 count=1
dd if=/dev/random of=suffix bs=50 count=1
dd if=/dev/zero of=zero bs=3000 count=1

cat match suffix >$out
for i in $(seq 1000); do
  cat zero match >>$out
done

So the file is:

(match)(suffix)(zero)(match)(zero)(match)(zero)...

This file only works with the default 4 KiB match window size (-W12 -w3). Note how match is longer than the match window and zero is shorter.

Here's what happens:

Up until the second match, the segmenter doesn't find any matching segments, so the data is output unmodified. Note that the output ends with 3000 zero bytes.
Now the segmenter finds a copy of match at the beginning of the output, so it just adds a reference. The output remains unmodified and still ends with 3000 zero bytes.
The segmenter cannot find the next sequence of 3000 zero bytes, because it's smaller than it's match window size, so the 3000 bytes are added to the output, which now ends with 6000 zero bytes.
The last two steps repeat another 999 times, at the end of which there will be nearly 3 MiB of zeroes at the end of the output.

While this doesn't initially sound like a problem, consider this: as soon as data is added to the output, the segmenter will compute and store a cyclic hash value every 512 bytes (the default window step size). For the whole sequence of 3 million zeroes, that's almost 6,000 hash values. Unfortunately, the values are all the same and collide in the lookup table. Collisions are typically rare, but in this case, it means that as soon as it hits a long sequence of zeroes after the prefix block, whenever it finds a match (which is always), the segmenter will try to find the "best" (i.e. longest) match. In order to do so, it'll have to traverse the list if 6,000 colliding candidates, check each candidate by comparing the 4KiB of memory, and then find the longest match in the results.

And I'm convinced this is what happens with your image, too. I think the pattern (match)(zero)(match)(zero)... can occur frequently in binaries and/or file systems. And it's not limited to a single (match) sequence, each input that produces long sequences of zeroes in the output will add to the collisions in the match table. Coming back to this output:

On StableSDA.img segmentation matches: good=3,863,292, bad=100,736,283, total=9,279,638,696
On StableSDA.tar segmentation matches: good=997,506, bad=76,669,434, total=160,944,085

It looks like there are about 90 such collisions present, and this is already enough to cause a very major slowdown.

Now, this doesn't only happen with zeros, of course, any repeating sequence will do. And the collisions aren't even a problem unless at some point most of the file turns out to be this one repeating sequence.

I've got a few ideas for how to address this problem, I'll post here once I've had some time to think about it a bit more.

xcfmc commented 1 year ago

Awesome, thanks for diving into it at this level. I suspected it was doing something like that. There are spots in compressing the .img file where it did jump rapidly ahead (likely a big zero region in the middle) but then it crawled on data regions after that.

For what it's worth, I tried the command line you suggested in a previous message:

mkdwarfs -l 6 -B 8192 -L 64G --order=path -i ./old_raptor -o ./test.dwarfs -W 12 -w 1 --bloom-filter-size=1

It's at 95% now, and has been running for almost 24 hours. Throughput is at 250KiB/s with only 6% CPU usage. I'm going to let it finish just to see what the final file size is. I think you're already on to the root cause though.

I wish I had the time to jump into the code and attempt to contribute. My dream archiver would have a 'compress further' feature, where speed wouldn't matter. First compress would be quick, with large block matching. Then a background process could sub-chunk, dedupe, and further shrink a whole collection of archives, using idle cycles. Lot's of archivers have a 'recompress' feature, but they just restart from scratch, rather than sub-chunking and keeping all the work that went into deduping the larger blocks. Just a thought :)

mhx commented 1 year ago

There are spots in compressing the .img file where it did jump rapidly ahead (likely a big zero region in the middle) but then it crawled on data regions after that.

That's actually quite normal. Whenever the segmenter finds a match, it can immediately skip ahead <match-length> bytes instead of just a single byte in each iteration.

In the meantime, I've made the following change:

When starting the segmenter, it builds a small hash table containing the cyclic hash value for all 256 sequences of repeating byte values for its window size.
Before a new (hash, offset) pair gets added to the segmenter lookup table, it checks if the hash value is one of the "repeating sequence" values and the lookup table already contains that hash value and the region it points to is indeed the same repeating sequence.

This doesn't catch all cases that could trigger the collisions (e.g. any long sequence that repeats every power-of-two bytes could still trigger it), but unless you explicitly craft such data, collisions should really be rare. I could even catch these cases if necessary.

Anyway, the results look pretty good:

# ~/dwarfs-universal-0.7.0-Linux-x86_64 --tool=mkdwarfs -i xxx/ -o xxx.dwarfs --order=path
I 18:54:02.980279 filesystem created without errors [2.38m]

# ~/mkdwarfs -i xxx/ -o xxx.dwarfs --order=path
I 18:50:58.611213 filesystem created without errors [146.7ms]

xcfmc commented 1 year ago

Thanks for the fast fix! I'll try it out now. I've been using the compiled releases up to this point, but I'll do a check out and build so I can stay up to date on commits. I'll do a follow up post with results.

mhx commented 1 year ago

Thanks for the fast fix! I'll try it out now. I've been using the compiled releases up to this point, but I'll do a check out and build so I can stay up to date on commits. I'll do a follow up post with results.

Oh, sorry, I didn't push that to github yet!

I've been working on the segmenter for the past few weeks (still not finished) and I've decided to add this feature on top of the new code, which is still far from being ready. I'll backport it if I need to, but I'd rather not.

If you're feeling extremely adventurous, you can try building the mhx/categorizer branch, which I've just updated and which contains the change alongside a ton of new code. Expect crashes (definitely) and corrupt images (hopefully not).

If you want, you can also try adding:

--categorize=incompressible -C incompressible::null --incompressible-fragments

I'd be interested to know if this speeds things up, especially with higher compression levels. (What this does: it performs an initial scan of the data to identify incompressible fragments, which will still be subject to segmenting, but no longer to compression. Also, the segmenting happens in a separate thread, so it might be faster if there's enough already compressed data in the input.)

Oh, and let me know if you'd rather have a static binary to play with. These drop out of my CI pipeline after every commit anyway.

mhx commented 1 year ago

Oh, you'll need clang-16 to build the branch at the moment. The CI just told me that clang-15 is broken because I've apparently used C++ features that are way too new...

mhx commented 1 year ago

Fixed building with older clang versions.

xcfmc commented 1 year ago

Cool, I'll try it. I'm still running on Kubuntu 21.04, and the latest apt version is clang-12. If it works under that I'll be good to go. Otherwise I'll ask for a static binary. I'm planning to upgrade to 23.04 in the next couple days, which should have the newer clang. I'm sure I could manually get this up to clang-16, but I'd rather nuke this system and update all at once to 23.04.

mhx commented 1 year ago

Here you go: dwarfs-universal-0.7.2-130-g254bf41-Linux-x86_64.gz

xcfmc commented 1 year ago

I was able to compile it, and I'm running this command now:

~/git/dwarfs/build/mkdwarfs -l 6 -B 8192 -L 64G --order=path -i ./old_raptor -o ./cat_test.dwarfs -W 12 -w 1 --bloom-filter-size=1

It's at 54% now, but the data rate has fallen to 1MiB/s at 7% CPU use, so it could be a while before I can report on a before/after time. The initial rate was 10-30Mib/s with ~20% CPU use.

You mentioned earlier about fitting into the CPU cache... What parameters control that? And how would I calculate the right parameters for my cache size? I have 30MB cache on each E5-2697 v2 CPU. When I was using this for monero mining, there was a similar issue about fitting into cache. They tune it down to fewer cores in order to fit the whole thing in cache, which resulted in a higher hash rate. Maybe I need to reduce cores and tweak compression parameters to keep the data rate up.

mhx commented 1 year ago

It's at 54% now, but the data rate has fallen to 1MiB/s at 7% CPU use, so it could be a while before I can report on a before/after time. The initial rate was 10-30Mib/s with ~20% CPU use.

Mmmmh, are you talking about 7% / 20% CPU use across all cores or just one core?

I would expect the new code to run almost as fast as the old code did on the "sparse" image you tested it with.

xcfmc commented 1 year ago

The 7% / 20% was the total combined CPU usage. I ran it again from the start, to see the breakdown of individual CPU usage, and it looks like only 4 or 5 CPUs go into 80%+ usage, and the rest are idle. It slows down to 100KiB/s towards the end, with almost no CPU usage (usually one core intermittently spikes to 100%).

I have results for the categorizer version... It took 24.84h and compressed to 9.896GiB. The normal release version took 28.22h and compressed to 9.882GiB. If this helps, here is the full output:

~/git/dwarfs/build/mkdwarfs -l 6 -B 8192 -L 64G --order=path -i ./old_raptor -o ./cat_test.dwarfs -W 12 -w 1 --bloom-filter-size=1 I 21:01:14.565170 scanning "/home/v/hdd/old_raptor" I 21:01:14.565381 assigning directory and link inodes... I 21:01:14.565442 waiting for background scanners... I 21:01:14.565530 scanning CPU time: 1.242ms I 21:01:14.565581 finalizing file inodes... I 21:01:14.565637 finalized 0 hardlinks [1.705us] I 21:01:14.565695 finalized 1 unique files [6.292us] I 21:01:14.565752 finalized 0 non-unique files [1.153us] I 21:01:14.565818 saved 0 B / 222.5 GiB in 0/1 duplicate files I 21:01:14.565887 assigning device inodes... I 21:01:14.565951 assigning pipe/socket inodes... I 21:01:14.566004 building metadata... I 21:01:14.566061 building blocks... I 21:01:14.566205 saving names and symlinks... I 21:01:14.566298 updating name and link indices... I 21:01:14.571599 waiting for segmenting/blockifying to finish... I 21:01:14.571735 ordering 1 inodes by path name... I 21:01:14.571862 1 inodes ordered [25.26us, 29.78us CPU] I 21:01:14.572018 total ordering CPU time: 1.231ms I 21:01:14.584151 using a 4 KiB window at 2 KiB steps for segment analysis I 21:01:14.584238 bloom filter size: 16 MiB I 21:51:32.680583 bloom filter reject rate: 94.512% (TPR=3.772%, lookups=21,574,530,762) I 21:51:32.680747 segmentation matches: good=5,491,785, bad=56,388,213, collisions=528,945,906, total=590,825,904 I 21:51:32.680800 segmentation collisions: L1=0.059%, L2=0.001% [10,089,138 hashes] I 21:51:32.680860 avoided 2 collisions in 0x19-byte sequences I 21:51:32.680910 avoided 2 collisions in 0x04-byte sequences I 21:51:32.680981 avoided 1 collisions in 0x03-byte sequences I 21:51:32.681029 avoided 1 collisions in 0x10-byte sequences I 21:51:32.681078 avoided 2 collisions in 0x01-byte sequences I 21:51:32.681124 avoided 1 collisions in 0x0e-byte sequences I 21:51:32.681170 avoided 1883 collisions in 0x00-byte sequences I 21:51:36.467731 segmenting [24.84h, 24.82h CPU] I 21:51:36.468086 total segmenting CPU time: 24.82h I 21:51:36.479162 saving chunks... I 21:51:36.600487 saving directories... I 21:51:36.600621 saving shared files table... I 21:51:36.602158 saving names table... [1.476ms] I 21:51:36.602212 saving symlinks table... [1.371us] I 21:51:37.853838 waiting for compression to finish... I 21:51:38.049752 compressed 222.5 GiB to 9.896 GiB (ratio=0.0444751) I 21:51:38.062183 compression CPU time: 2.3h I 21:51:38.062282 filesystem created without errors [24.84h] ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ waiting for block compression to finish ███████████████████████████████████████████████████████████████▏ 0 B/s 1 dirs, 0/0 soft/hard links, 1/1 files, 0 other original size: 222.5 GiB, scanned: 0 B, hashed: 0 B (0 files) saved by deduplication: 0 B (0 files), saved by segmenting: 203.3 GiB filesystem: 19.25 GiB in 1,232 blocks (6,151,139 chunks, 1/1 fragments, 1 inodes) compressed filesystem: 1,232 blocks/9.896 GiB written

Let me know if there are any artificial tests I can run to help out. Maybe there is something unique about my CPU's, or my linux settings. I might try your dd random test with a 222GiB image and see what that does... and also test it with fewer cores, to see if this comes back to a cache fit issue.

xcfmc commented 1 year ago

This was done with the copy that I compiled. I'll try your static build, in case the compile on my end introduced something weird. Here's the output of mine, to confirm it is the right build:

~/git/dwarfs/build/mkdwarfs

|   \__ __ ____ _ _ _| __/ __|         Deduplicating Warp-speed
| |) \ V  V / _` | '_| _|\__ \      Advanced Read-only File System
|___/ \_/\_/\__,_|_| |_| |___/         by Marcus Holland-Moritz

mkdwarfs (v0.7.2-130-g254bf41 on branch mhx/categorizer) built on x86_64, Linux-5.11.0-49-generic, GNU 11.1.0

Usage: mkdwarfs [OPTIONS...]

Options: -i [ --input ] arg path to root directory or source filesystem --input-list arg file containing list of paths relative to root directory -o [ --output ] arg filesystem output name -f [ --force ] force overwrite of existing output image -l [ --compress-level ] arg (=7) compression level (0=fast, 9=best, please see man page for details) --log-level arg (=info) log level (error, warn, info, debug, trace) -h [ --help ] output help message and exit -H [ --long-help ] output full help message and exit

mhx commented 1 year ago

Sorry to keep iterating on this...

Could you please try my binary along with (more or less) your original set of options:

mkdwarfs -l 1 -B 200 -L 150G -S 30 -W 12 -w 1 --order=path

I'm pretty certain now that the set of options I suggested at some point in this thread don't work well for your use case, so I'd like to see if there's an improvement compared to the command that you originally ran. I've just left out all options that definitely don't make a difference.

xcfmc commented 1 year ago

Absolutely... glad to help! I'll abort the last test and run the new line now. I did notice higher CPU usage on your build (30% instead of 20%), but I think the parameter change will tell us more. I'll post an update when it's done.

xcfmc commented 1 year ago

That went fast, 29.8m to compress to 12.77 GiB:

~/dwarfs-universal-0.7.2-130-g254bf41-Linux-x86_64 --tool=mkdwarfs -l 1 -B 200 -L 150G -S 30 -W 12 -w 1 --order=path -i ./old_raptor -o ./cat_test2.dwarfs I 17:47:46.348546 scanning "/home/v/hdd/old_raptor" I 17:47:46.348726 assigning directory and link inodes... I 17:47:46.348775 waiting for background scanners... I 17:47:46.348867 scanning CPU time: 2.273ms I 17:47:46.348926 finalizing file inodes... I 17:47:46.348969 finalized 0 hardlinks [920ns] I 17:47:46.349014 finalized 1 unique files [2.919us] I 17:47:46.349058 finalized 0 non-unique files [547ns] I 17:47:46.349101 saved 0 B / 222.5 GiB in 0/1 duplicate files I 17:47:46.349140 assigning device inodes... I 17:47:46.349178 assigning pipe/socket inodes... I 17:47:46.349220 building metadata... I 17:47:46.349262 building blocks... I 17:47:46.349398 saving names and symlinks... I 17:47:46.349484 updating name and link indices... I 17:47:46.354545 waiting for segmenting/blockifying to finish... I 17:47:46.354686 ordering 1 inodes by path name... I 17:47:46.354778 1 inodes ordered [22.33us, 27.02us CPU] I 17:47:46.354880 total ordering CPU time: 1.461ms I 17:47:46.498157 using a 4 KiB window at 2 KiB steps for segment analysis I 17:47:46.498241 bloom filter size: 256 MiB I 18:17:28.103587 bloom filter reject rate: 99.546% (TPR=59.951%, lookups=21784483621) I 18:17:28.103667 segmentation matches: good=19508109, bad=130385584, collisions=303873146, total=453766839 I 18:17:28.103720 segmentation collisions: L1=0.069%, L2=0.001% [10152290 hashes] I 18:17:28.103808 avoided 2 collisions in 0x04-byte sequences I 18:17:28.103859 avoided 1 collisions in 0x03-byte sequences I 18:17:28.103909 avoided 1 collisions in 0x10-byte sequences I 18:17:28.103957 avoided 1 collisions in 0x01-byte sequences I 18:17:28.104002 avoided 1 collisions in 0x0e-byte sequences I 18:17:28.104058 avoided 11414 collisions in 0x00-byte sequences I 18:17:29.855828 segmenting [29.73m, 29.66m CPU] I 18:17:29.856095 total segmenting CPU time: 29.66m I 18:17:29.860038 saving chunks... I 18:17:30.283030 saving directories... I 18:17:30.283157 saving shared files table... I 18:17:30.284596 saving names table... [1.369ms] I 18:17:30.284650 saving symlinks table... [1.206us] I 18:17:33.667032 waiting for compression to finish... I 18:17:34.348834 compressed 222.5 GiB to 12.77 GiB (ratio=0.0573964) I 18:17:34.404042 compression CPU time: 1.252m I 18:17:34.404202 filesystem created without errors [29.8m] ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ waiting for block compression to finish ███████████████████████████████████████████████████████████████▏ 0 B/s 1 dirs, 0/0 soft/hard links, 1/1 files, 0 other original size: 222.5 GiB, scanned: 0 B, hashed: 0 B (0 files) saved by deduplication: 0 B (0 files), saved by segmenting: 203.1 GiB filesystem: 19.39 GiB in 20 blocks (20189610 chunks, 1/1 fragments, 1 inodes) compressed filesystem: 20 blocks/12.77 GiB written

The big differences I see:

-l 1 -B 200 -L 150G -S 30 -W 12 -w 1 --order=path

11414 collisions in 0x00 bloom filter size: 256 MiB segmentation matches: good=19,508,109, bad=130,385,584, collisions=303,873,146, total=453,766,839 12.77 GiB (ratio=0.0573964) [29.8m]

-l 6 -B 8192 -L 64G --order=path -W 12 -w 1 --bloom-filter-size=1

1883 collisions in 0x00 bloom filter size: 16 MiB segmentation matches: good=5,491,785, bad=56,388,213, collisions=528,945,906, total=590,825,904 9.896 GiB (ratio=0.0444751) [24.84h]

In my earlier tests, it seemed that the largest block size (-S 30) is what brought the final size down the most. the -l 6 dropped block size to 24. I'm going to retry that command, with -S 30 added, to see what we get.

Actually... I think I'll retry the new command with a smaller bloom filter and larger bloom filter to see how it effects speed and size.

mhx commented 1 year ago

That went fast, 29.8m to compress to 12.77 GiB:

At last!

I believe what I missed in my previous analysis of bloom filter size vs. cache size is the deep lookback. There's one "main" bloom filter that sits in front of the block-specific ones. If that filter indicates that there might be a match, all blocks need to be checked, one after another. If that first filter is large enough, you'll get great performance.

Actually... I think I'll retry the new command with a smaller bloom filter and larger bloom filter to see how it effects speed and size.

Smaller will very likely tank the performance. Going a bit larger might still improve performance slightly.

Can you also please try the following options:

-l 6 -B 200 -L 150G -S 30 -W 12 -w 1 --order=path --categorize=incompressible -C incompressible::null --incompressible-fragments

Thanks!

xcfmc commented 1 year ago

That performed really well! Compressed down to 9.891 GiB in 43.49m. That's even smaller than the 24.84h run at 9.896 GiB.

What do those switches do?

Full output here:

~/dwarfs-universal-0.7.2-130-g254bf41-Linux-x86_64 --tool=mkdwarfs -l 6 -B 200 -L 150G -S 30 -W 12 -w 1 --order=path --categorize=incompressible -C incompressible::null --incompressible-fragments -i ./old_raptor -o ./cat_test2.dwarfs I 00:23:42.240434 scanning "/home/v/hdd/old_raptor" I 00:23:42.240590 assigning directory and link inodes... I 00:23:42.240660 waiting for background scanners... I 00:30:55.947896 scanning CPU time: 7.19m I 00:30:55.947989 finalizing file inodes... I 00:30:55.948046 finalized 0 hardlinks [5.846us] I 00:30:55.948105 finalized 1 unique files [11.82us] I 00:30:55.948155 finalized 0 non-unique files [2.394us] I 00:30:55.948206 saved 0 B / 222.5 GiB in 0/1 duplicate files I 00:30:55.948274 366 fragments I 00:30:55.948334 365 incompressible fragments I 00:30:55.948418 I 00:30:55.948463 incompressible I 00:30:55.948508 assigning device inodes... I 00:30:55.948552 assigning pipe/socket inodes... I 00:30:55.948598 building metadata... I 00:30:55.948650 building blocks... I 00:30:55.948883 saving names and symlinks... I 00:30:55.948993 updating name and link indices... I 00:30:55.953933 waiting for segmenting/blockifying to finish... I 00:30:55.954081 [incompressible] ordering 1 inodes by path name... I 00:30:55.954107 [] ordering 1 inodes by path name... I 00:30:55.954192 [incompressible] 1 inodes ordered [41.36us, 43.44us CPU] I 00:30:55.954239 [] 1 inodes ordered [20.95us, 25.09us CPU] I 00:30:55.954381 total ordering CPU time: 2.443ms I 00:30:56.124720 using a 4 KiB window at 2 KiB steps for segment analysis I 00:30:56.124804 bloom filter size: 256 MiB I 00:30:56.125498 using a 4 KiB window at 2 KiB steps for segment analysis I 00:30:56.125576 bloom filter size: 256 MiB I 00:31:57.988088 bloom filter reject rate: 99.932% (TPR=50.270%, lookups=1606281633) I 00:31:57.988186 segmentation matches: good=4626, bad=541343, collisions=0, total=545969 I 00:31:57.988265 segmentation collisions: L1=0.016%, L2=0.000% [782525 hashes] I 00:31:58.115884 [incompressible] segmenting [1.036m, 1.036m CPU] I 01:00:37.968195 bloom filter reject rate: 99.565% (TPR=63.174%, lookups=20297569960) I 01:00:37.968304 segmentation matches: good=23085325, bad=141625812, collisions=342486243, total=507197380 I 01:00:37.968383 segmentation collisions: L1=0.074%, L2=0.002% [9445565 hashes] I 01:00:37.968477 avoided 2 collisions in 0x04-byte sequences I 01:00:37.968534 avoided 1 collisions in 0x03-byte sequences I 01:00:37.968600 avoided 1 collisions in 0x10-byte sequences I 01:00:37.968648 avoided 2 collisions in 0x01-byte sequences I 01:00:37.968695 avoided 1 collisions in 0x0e-byte sequences I 01:00:37.968742 avoided 7676 collisions in 0x00-byte sequences I 01:00:39.856038 [] segmenting [29.73m, 29.64m CPU] I 01:00:39.856407 total segmenting CPU time: 30.67m I 01:00:39.860614 saving chunks... I 01:00:40.571103 saving directories... I 01:00:40.571232 saving shared files table... I 01:00:40.572739 saving names table... [1.442ms] I 01:00:40.572803 saving symlinks table... [1.406us] I 01:00:44.708326 waiting for compression to finish... I 01:07:11.411533 compressed 222.5 GiB to 9.891 GiB (ratio=0.0444548) I 01:07:11.476596 compression CPU time: 3.03h I 01:07:11.476742 filesystem created without errors [43.49m] ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ waiting for block compression to finish ███████████████████████████████████████████████████████████████▏ 0 B/s 1 dirs, 0/0 soft/hard links, 1/1 files, 0 other original size: 222.5 GiB, scanned: 222.5 GiB, hashed: 0 B (0 files) saved by deduplication: 0 B (0 files), saved by segmenting: 203 GiB filesystem: 19.52 GiB in 21 blocks (23761818 chunks, 731/731 fragments, 1 inodes) compressed filesystem: 21 blocks/9.891 GiB written

mhx commented 1 year ago

That performed really well! Compressed down to 9.891 GiB in 43.49m. That's even smaller than the 24.84h run at 9.896 GiB.

Phew! :-)

Yeah, this is roughly what I expected.

I 01:00:39.856407 total segmenting CPU time: 30.67m

This is comparable to the earlier run.

I 01:07:11.476596 compression CPU time: 3.03h

This is where this version spent quite a bit of time, squeezing out a few extra GiB.

What do those switches do?

They're all part of the new categorizer feature. --categorize=incompressible enables the incompressible categorizer, which performs a (surprisingly fast) scan of the data to determine if it's compressible or not. This is usually applied to whole files, but with the --incompressible-fragments switch, it'll generate a list of "fragments" that are either labelled as "" or as "incompressible". A few options can now be specified per-category, so with -C incompressible::null we set compression to null (i.e. no compression) for all fragments labelled "incompressible". If you have a lot of such incompressible data, building the file system image should be faster (as less blocks have to be compressed) and using the image should also be faster (because the uncompressed blocks can be used directly by just memory-mapping them instead of having to decompress them). Ideally, the resulting images should be roughly the same size. In practice, they can be smaller (because the compressible data is clustered more closely), or larger (because the "incompressible" detection is not perfect).

There are going to be more categorizers than just incompressible, but the only other functioning one right now is pcmaudio, which detects PCM audio files and compresses them using FLAC.

One nice side-effect of the categorizers is that each category gets its own segmenter, all of which can run in parallel. (But this isn't yet the parallelization that I have in mind for the segmenters.)

M-Gonzalo commented 1 year ago

Dumb question, but have you tried something like -W8? In my tests, I have found it works best for binaries, like what you're trying to compress. For NodeJS projects, OTOH, it's best to use -W12, for example. By 'works best' I mean it can detect way more duplicates, so the resulting file is smaller and gets created faster.

xcfmc commented 1 year ago

I'll give that a shot. For some reason I thought -W only went down to 12... Actually, I think I did try going below 12 and there was a read performance or mount issue. Mounting with -o mlock=must was failing on certain compression parameters, and I think going under 12 triggered that. In any case, I'll test it out.

mhx commented 1 year ago

Mounting with -o mlock=must was failing on certain compression parameters, and I think going under 12 triggered that. In any case, I'll test it out.

What's the size of your metadata? Can you post the output of running dwarfsck on your image?

xcfmc commented 1 year ago

I'll try remounting my older archives to see if I can find one that does that. Originally I had a script that would try mlock=must and then fallback to no mlock, then I discovered the mlock=try option, so I haven't seen the error in a while. Since I have 256GB of RAM, I use these mount parameters:

-o allow_other -o auto_unmount -o cache_image -o cache_files -o cachesize=200g -o workers=48 -o mlock=try -o large_read -o splice_read -o decratio=0

After benchmarking, they all seem to improve performance enough to keep them in there.

As for the test with W8, I ended up aborting that after 24 hours. This was a test on a different set of drive images (two different installs of LinuxCNC on 60GB SSD's). Compression with W12 went in about 1 hour, but at W8 it was at 30GB after 24 hours, with 10% left to go, and the W12 archive was 35GB completed, so it didn't look like there was much savings in the cards there.

I may have deleted or recompressed the ones that had the mlock fail... If so, I'll attempt to recreate it. It will likely take a 24h+ compression to get that result, so I'll have to give an update on that tomorrow.

M-Gonzalo commented 1 year ago

Yes, you can usually compare the first couple of blocks to see how much does or doesn't save the difference in window size. Try playing around with it. it'll depend on your data.

xcfmc commented 1 year ago

The m8 archive finished and it mounted without the mlock error. Unfortunately, I didn't save the archive that gave me that error. I set my script back to mlock=must, so if an archive fails to mount in the future, I'll see it and report it.

With m12 uncompressed metadata size: 72.65 MiB, 9,095,435 chunks, 34.91 GiB (74.87%) With m8 uncompressed metadata size: 553.2 MiB, 71,392,731 chunks, 34.68 GiB (81.62%)

M-Gonzalo commented 1 year ago

Yes, you can test it in increments of 1. It's usually pretty obvious the moment the smaller window gives dwarfs an edge, you see the "saved by segmenting" section explode.

mhx commented 11 months ago

With m12 uncompressed metadata size: 72.65 MiB, 9,095,435 chunks, 34.91 GiB (74.87%) With m8 uncompressed metadata size: 553.2 MiB, 71,392,731 chunks, 34.68 GiB (81.62%)

What's the percentage at the end? There's quite a difference in percentage, but the image sizes seem pretty close.

That being said, the gain in terms of image size seems rather small, so I'd probably stick to the larger block size.

Phantop commented 10 months ago

What's the state of this possibly making it into the main branch? The flac compression seems really useful, and I was wondering whether any consideration could be given to using jxl for jpeg compression, in a similar manner?

mhx commented 10 months ago

What's the state of this possibly making it into the main branch?

It'll make it into the next release. Merging to main is blocked on a few issues that I need to find solutions for and I currently have little time to work on.

The flac compression seems really useful, and I was wondering whether any consideration could be given to using jxl for jpeg compression, in a similar manner?

There's a subtle difference that makes FLAC much easier to integrate. Audio data can easily be merged/split into blocks, each of which can be independently compressed. So the FLAC integration works on the block level, just like any other compression in DwarFS. For jxl, DwarFS needs the ability to apply compression/decompression on the file level. It's on the roadmap, but not for the next release.

mhx commented 8 months ago

If you want to play with the new features, the CI build workflow now automatically uploads the binary artifacts (universal binaries as well as the regular binary tarball) for each build. These are definitely not "release quality", but it's still good to give them wider exposure to find bugs early.

mhx commented 7 months ago

Fixed in v0.8.0.