mhx / dwarfs

A fast high compression read-only file system for Linux, Windows and macOS
GNU General Public License v3.0

Support bzip3 compression #84

Closed kspalaiologos closed 1 year ago

kspalaiologos commented 2 years ago

Over the last two weeks I've been working on bzip3, which is ultimately intended to replace bzip2. I believe that it'd be a good addition to dwarfs judging by the compression ratios it offers and its fairly good performance.

To roughly mimic your benchmark, I first downloaded every version of Perl 5 ever released and decompressed the archives.

% wget -r -l1 -nH --cut-dirs=2 --no-parent -A.tar.gz --no-directories https://www.cpan.org/src/5.0/
% for g in *.gz; do gunzip $g; done
% ls -la | wc -l
262

Then, I put all the resulting .tar files in a single .tar file and tried to compress it using various compressors:

xz -T16 -9 -k all.tar  10829.91s user 26.91s system 1488% cpu 14658M memory 12:09.24 total
bzip2 -9 -k all.tar  981.78s user 9.77s system 95% cpu 8M memory 17:16.64 total
bzip3 -e -b 256 -j 12 all.tar  2713.81s user 16.28s system 634% cpu 18301M memory 7:10.10 total
zstd -T12 -16 all.tar  4162.94s user 16.40s system 1056% cpu 687M memory 6:35.62 total

The results follow:

Then, I used lrzip to perform long-range deduplication on the original .tar file:

% time lrzip -n -o all_none.tar.lrz all.tar
546.17s user 160.87s system 102% cpu 10970M memory 11:28.00 total

% time lrzip --lzma -o all_lzma.tar.lrz all.tar
702.16s user 161.87s system 122% cpu 10792M memory 11:44.83 total

% time lrzip -b -o all_bzip2.tar.lrz all.tar
563.93s user 147.38s system 112% cpu 10970M memory 10:34.10 total

Finally, I compressed the resulting all_none.tar.lrz file using bzip3:

% time bzip3 -e -b 256 -j 2 all_none.tar.lrz
32.05s user 0.76s system 146% cpu 2751M memory 22.411 total

The results follow:

I believe that the long-range deduping dwarfs offers, coupled with the block-level deduping built into bzip3 and a powerful BWT entropy post-coder, would excel in particular on source code, text, various documents, etc.
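To make the long-range deduplication idea concrete, here is a toy sketch (not dwarfs' or lrzip's actual algorithm): hash fixed-size chunks and store each unique chunk only once, keeping a list of references to rebuild the input.

```python
import hashlib

def dedup(data: bytes, chunk_size: int = 4096):
    """Toy fixed-size-chunk deduplication: returns the unique chunks
    plus the sequence of chunk indices needed to rebuild the input."""
    chunks, index, refs = [], {}, []
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        key = hashlib.sha256(chunk).digest()
        if key not in index:            # first time we see this chunk
            index[key] = len(chunks)
            chunks.append(chunk)
        refs.append(index[key])         # reference into the chunk store
    return chunks, refs

def rebuild(chunks, refs):
    return b"".join(chunks[i] for i in refs)

# Two "releases" sharing most of their content dedup well:
data = b"A" * 8192 + b"B" * 4096 + b"A" * 8192
chunks, refs = dedup(data)
assert rebuild(chunks, refs) == data
print(len(refs), "chunk refs,", len(chunks), "unique chunks stored")
```

Real long-range dedupers use content-defined chunk boundaries so that insertions don't shift every subsequent chunk, but the store-once-and-reference principle is the same.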

I have also tested dwarfs on the directory with tarballs yielding the following result:

% time mkdwarfs -i perls -o perls.dwarfs
15357.35s user 54.26s system 1808% cpu 7871M memory 14:12.21 total
% wc -c perls.dwarfs
2911394490 perls.dwarfs

In retrospect, I should have unpacked all the tarballs, but lrzip seems to have done a great job at deduping without this. Maybe this functionality could be implemented into dwarfs at some point in the future too.

mhx commented 2 years ago

Thanks, this looks really interesting! I'll take a closer look as soon as I have some time.

bdnbje commented 2 years ago

Testing bzip3 on a set of binary data.

bzip3

time bzip3 -e -f -b 511 -j 2  mame.tar mame.tar.bz3

real    3m56.233s
user    7m21.964s
sys 0m4.543s

LZMA

time 7z a -t7z -m0=lzma:d=512m:fb=273:mc=1000000000:lc=8 mame.tar.7z mame.tar

7-Zip (a) 21.07 (x64) : Copyright (c) 1999-2021 Igor Pavlov : 2021-12-26
 64-bit locale=en_US.UTF-8 Threads:16, ASM

Scanning the drive:
1 file, 4218286080 bytes (4023 MiB)

Creating archive: mame.tar.7z

Add new data to archive: 1 file, 4218286080 bytes (4023 MiB)

Files read from disk: 1
Archive size: 1411638532 bytes (1347 MiB)
Everything is Ok

real    29m54.570s
user    45m24.571s
sys 0m10.005s

sizes of the files:

du mame*
4119420 mame.tar
1378556 mame.tar.7z
1629692 mame.tar.bz3

LZMA has specific tricks that will always make it smaller on binary data and faster to decode than bzip3. The only case where bzip3 might win is text data, but even that could be beaten by using lc=8:lp=0:pb=0. bzip3 is not a good compression algorithm for a filesystem since it's not asymmetrical, i.e. it will require the same time to decompress as to compress, whereas LZMA is ~10x faster to decode than to encode. xz, a derivative of LZMA2, is a far simpler version than the LZMA2 in Igor's SDK; the LZMA1 I used here is much more suitable since it allows more of the settings to be used. Also, xz's default level 9 is not optimal for all data.

Decode of LZMA:

time 7z t mame.tar.7z

7-Zip (a) 21.07 (x64) : Copyright (c) 1999-2021 Igor Pavlov : 2021-12-26
 64-bit locale=en_US.UTF-8 Threads:16, ASM

Scanning the drive for archives:
1 file, 1411638532 bytes (1347 MiB)

Testing archive: mame.tar.7z
--
Path = mame.tar.7z
Type = 7z
Physical Size = 1411638532
Headers Size = 122
Method = LZMA:29
Solid = -
Blocks = 1

Everything is Ok

Size:       4218286080
Compressed: 1411638532

real    0m46.479s
user    0m45.334s
sys 0m0.553s
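The encode/decode asymmetry argued above can be illustrated even with Python's standard-library lzma bindings (a rough sketch, not a calibrated benchmark; timings vary by machine):

```python
import lzma
import os
import time

# Mildly compressible test data (~2 MiB): random bytes interleaved with zero runs.
data = (os.urandom(64) + b"\x00" * 192) * 8192

t0 = time.perf_counter()
packed = lzma.compress(data, preset=9)  # expensive: match finding, range coding
t_enc = time.perf_counter() - t0

t0 = time.perf_counter()
unpacked = lzma.decompress(packed)      # cheap: just replay the encoder's decisions
t_dec = time.perf_counter() - t0

assert unpacked == data                 # lossless round trip
print(f"encode: {t_enc:.3f}s  decode: {t_dec:.3f}s  "
      f"ratio: {len(data) / len(packed):.2f}")
```

Encoding should come out well above decoding here, which is the asymmetry being argued; a BWT-based codec like bzip3 would show much closer encode and decode times.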

kspalaiologos commented 2 years ago

I beg to differ. I used your flags on a Wolfram Mathematica 12.0 install which is a fairly diverse corpus (a few Java runtimes, a bunch of executables/dlls, some Wolfram language code), and this is what I got:

% time bzip3 -e -b 511 -j 4 mathematica.tar
bzip3 -e -b 511 -j 4 mathematica.tar  1465.57s user 10.40s system 343% cpu 12247M memory 7:09.20 total
% time 7z a -t7z -m0=lzma:d=512m:fb=273:mc=1000000000:lc=8 mathematica.tar.7z mathematica.tar

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,24 CPUs AMD Ryzen 9 3900X 12-Core Processor             (870F10),ASM,AES-NI)

Scanning the drive:
1 file, 10487152640 bytes (10002 MiB)

Creating archive: mathematica.tar.7z

Items to compress: 1

Files read from disk: 1
Archive size: 3151839730 bytes (3006 MiB)
Everything is Ok
7z a -t7z -m0=lzma:d=512m:fb=273:mc=1000000000:lc=8 mathematica.tar.7z   6927.57s user 20.74s system 144% cpu 5388M memory 1:20:10.82 total
% time pbzip2 -b500 -k -f mathematica.tar
pbzip2 -b500 -k -f mathematica.tar  1663.41s user 34.60s system 2162% cpu 3142M memory 1:18.53 total

% time zpaq a mathematica.zpaq mathematica.tar -t24 -m4
zpaq a mathematica.zpaq mathematica.tar -t24 -m4  15070.48s user 144.42s system 2190% cpu 11961M memory 11:34.58 total

% du -h mathematica*
9.8G    mathematica.tar
3.0G    mathematica.tar.7z
4.2G    mathematica.tar.bz2
3.4G    mathematica.tar.bz3
3.1G    mathematica.zpaq

LZMA with your flags took an hour and a half to compress 9 GiB of data, so by an educated guess it would probably take more than 3 hours to compress the dataset above; compared to the bzip3 timing of 7 minutes, LZMA is almost 12 times slower. Secondly, bzip3 supports parallel decompression, just like bzip2, because all blocks are independent.

I don't think it was worth spending my entire morning watching paint dry, but there's more: if we wanted a fairer comparison, we could apply some deduplication to the initial data (which LZMA presumably handled before, as it's an LZ derivative):

% time lrzip -n -q -T mathematica.tar
Output filename is: mathematica.tar.lrz
mathematica.tar - Compression Ratio: 1.360. Average Compression Speed: 84.042MB/s.
Total time: 00:01:59.01
lrzip -n -q -T mathematica.tar  95.39s user 62.06s system 132% cpu 17428M memory 1:59.02 total

% bzip3 -f -e -b 511 -j 4 mathematica.tar.lrz
bzip3 -f -e -b 511 -j 4 mathematica.tar.lrz  1261.14s user 10.08s system 337% cpu 12250M memory 6:17.19 total

# Because I don't have much more time this morning to waste on the other LZMA frontend:
% xz -8 --threads=16 --extreme mathematica.tar.lrz
xz -8 --threads=16 --extreme mathematica.tar.lrz  5121.02s user 12.29s system 1522% cpu 9131M memory 5:37.06 total

% du -h mathematica*
9.8G    mathematica.tar
3.1G    mathematica.tar.lrz.bz3
3.0G    mathematica.tar.lrz.xz

Which proves my point that bzip3 performs pretty well with the aid of long-range deduping - bzip3 isn't exactly an LZ derivative, so to have a fair benchmark on these kinds of data, we'd need to apply some sort of dictionary coding.

Another thing (that feels painfully obvious to me) I would like to point out is that no compression algorithm is the philosopher's stone of computing that "turns lead into gold" under all circumstances. Hence bzip3, as a compressor conceptually very different from everything dwarfs supports, could be quite beneficial to include. And it's not like we have to compress all blocks with the same compressor in the end.
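That last point - mixing compressors per block - can be sketched with the codecs in Python's standard library (lzma, bz2, and zlib stand in here for the codecs dwarfs actually offers; this illustrates the idea, not dwarfs' implementation):

```python
import bz2
import lzma
import zlib

# Candidate codecs; in dwarfs these would be e.g. lzma, zstd, and (hypothetically) bzip3.
CODECS = {
    "lzma": lambda b: lzma.compress(b, preset=6),
    "bz2":  lambda b: bz2.compress(b, 9),
    "zlib": lambda b: zlib.compress(b, 9),
}

def best_codec(block: bytes):
    """Compress the block with every candidate and keep the smallest output."""
    results = {name: fn(block) for name, fn in CODECS.items()}
    winner = min(results, key=lambda n: len(results[n]))
    return winner, results[winner]

text_block = b"the quick brown fox jumps over the lazy dog " * 1000
winner, packed = best_codec(text_block)
print(winner, len(packed))
```

Trying every codec on every block is the brute-force version; a practical implementation would sample or use heuristics, but the payoff is the same: each block gets whichever codec suits its content.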

Finally, regarding .lrz decompression times: xz takes 2:44.45 wall-clock time, while bzip3 -j4 takes 4:30.55, which isn't quite symmetric, but I can admit that it's a bit too slow.

Phantop commented 2 years ago

@kspalaiologos, I would like to chime in and note, however, that you aren't exactly using the most performant LZMA options with 7z. If I am reading your args correctly, you aren't using multithreading, and you could be using fast LZMA2 (although dwarfs currently doesn't use it, so perhaps not ideal for comparison). Your xz example is better on this front.

My preferred option is the following, which in my testing was very comparable to bzip3: still slower, but with a somewhat higher ratio (and I'm sure it could be tweaked to be closer with some effort):

7z a -m0=flzma2 -mx9 -mmt=8 -mfb=273 -md=256m -ms=on

I am curious to see how bzip3 would perform in a dwarfs scenario, regardless.

kspalaiologos commented 2 years ago

At this point I'm pretty sure that there are as many opinions about compression as there are people in this thread. Anyway, regarding flzma2: it's not present in my 7-Zip build, nor in the official build, nor in my distro's repositories, so I can't verify your results.

If I am reading your args correctly, you aren't using multithreading

7zip seems to have consumed two of my cores.

Phantop commented 2 years ago

At this point I'm pretty sure that there are as many opinions about compression as there are people in this thread. Anyway, regarding flzma2: it's not present in my 7-Zip build, nor in the official build, nor in my distro's repositories, so I can't verify your results.

Ah, I forgot that my distro of choice utilizes this fork: https://github.com/jinfeihan57/p7zip.

If I am reading your args correctly, you aren't using multithreading

7zip seems to have consumed two of my cores.

Well, even still, the -mmt option can specify the usage of more cores. I meant full multithreading using all cores, apologies.

kspalaiologos commented 2 years ago

Results on the Mathematica installation:

% ./7z a -m0=flzma2 -mx9 -mmt=24 -mfb=273 -md=256m -ms=on wolfram.7z wolfram.tar

7-Zip [64] 17.04 : Copyright (c) 1999-2021 Igor Pavlov : 2017-08-28
p7zip Version 17.04 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,24 CPUs x64)

Scanning the drive:
1 file, 10487152640 bytes (10002 MiB)

Creating archive: wolfram.7z

Items to compress: 1

Files read from disk: 1
Archive size: 3318400333 bytes (3165 MiB)
Everything is Ok
./7z a -m0=flzma2 -mx9 -mmt=24 -mfb=273 -md=256m -ms=on wolfram.7z wolfram.ta  10041.80s user 42.31s system 1780% cpu 1878M memory 9:26.25 total

These results are indeed much better than before, but still slightly worse in compression ratio than lrz+bzip3, and roughly 7 times slower (10042s CPU time vs 1466s).

Phantop commented 2 years ago

7z doesn't implement any long-range matching of its own. You'd need to use lrzip with both, or with neither, for a fully fair comparison.

I need to redo some of my own testing but with the data I tried (a large uncompressed pdf and the data from a game) bzip3 was typically faster and comparable in size. I can't remember whether 7z or bz3 won out.

bdnbje commented 2 years ago

LZMA with your flags took an hour and a half to compress 9 GiB of data, so by an educated guess it would probably take more than 3 hours to compress the dataset above; compared to the bzip3 timing of 7 minutes, LZMA is almost 12 times slower. Secondly, bzip3 supports parallel decompression, just like bzip2, because all blocks are independent.

So does LZMA2 if you pass the mt flag (albeit with a slightly lower compression ratio). I didn't use it since I wanted the maximum CR, and decoding would be much faster anyway even on a single core (a crucial requirement for a read-only filesystem).

I don't think it was worth spending my entire morning watching paint dry, but there's more: if we wanted a fairer comparison, we could apply some deduplication to the initial data (which LZMA presumably handled before, as it's an LZ derivative):

There is no deduping step in 7z. How does being an LZ derivative make it a deduplication algorithm?

Which proves my point that bzip3 performs pretty well with the aid of long-range deduping - bzip3 isn't exactly an LZ derivative, so to have a fair benchmark on these kinds of data, we'd need to apply some sort of dictionary coding.

Aren't you already using an LZ derivative, LZP, in your codec?

Another thing (that feels painfully obvious to me) I would like to point out is that no compression algorithm is the philosopher's stone of computing that "turns lead into gold" under all circumstances. Hence bzip3, as a compressor conceptually very different from everything dwarfs supports, could be quite beneficial to include. And it's not like we have to compress all blocks with the same compressor in the end.

BWT algorithms, in this case bzip3, excel on text and only on text; on anything else you throw at them, LZMA will far outperform. I do like the idea of compressing different blocks with different algorithms, but then we are already pushing what is a read-only FS into archiver territory.

Finally, regarding .lrz decompression times: xz takes 2:44.45 wall-clock time, while bzip3 -j4 takes 4:30.55, which isn't quite symmetric, but I can admit that it's a bit too slow.

It will also use the same amount of memory to decode as it required to encode, which IIRC is ~5N for libsais, whereas LZMA requires 32 KiB + dictionary size to decode and ~11.5x the dictionary size to encode. So the practical use cases of a BWT algorithm like bzip3 are limited to text, and even then only if you're willing to spend a lot of threads to make it fast and feed those threads a lot of memory. Not the right fit for a read-only FS. Notice that no filesystem has ever used a BWT algorithm as its transparent compression method. :)
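Taking those figures at face value (they are rough approximations from this thread, not guarantees), the footprints for the Silesia test below work out as follows:

```python
MIB = 1 << 20

def bwt_mem_bytes(block_size):
    # Thread's figure: suffix-array construction (libsais) needs ~5N per block,
    # for decompression as well as compression.
    return 5 * block_size

def lzma_mem_bytes(dict_size, encode):
    # Thread's figures: ~11.5x the dictionary to encode, 32 KiB + dictionary to decode.
    return int(11.5 * dict_size) if encode else 32 * 1024 + dict_size

# 16 bzip3 threads on 64 MiB blocks vs one LZMA decoder with a 203 MiB dictionary:
bzip3_total = 16 * bwt_mem_bytes(64 * MIB)             # 5120 MiB, i.e. ~5 GB
lzma_decode = lzma_mem_bytes(203 * MIB, encode=False)  # ~203 MiB
print(f"bzip3 decode (16 threads): {bzip3_total / MIB:.0f} MiB")
print(f"LZMA decode (203 MiB dict): {lzma_decode / MIB:.0f} MiB")
```

These numbers line up with the ~5 GB vs ~205 MB quoted later in this comment.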

A small test on the Silesia corpus. LZMA (compiled with GCC 11.2 with generic settings):

time 7z a -t7z -m0=lzma:d=203m:fb=273:mc=1000000000:lc=8 silesia.tar.7z silesia.tar

7-Zip (a) 21.07 (x64) : Copyright (c) 1999-2021 Igor Pavlov : 2021-12-26
 64-bit locale=en_US.UTF-8 Threads:16, ASM

Scanning the drive:
1 file, 211957760 bytes (203 MiB)

Creating archive: silesia.tar.7z

Add new data to archive: 1 file, 211957760 bytes (203 MiB)

Files read from disk: 1
Archive size: 48022348 bytes (46 MiB)
Everything is Ok

real    1m22.642s
user    2m5.335s
sys 0m1.010s

bzip3 (compiled with AOCC for the best-case scenario):

time bzip3 -e -f -b 64 -j 16  silesia.tar silesia.tar.bz3

real    0m9.909s
user    0m24.006s
sys 0m1.238s

file sizes:

du silesia*
206992  silesia.tar
46900   silesia.tar.7z
46268   silesia.tar.bz3

decode:

time bzip3 -t -j 16 silesia.tar.bz3

real    0m8.246s
user    0m19.949s
sys 0m1.238s

time 7z t silesia.tar.7z

7-Zip (a) 21.07 (x64) : Copyright (c) 1999-2021 Igor Pavlov : 2021-12-26
 64-bit locale=en_US.UTF-8 Threads:16, ASM

Scanning the drive for archives:
1 file, 48022348 bytes (46 MiB)

Testing archive: silesia.tar.7z
--
Path = silesia.tar.7z
Type = 7z
Physical Size = 48022348
Headers Size = 130
Method = LZMA:203m
Solid = -
Blocks = 1

Everything is Ok

Size:       211957760
Compressed: 48022348

real    0m1.782s
user    0m1.687s
sys 0m0.093s

Here bzip3 wins by a very tiny margin in CR, with a 10x faster encode than LZMA, but it's ~8x slower to decode, and using 16 threads requires roughly 10x the memory, around 5 GB, while LZMA needs around 205 MB to decode. PS: I didn't include the likes of flzma2 since they are only faster to encode, and Igor has since pushed updates (21.xx) with some speed optimizations.

mhx commented 1 year ago

I'm a bit reluctant to spend the effort of including bzip3 compression at the moment.

I did a quick-and-dirty performance test, but I reckon it's somewhat representative.

I took a DwarFS image with >1000 Perl installations (so mixed text and binary data) in it. This image uses lzma compression and is about 300 MiB in size (down from 50 GiB of input data). I created an "uncompressed" image from it:

mkdwarfs --recompress=all -l0 -i perl-install.dwarfs -o perl-install-uncompressed.dwarfs

This uncompressed image is 4.2 GiB. I used a script to extract the individual blocks:

extract_blocks.py perl-install-uncompressed.dwarfs blocks/perl-install

This resulted in 69 blocks (data, metadata, index, ...), most of which were 64 MiB in size. A few were as small as 552 bytes. Compressing each of these blocks individually gives a direct estimate of how well a compression algorithm performs in the context of DwarFS. So I did the following:

% time find blocks -type f -print0 | xargs -0 -n1 -P8 -- bash -c 'zstd -9 -c $0 >${0/blocks/zstd}.zstd' 
find blocks -type f -print0  0.00s user 0.00s system 82% cpu 0.002 total
xargs -0 -n1 -P8 -- bash -c 'zstd -9 -c $0 >${0/blocks/zstd}.zstd'  107.67s user 4.11s system 758% cpu 14.742 total
% time find blocks -type f -print0 | xargs -0 -n1 -P8 -- bash -c '../bzip3 -b 511 -c $0 >${0/blocks/bzip3}.bz3'
find blocks -type f -print0  0.00s user 0.00s system 84% cpu 0.002 total
xargs -0 -n1 -P8 -- bash -c '../bzip3 -b 511 -c $0 >${0/blocks/bzip3}.bz3'  422.52s user 53.44s system 749% cpu 1:03.48 total

I picked -b 511 for bzip3 as that seemed to be where it performed quite well in its own benchmarks. I picked -9 for zstd pretty much randomly. As can be seen, zstd is about 4-5 times faster than bzip3. This is obviously highly dependent on the compression level, but already at this level, zstd compresses better than bzip3:

% du -hs blocks bzip3 zstd    
4.2G    blocks
488M    bzip3
467M    zstd

With lzma, as mentioned before, this goes down to about 300 MiB.
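The block-by-block estimation above can also be scripted; this sketch (the helper name is mine) uses stdlib lzma in place of the zstd/bzip3 CLIs and assumes the extracted blocks are plain files in a directory, as produced by the extraction script:

```python
import lzma
from pathlib import Path

def per_block_ratio(block_dir, compress=lambda b: lzma.compress(b, preset=6)):
    """Compress each extracted block on its own and return the overall
    compressed/uncompressed ratio. Since DwarFS compresses blocks
    independently, this estimates in-image performance directly."""
    total_in = total_out = 0
    for path in sorted(Path(block_dir).iterdir()):
        data = path.read_bytes()
        total_in += len(data)
        total_out += len(compress(data))
    return total_out / total_in if total_in else 1.0
```

For example, `per_block_ratio("blocks")` after extraction; swap the `compress` callable to compare codecs under identical conditions.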

This is certainly not an exhaustive test, but I currently don't see bzip3 outperforming any of the existing options except in edge cases. For reference, I used the v1.1.8 x86_64 bzip3 binary.

I'm moving this issue to discussions for now; if there are significant improvements to bzip3, I'm going to reconsider adding support.