Closed: kspalaiologos closed this 1 year ago
Thanks, this looks really interesting! I'll take a closer look as soon as I have some time.
Testing bzip3 on a set of binary data: bzip3
time bzip3 -e -f -b 511 -j 2 mame.tar mame.tar.bz3
real 3m56.233s
user 7m21.964s
sys 0m4.543s
LZMA
time 7z a -t7z -m0=lzma:d=512m:fb=273:mc=1000000000:lc=8 mame.tar.7z mame.tar
7-Zip (a) 21.07 (x64) : Copyright (c) 1999-2021 Igor Pavlov : 2021-12-26
64-bit locale=en_US.UTF-8 Threads:16, ASM
Scanning the drive:
1 file, 4218286080 bytes (4023 MiB)
Creating archive: mame.tar.7z
Add new data to archive: 1 file, 4218286080 bytes (4023 MiB)
Files read from disk: 1
Archive size: 1411638532 bytes (1347 MiB)
Everything is Ok
real 29m54.570s
user 45m24.571s
sys 0m10.005s
sizes of the files:
du mame*
4119420 mame.tar
1378556 mame.tar.7z
1629692 mame.tar.bz3
LZMA has specific tricks that will always make it smaller on binary data and faster to decode than bzip3. The only case where bzip3 might win is text data, and even that could be beaten by using lc8:lp0:pb0. bzip3 is not a good compression algorithm for a filesystem since it's not asymmetrical, i.e. it requires the same time to decompress as to compress, whereas LZMA is about 10x faster to decode than to encode. xz, which is a derivative of LZMA2, is a far simpler version than the LZMA2 in Igor's SDK; the LZMA1 I used here is much more suitable since it allows more of the settings to be used. The default level 9 of xz is not optimal for all data. Decode of LZMA:
time 7z t mame.tar.7z
7-Zip (a) 21.07 (x64) : Copyright (c) 1999-2021 Igor Pavlov : 2021-12-26
64-bit locale=en_US.UTF-8 Threads:16, ASM
Scanning the drive for archives:
1 file, 1411638532 bytes (1347 MiB)
Testing archive: mame.tar.7z
--
Path = mame.tar.7z
Type = 7z
Physical Size = 1411638532
Headers Size = 122
Method = LZMA:29
Solid = -
Blocks = 1
Everything is Ok
Size: 4218286080
Compressed: 1411638532
real 0m46.479s
user 0m45.334s
sys 0m0.553s
I beg to differ. I used your flags on a Wolfram Mathematica 12.0 install which is a fairly diverse corpus (a few Java runtimes, a bunch of executables/dlls, some Wolfram language code), and this is what I got:
% time bzip3 -e -b 511 -j 4 mathematica.tar
bzip3 -e -b 511 -j 4 mathematica.tar 1465.57s user 10.40s system 343% cpu 12247M memory 7:09.20 total
% time 7z a -t7z -m0=lzma:d=512m:fb=273:mc=1000000000:lc=8 mathematica.tar.7z mathematica.tar
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,24 CPUs AMD Ryzen 9 3900X 12-Core Processor (870F10),ASM,AES-NI)
Scanning the drive:
1 file, 10487152640 bytes (10002 MiB)
Creating archive: mathematica.tar.7z
Items to compress: 1
Files read from disk: 1
Archive size: 3151839730 bytes (3006 MiB)
Everything is Ok
7z a -t7z -m0=lzma:d=512m:fb=273:mc=1000000000:lc=8 mathematica.tar.7z 6927.57s user 20.74s system 144% cpu 5388M memory 1:20:10.82 total
% time pbzip2 -b500 -k -f mathematica.tar
pbzip2 -b500 -k -f mathematica.tar 1663.41s user 34.60s system 2162% cpu 3142M memory 1:18.53 total
% time zpaq a mathematica.zpaq mathematica.tar -t24 -m4
zpaq a mathematica.zpaq mathematica.tar -t24 -m4 15070.48s user 144.42s system 2190% cpu 11961M memory 11:34.58 total
% du -h mathematica*
9.8G mathematica.tar
3.0G mathematica.tar.7z
4.2G mathematica.tar.bz2
3.4G mathematica.tar.bz3
3.1G mathematica.zpaq
LZMA with your flags took an hour and a half to compress 9 GiB of data, so by an educated guess, it would probably take more than 3 hours to compress the dataset above. Secondly, bzip3 supports parallel decompression, just like bzip2, because all blocks are independent. Compared to the bzip3 timing of 7 minutes, LZMA is almost 12 times slower.
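The point about independent blocks enabling parallel decompression can be sketched in a few lines. zlib stands in for bzip3's actual codec here, and the block framing is purely illustrative:

```python
# Sketch: independent blocks can be decoded in any order, so a thread
# pool can decompress them concurrently. zlib is a stand-in codec; it
# releases the GIL on large buffers, so threads give real parallelism.
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocks(data: bytes, block_size: int) -> list:
    """Compress fixed-size slices independently (no state shared between blocks)."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decompress_parallel(blocks: list, workers: int = 4) -> bytes:
    # Because no block depends on another's decoder state, the results
    # can simply be decoded concurrently and concatenated in order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))

data = b"example payload " * 100_000
blocks = compress_blocks(data, block_size=1 << 16)
assert decompress_parallel(blocks) == data
```

The trade-off, as noted in this thread, is that every worker needs its own full block in memory.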
I don't think it was worth spending my entire morning watching paint dry, but there's more - if we wanted a fairer comparison, we could apply some deduping to the initial data (which was presumably handled before by LZMA, as it's an LZ derivative):
% time lrzip -n -q -T mathematica.tar
Output filename is: mathematica.tar.lrz
mathematica.tar - Compression Ratio: 1.360. Average Compression Speed: 84.042MB/s.
Total time: 00:01:59.01
lrzip -n -q -T mathematica.tar 95.39s user 62.06s system 132% cpu 17428M memory 1:59.02 total
% bzip3 -f -e -b 511 -j 4 mathematica.tar.lrz
bzip3 -f -e -b 511 -j 4 mathematica.tar.lrz 1261.14s user 10.08s system 337% cpu 12250M memory 6:17.19 total
# Because I don't have much more time this morning to waste on the other LZMA frontend:
% xz -8 --threads=16 --extreme mathematica.tar.lrz
xz -8 --threads=16 --extreme mathematica.tar.lrz 5121.02s user 12.29s system 1522% cpu 9131M memory 5:37.06 total
% du -h mathematica*
9.8G mathematica.tar
3.1G mathematica.tar.lrz.bz3
3.0G mathematica.tar.lrz.xz
Which proves my point that bzip3 performs pretty well with the aid of long-range deduping - bzip3 isn't exactly an LZ derivative, so to have a fair benchmark on these kinds of data, we'd need to apply some sort of dictionary coding.
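To illustrate what a long-range dedup pass buys here, a deliberately simplified stand-in for lrzip might look like the following (fixed-size blocks hashed with SHA-256; lrzip's real rolling-hash matcher with variable-length matches is far more capable):

```python
# Sketch of the idea behind long-range deduping: store each distinct
# fixed-size block once and keep only references for repeats, so the
# compressor afterwards sees far less redundant input.
import hashlib

def dedup(data: bytes, block: int = 4096):
    """Split into fixed-size blocks; return (unique block store, reference list)."""
    store, refs = {}, []
    for i in range(0, len(data), block):
        chunk = data[i:i + block]
        key = hashlib.sha256(chunk).digest()
        store.setdefault(key, chunk)   # keep first occurrence only
        refs.append(key)
    return store, refs

def rebuild(store, refs) -> bytes:
    return b"".join(store[k] for k in refs)

# A 4 MiB input that is one 16 KiB pattern repeated: only 4 unique
# 4 KiB blocks survive, so the data handed to the compressor shrinks.
pattern = b"".join(bytes([b]) * 64 for b in range(256))  # 16 KiB
data = pattern * 256                                      # 4 MiB
store, refs = dedup(data)
assert rebuild(store, refs) == data
assert len(store) == 4
```

Unlike an LZ match finder, this kind of pass has no window limit, which is why it helps a BWT-based compressor like bzip3 on inputs with repeats that are gigabytes apart.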
Another thing (that feels painfully obvious to me) I would like to point out is that no compression algorithm is the philosopher's stone of computing that "turns lead into gold" under all circumstances. Hence bzip3, being conceptually very different from every compressor dwarfs currently supports, could be quite beneficial to include. And it's not like we have to compress all blocks using the same compressor in the end.
Finally, regarding .lrz decompression times: xz takes 2:44.45 wall clock time, while bzip3 -j4 takes 4:30.55, which isn't quite symmetric, but I can admit that it's a bit too slow.
@kspalaiologos, I would like to pipe in and note, however, that you aren't exactly using the most performant LZMA options with 7z. If I am reading your args correctly, you aren't using multithreading and could be using fast LZMA2 (although dwarfs currently doesn't use it so perhaps not ideal for comparison). Your xz example is better on this front.
My preferred option is the following, which in my testing was very comparable to bzip3, albeit still slower, but for a somewhat higher ratio (and I'm sure it could be tweaked to be closer with some effort):
7z a -m0=flzma2 -mx9 -mmt=8 -mfb=273 -md=256m -ms=on
I am curious to see how bzip3 would perform in a dwarfs scenario, regardless.
At this point I'm pretty sure that there are as many opinions about compression as people in this thread. Anyway, regarding flzma2 - it's not present in my 7zip build, nor the official build, neither in my distro's repositories, so I can't verify your results.
If I am reading your args correctly, you aren't using multithreading
7zip seems to have consumed two of my cores.
At this point I'm pretty sure that there are as many opinions about compression as people in this thread. Anyway, regarding flzma2 - it's not present in my 7zip build, nor the official build, neither in my distro's repositories, so I can't verify your results.
Ah, I forgot that my distro of choice utilizes this fork: https://github.com/jinfeihan57/p7zip.
If I am reading your args correctly, you aren't using multithreading
7zip seems to have consumed two of my cores.
Well, even still, the -mmt option can specify the usage of more cores. I meant full multithreading using all cores; apologies.
Results on the Mathematica installation:
% ./7z a -m0=flzma2 -mx9 -mmt=24 -mfb=273 -md=256m -ms=on wolfram.7z wolfram.tar
7-Zip [64] 17.04 : Copyright (c) 1999-2021 Igor Pavlov : 2017-08-28
p7zip Version 17.04 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,24 CPUs x64)
Scanning the drive:
1 file, 10487152640 bytes (10002 MiB)
Creating archive: wolfram.7z
Items to compress: 1
Files read from disk: 1
Archive size: 3318400333 bytes (3165 MiB)
Everything is Ok
./7z a -m0=flzma2 -mx9 -mmt=24 -mfb=273 -md=256m -ms=on wolfram.7z wolfram.ta 10041.80s user 42.31s system 1780% cpu 1878M memory 9:26.25 total
The results are indeed much better than before, but the compression ratio still seems just a little worse than lrz+bzip3. And it's almost 7 times slower (10042s CPU time vs 1466s).
7z doesn't implement any of its own long-range functionality. You'd need to either use lrzip with both or neither for a fully fair comparison.
I need to redo some of my own testing, but with the data I tried (a large uncompressed PDF and the data from a game), bzip3 was typically faster and comparable in size. I can't remember whether 7z or bz3 won out.
LZMA with your flags took one and a half of an hour to compress 9GiB of data, so by an educated guess, it would probably take more than 3 hours to compress the dataset above. Secondly, bzip3 supports parallel decompression, just like bzip2, because all blocks are independent. Compared to the bzip3 timing of 7 minutes, LZMA is almost 12 times slower.
So does LZMA2 if you pass the mt flag (albeit with a bit less CR). I didn't use it since I wanted the maximum CR, and decoding would be much faster anyway with a single core (a crucial requirement for a read-only filesystem).

I don't think it was worth to spend my entire morning on watching paint dry, but there's more - if we wanted a more fair comparison, we could apply some deduping to the initial data (which was presumably handled before by LZMA, as it's a LZ derivative):
There is no deduping step in 7z. How does being an LZ derivative make it a deduping algorithm?

Which proves my point that bzip3 performs pretty well with the aid of long range deduping - bzip3 isn't exactly a LZ derivative, so to have a fair benchmark on these kinds of data, we'd need to apply some sort of dictionary coding.
You're already using an LZ derivative, LZP, in your codec?

Another thing (that feels painfully obvious to me) I would like to point out that no compression algorithm is the philosopher's stone of computing that "turns lead into gold" under all circumstances. Hence bzip3, as a very different conceptually compressor from everything dwarfs supports could be quite beneficial to include. And it's not like we have to compress all blocks using the same compressor in the end.
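For readers unfamiliar with LZP (Lempel-Ziv + Prediction): a hash of the preceding context predicts where the next match starts, so only match lengths and literals need to be coded, never positions. A toy round-trip sketch follows; the context length, hash, and token format are invented for illustration, and bzip3's actual LZP pass differs in the details:

```python
# Toy LZP codec: the decoder replays the encoder's context-hash state
# machine, so match positions are implied and never transmitted.
CTX = 4      # context length in bytes (illustrative)
HBITS = 16   # hash table of 2**HBITS slots (illustrative)

def _hash(ctx: bytes) -> int:
    return (int.from_bytes(ctx, "big") * 2654435761) % (1 << HBITS)

def lzp_encode(data: bytes):
    table, out, i = {}, [], 0
    while i < len(data):
        if i >= CTX:
            h = _hash(data[i - CTX:i])
            pos = table.get(h)
            table[h] = i
            if pos is not None:
                n = 0  # count how far the predicted position matches
                while i + n < len(data) and data[pos + n] == data[i + n]:
                    n += 1
                out.append(("match", n))   # length only; position is implied
                if n:
                    i += n
                    continue
        out.append(("lit", data[i]))
        i += 1
    return out

def lzp_decode(tokens):
    table, out, t = {}, bytearray(), 0
    while t < len(tokens):
        i = len(out)
        if i >= CTX:
            h = _hash(bytes(out[i - CTX:i]))
            pos = table.get(h)
            table[h] = i
            if pos is not None:
                kind, n = tokens[t]; t += 1   # must be a "match" token
                for k in range(n):            # byte-wise copy handles overlap
                    out.append(out[pos + k])
                if n:
                    continue
        kind, b = tokens[t]; t += 1
        out.append(b)
    return bytes(out)

data = b"the cat sat on the mat, the cat sat on the mat"
assert lzp_decode(lzp_encode(data)) == data
```

The point of LZP in a BWT pipeline is different from a general LZ77 stage: it mainly collapses long repeats so the BWT block effectively covers more input.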
BWT algorithms - bzip3 in your case - excel on text and only on text; on anything else you throw at them, LZMA will far outperform. I do like the idea of compressing different blocks with different algorithms, but then we are already pushing what is a read-only FS into archiver territory.

Finally, regarding .lrz decompression times, xz takes 2:44.45 wall clock time, while bzip3 -j4 takes 4:30.55, which isn't quite symmetric, but I can admit that it's a bit too slow.

It will also use about the same amount of decoding memory as it required to encode, which is ~5N IIRC for libsais, whereas LZMA requires 32 KB + dict size to decode and 11.5 * dict size to encode. So the practical use cases of a BWT algorithm (bzip3) are limited to text, and even then only if you are willing to spend a lot of threads to make it fast and feed those threads a lot of memory. Not the right fit for a read-only FS - notice that no filesystem has ever used a BWT algorithm as its transparent compression method. :)
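Plugging the rule-of-thumb figures above into code makes the gap concrete. The 5N, 32 KB + dict, and 11.5 x dict constants are the rough estimates quoted in this comment, not exact numbers:

```python
# Back-of-the-envelope decode/encode memory, per the estimates above:
# BWT needs ~5N per block (libsais), and each parallel worker holds its
# own block; LZMA decode needs roughly dict + 32 KiB, encode ~11.5x dict.
MiB = 1 << 20

def bwt_decode_mem(block_size: int, threads: int) -> int:
    return 5 * block_size * threads          # every thread holds a full block

def lzma_decode_mem(dict_size: int) -> int:
    return dict_size + 32 * 1024

def lzma_encode_mem(dict_size: int) -> int:
    return int(11.5 * dict_size)

# 16 bzip3 threads at 64 MiB blocks vs a 203 MiB LZMA dictionary:
print(bwt_decode_mem(64 * MiB, 16) // MiB)   # 5120 MiB -- the "around 5GB" above
print(lzma_decode_mem(203 * MiB) // MiB)     # 203 MiB  -- the "around 205MB" above
```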
Small test on the Silesia corpus. LZMA (compiled with GCC 11.2 with generic settings):
time 7z a -t7z -m0=lzma:d=203m:fb=273:mc=1000000000:lc=8 silesia.tar.7z silesia.tar
7-Zip (a) 21.07 (x64) : Copyright (c) 1999-2021 Igor Pavlov : 2021-12-26
64-bit locale=en_US.UTF-8 Threads:16, ASM
Scanning the drive:
1 file, 211957760 bytes (203 MiB)
Creating archive: silesia.tar.7z
Add new data to archive: 1 file, 211957760 bytes (203 MiB)
Files read from disk: 1
Archive size: 48022348 bytes (46 MiB)
Everything is Ok
real 1m22.642s
user 2m5.335s
sys 0m1.010s
bzip3 (compiled with AOCC for the best case scenario)
time bzip3 -e -f -b 64 -j 16 silesia.tar silesia.tar.bz3
real 0m9.909s
user 0m24.006s
sys 0m1.238s
file sizes:
du silesia*
206992 silesia.tar
46900 silesia.tar.7z
46268 silesia.tar.bz3
decode:
time bzip3 -t -j 16 silesia.tar.bz3
real 0m8.246s
user 0m19.949s
sys 0m1.238s
time 7z t silesia.tar.7z
7-Zip (a) 21.07 (x64) : Copyright (c) 1999-2021 Igor Pavlov : 2021-12-26
64-bit locale=en_US.UTF-8 Threads:16, ASM
Scanning the drive for archives:
1 file, 48022348 bytes (46 MiB)
Testing archive: silesia.tar.7z
--
Path = silesia.tar.7z
Type = 7z
Physical Size = 48022348
Headers Size = 130
Method = LZMA:203m
Solid = -
Blocks = 1
Everything is Ok
Size: 211957760
Compressed: 48022348
real 0m1.782s
user 0m1.687s
sys 0m0.093s
bzip3 here wins by a very tiny margin in CR, with a 10x faster encode than LZMA, but it's 8x slower to decode, and using 16 threads will require about 10x the memory - around 5 GB, where LZMA would require around 205 MB to decode. PS: I didn't include the likes of flzma2 since they are only faster to encode, and Igor has since pushed updates (21.xx) with some speed optimizations.
I'm a bit reluctant to spend the effort of including bzip3 compression at the moment.
I did a quick-and-dirty performance test, but I reckon it's somewhat representative.
I took a DwarFS image with >1000 Perl installations (so mixed text and binary data) in it. This image uses lzma compression and is about 300 MiB in size (down from 50 GiB of input data). I created an "uncompressed" image from it:
mkdwarfs --recompress=all -l0 -i perl-install.dwarfs -o perl-install-uncompressed.dwarfs
This uncompressed image is 4.2 GiB. I used a script to extract the individual blocks:
extract_blocks.py perl-install-uncompressed.dwarfs blocks/perl-install
This resulted in 69 blocks (data, metadata, index, ...), most of which were 64 MiB in size. A few were as small as 552 bytes. Compressing each of these blocks individually gives an exact estimate of how well a compression algorithm performs in the context of DwarFS. So I did the following:
% time find blocks -type f -print0 | xargs -0 -n1 -P8 -- bash -c 'zstd -9 -c $0 >${0/blocks/zstd}.zstd'
find blocks -type f -print0 0.00s user 0.00s system 82% cpu 0.002 total
xargs -0 -n1 -P8 -- bash -c 'zstd -9 -c $0 >${0/blocks/zstd}.zstd' 107.67s user 4.11s system 758% cpu 14.742 total
% time find blocks -type f -print0 | xargs -0 -n1 -P8 -- bash -c '../bzip3 -b 511 -c $0 >${0/blocks/bzip3}.bz3'
find blocks -type f -print0 0.00s user 0.00s system 84% cpu 0.002 total
xargs -0 -n1 -P8 -- bash -c '../bzip3 -b 511 -c $0 >${0/blocks/bzip3}.bz3' 422.52s user 53.44s system 749% cpu 1:03.48 total
I picked -b 511 for bzip3 as that seemed to be where it performed quite well in its own benchmarks. I picked -9 for zstd pretty much randomly. As can be seen, zstd is about 4-5 times faster than bzip3. This is obviously highly dependent on the compression level, but already at this level, zstd compresses better than bzip3:
% du -hs blocks bzip3 zstd
4.2G blocks
488M bzip3
467M zstd
With lzma, as mentioned before, this goes down to about 300 MiB.
This is certainly not an exhaustive test, but I currently don't see bzip3 outperforming any of the existing options except in edge cases. For reference, I used the v1.1.8 x86_64 bzip3 binary.

I'm moving this issue to discussions for now; if there are significant improvements to bzip3, I'm going to reconsider adding support.
Over the last two weeks I've been working on bzip3, which is ultimately intended to replace bzip2. I believe that it'd be a good addition to dwarfs judging by the compression ratios it offers and its fairly good performance.

To mimic your benchmark in a way, first, I downloaded every version of Perl5 ever released and decompressed them. Then, I put all the resulting .tar files in a single .tar file and tried to compress it using various compressors. The results follow:
Then, I used lrzip to perform long-range deduplication on the original .tar file. Finally, I compressed the resulting none.tar.lrz file using bzip3. The results follow:
I believe that the long-range deduping that dwarfs offers, coupled with the block-level deduping implemented in bzip3 and a powerful BWT entropy post-coder, would excel in particular on source code, text, various documents, etc...
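For context, the BWT at the heart of such a post-coder can be sketched naively. Real implementations build a suffix array (e.g. with libsais) instead of this O(n² log n) rotation sort, but the effect on the data is the same:

```python
# Naive Burrows-Wheeler transform: sorting all rotations clusters bytes
# that share a following context, producing runs an entropy coder loves.
def bwt(s: bytes, sentinel: bytes = b"\x00") -> bytes:
    assert sentinel not in s
    s += sentinel                              # unique terminator for invertibility
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(r[-1] for r in rotations)     # last column of the sorted matrix

print(bwt(b"banana"))  # b'annb\x00aa' -- note the runs of equal bytes
```

This is why text (and source code, after dedup) suits BWT so well: repeated contexts abound, so the transformed block is full of long runs.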
I have also tested dwarfs on the directory with tarballs, yielding the following result:

In retrospect, I should have unpacked all the tarballs, but lrzip seems to have done a great job at deduping without this. Maybe this functionality could be implemented in dwarfs at some point in the future too.