Closed randocolab closed 2 years ago
Not at the moment but it's possible to add it. What's your use case?
Actually, never mind. I ran some tests and I think I'm currently I/O bottlenecked: even on my fastest drive the performance difference is negligible and sometimes worse... I'm working with a lot of (mostly) XML files that are relatively small, totaling 120 GB.
zstd v1.5.3:
9.3 GiB folder
with dictionary ( zstd -D ../dictionary -b1e3 -r folder )
Not enough memory; testing 2709 MB only...
1# 126114 files :2840941909 -> 763165098 (x3.723), 672.3 MB/s, 2083.5 MB/s
Not enough memory; testing 2709 MB only...
2# 126114 files :2840941909 -> 762555377 (x3.726), 521.7 MB/s, 1751.4 MB/s
Not enough memory; testing 2709 MB only...
3# 126114 files :2840941909 -> 743239075 (x3.822), 405.9 MB/s, 1864.2 MB/s
without ( zstd -b1e3 -r folder )
1# 126114 files :2840941909 -> 777194030 (x3.655), 603.1 MB/s, 1930.7 MB/s
Not enough memory; testing 2709 MB only...
2# 126114 files :2840941909 -> 771383515 (x3.683), 589.0 MB/s, 1983.4 MB/s
Not enough memory; testing 2709 MB only...
3# 126114 files :2840941909 -> 762421927 (x3.726), 445.9 MB/s, 1894.5 MB/s
Only XML, smaller sample size.
Without dict.
1# 7770 files : 615594476 -> 103708729 (x5.936), 618.1 MB/s, 1900.6 MB/s
2# 7770 files : 615594476 -> 103270086 (x5.961), 525.3 MB/s, 1755.5 MB/s
3# 7770 files : 615594476 -> 101909992 (x6.041), 405.7 MB/s, 1667.9 MB/s
With dict.
1# 7770 files : 615594476 -> 101050408 (x6.092), 607.5 MB/s, 1842.2 MB/s
2# 7770 files : 615594476 -> 102164634 (x6.026), 467.8 MB/s, 1533.3 MB/s
3# 7770 files : 615594476 -> 97459467 (x6.316), 373.0 MB/s, 1634.0 MB/s
I ran it on a 45 GB dataset. Default compression, with dictionary:
real 3m28.101s
user 2m6.894s
sys 0m20.311s
Default compression, without dictionary:
real 3m37.843s
user 2m8.074s
sys 0m21.430s
These results may not be completely accurate because of I/O and system load.
Extraction with dictionary
real 10m43.698s
user 0m49.051s
sys 0m40.257s
Extraction without dictionary
real 6m25.549s
user 0m40.446s
sys 0m33.192s
Re-opening: someone told me that if the block size is small, the dictionary can be of benefit. Decompression with a dictionary has a startup cost, paid once per block. These benchmarks used a huge block size, which makes the dictionary less effective... I believe? (Still trying to wrap my head around it.)
Yes, this is correct.
The zstd tool compresses everything into a single zstd block, and within one block you will not find an advantage in using a dictionary. Dictionaries come in handy when you have multiple blocks containing similar data. XML should be a good fit, as you have repeating tags everywhere. The idea is that when you create the first archive you also generate a dictionary, and then you reuse that dictionary with other archives with similar content, which benefit from the previous knowledge.
t2sz is different: it specifically splits the archive into multiple blocks, so you can seek around without having to decompress previous data. Very useful if you plan to mount your archive with ratarmount, for example.
In the zstd format each block is independent and no knowledge is shared between them. t2sz has different operation modes: when compressing tar archives it compresses one file per block by default, and you can specify a block size to store more files in one block. Raw mode is different; see the help for all the details. The idea is that you control the tradeoff between fewer, bigger blocks (better compression) and many smaller blocks (faster seeking, worse compression).
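The tradeoff can be sketched with zlib from the Python standard library; the records here are synthetic stand-ins for many tiny similar files, but zstd behaves the same way in this respect:

```python
import zlib

# 200 similar, individually tiny "XML" records standing in for the small files
# discussed above (synthetic data, not the real dataset).
records = [f"<entry id='{i}'><name>item</name><value>{i * 7}</value></entry>".encode()
           for i in range(200)]

# One big block: a single compressed stream over all records.
one_block = len(zlib.compress(b"".join(records), 9))

# Many small blocks: each record compressed on its own, as a seekable format
# must do, so no history is shared between blocks.
many_blocks = sum(len(zlib.compress(r, 9)) for r in records)

print(one_block, many_blocks)  # the single stream is far smaller in total
```

The per-block header/checksum overhead plus the lost cross-record redundancy is exactly what you pay for seekability.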
This is where dictionaries come in handy: if all the smaller blocks contain very similar data, adding a dictionary may give better compression while keeping fast seeking.
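zlib's preset-dictionary support (`zdict`) illustrates the point, since it plays the same role as a zstd dictionary for small independent blocks. This is a toy sketch with synthetic records and a naive concatenated dictionary, not the real `zstd --train` workflow, which is much smarter about picking representative fragments:

```python
import zlib

# Tiny, highly similar synthetic "XML" records (not the real dataset).
records = [f"<entry id='{i}'><name>item</name><value>{i * 7}</value></entry>".encode()
           for i in range(200)]

# A crude dictionary: the first 20 records concatenated.
dictionary = b"".join(records[:20])

def block_size(data: bytes, zdict: bytes = b"") -> int:
    # Compress one independent block, optionally seeded with a preset dictionary.
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return len(c.compress(data) + c.flush())

rest = records[20:]
without_dict = sum(block_size(r) for r in rest)
with_dict = sum(block_size(r, dictionary) for r in rest)

# Decompression must be seeded with the same dictionary; loading it is the
# per-block startup cost mentioned above.
c = zlib.compressobj(level=9, zdict=dictionary)
blob = c.compress(rest[0]) + c.flush()
d = zlib.decompressobj(zdict=dictionary)
assert d.decompress(blob) == rest[0]

print(with_dict, without_dict)  # the dictionary shrinks every small block
```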
When I first wrote the prototype of t2sz I played around with dictionaries internally, but it wasn't working well for our particular dataset, so I didn't bother implementing it. I can revisit it now.
Can I ask you to split your dataset into groups, create a dictionary from the first group, and reuse that dictionary on the others? For example, divide it into 10 folders and benchmark it. I want to collect some data to see if it's worth it or if the gain is negligible.
Filtering for XML only, I filled 10 folders with 100 MB of files each (no duplicates). I trained a dictionary on folder 1 and converted each folder to tar, then compressed each one with and without the dictionary.
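The ratios below are uncompressed size divided by compressed size, the same metric zstd prints as `x3.723` etc. A minimal sketch of the computation, using throwaway demo files rather than the real `N.tar` / `N.tar.zst` pairs:

```python
import os
import tempfile

def ratio(original: str, compressed: str) -> float:
    # Same metric as the zstd output above: uncompressed size / compressed size.
    return os.path.getsize(original) / os.path.getsize(compressed)

# Throwaway files with known sizes, just to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    tar = os.path.join(tmp, "1.tar")
    zst = os.path.join(tmp, "1.tar.zst")
    with open(tar, "wb") as f:
        f.write(b"x" * 1000)
    with open(zst, "wb") as f:
        f.write(b"y" * 200)
    r = ratio(tar, zst)

print(r)  # 1000 / 200 = 5.0
```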
Ratio of 9.tar
With Dict: 5.147270983303878
Without Dict: 5.143450892544796
Difference = 0.003820090759082184
Ratio of 1.tar -- The dict was based on this directory.
With Dict: 4.890452961214203
Without Dict: 4.889670514054602
Difference = 0.000782447159600963
Ratio of 8.tar
With Dict: 5.3239773387241325
Without Dict: 5.321755989558445
Difference = 0.0022213491656870588
Ratio of 4.tar
With Dict: 5.372482275830157
Without Dict: 5.368492587367826
Difference = 0.0039896884623304985
Ratio of 3.tar
With Dict: 5.236225448164826
Without Dict: 5.234008102451942
Difference = 0.0022173457128840113
Ratio of 5.tar
With Dict: 4.0884161552890355
Without Dict: 4.091272543149203
Difference = -0.0028563878601675086
Ratio of 10.tar
With Dict: 4.735349480065813
Without Dict: 4.732315379911547
Difference = 0.003034100154265751
Ratio of 7.tar
With Dict: 5.032492033727829
Without Dict: 5.025801574598314
Difference = 0.006690459129514714
Ratio of 2.tar
With Dict: 5.16204449702911
Without Dict: 5.160018201427454
Diffrence = 0.002026295601655903
Ratio of 6.tar
With Dict: 5.1853982615005
Without Dict: 5.180661346909717
Diffrence = 0.004736914590782959
Average ratio:
With dictionary, excluding folder 1:
5.031517385959476
Without dictionary, excluding folder 1:
5.028641846435471
Diff:
0.0028755395240045445
With dictionary:
5.017410943484949
Without dictionary:
5.014744713197385
Diff: 0.0026662302875637423
I'm wondering if I did something wrong, because the difference in compression ratio with and without the dictionary is worst for the 1st folder, despite the dictionary being based on that folder...
You are seeing results that more or less replicate my initial experiments.
I was compressing binary data; I was sure that XML would work better. I guess zstd is just very good even without a dictionary, or neither of us has figured out how to create a working dictionary hahaha.
I will take some time this weekend to run some experiments with it, just to be sure we are not missing something.
Tested, but it isn't worth it.
I was wondering if t2sz supports dictionaries.