martinellimarco / t2sz

Compress a file into a seekable zstd with special handling for .tar archives
GNU General Public License v3.0

Dictionary support? #5

Closed randocolab closed 2 years ago

randocolab commented 2 years ago

I was wondering if t2sz supports dictionaries.

martinellimarco commented 2 years ago

Not at the moment, but it's possible to add it. What's your use case?

randocolab commented 2 years ago

Actually, never mind. I ran some tests and I think I'm currently I/O bottlenecked: even on my fastest drive the performance difference is negligible, and sometimes worse. I'm working with a lot of (mostly) XML files that are relatively small, totaling 120 GB.
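The dictionary below was built with zstd's --train mode, something along these lines (the paths are placeholders and the exact flags may have differed):

# Train a dictionary from a recursive sample of the XML files;
# --maxdict caps the dictionary size (112640 bytes is the zstd default)
zstd --train -r folder -o ../dictionary --maxdict=112640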

zstd v1.5.3:

9.3 GiB folder

with dictionary ( zstd -D ../dictionary -b1e3 -r folder )

Not enough memory; testing 2709 MB only...
 1# 126114 files     :2840941909 -> 763165098 (x3.723),  672.3 MB/s, 2083.5 MB/s
Not enough memory; testing 2709 MB only...
 2# 126114 files     :2840941909 -> 762555377 (x3.726),  521.7 MB/s, 1751.4 MB/s
Not enough memory; testing 2709 MB only...
 3# 126114 files     :2840941909 -> 743239075 (x3.822),  405.9 MB/s, 1864.2 MB/s

without ( zstd -b1e3 -r folder )

 1# 126114 files     :2840941909 -> 777194030 (x3.655),  603.1 MB/s, 1930.7 MB/s
Not enough memory; testing 2709 MB only...
 2# 126114 files     :2840941909 -> 771383515 (x3.683),  589.0 MB/s, 1983.4 MB/s
Not enough memory; testing 2709 MB only...
 3# 126114 files     :2840941909 -> 762421927 (x3.726),  445.9 MB/s, 1894.5 MB/s

Only XML, smaller sample size.

 Without dict.
 1# 7770 files       : 615594476 -> 103708729 (x5.936),  618.1 MB/s, 1900.6 MB/s 
 2# 7770 files       : 615594476 -> 103270086 (x5.961),  525.3 MB/s, 1755.5 MB/s 
 3# 7770 files       : 615594476 -> 101909992 (x6.041),  405.7 MB/s, 1667.9 MB/s 
 With dict.
 1# 7770 files       : 615594476 -> 101050408 (x6.092),  607.5 MB/s, 1842.2 MB/s 
 2# 7770 files       : 615594476 -> 102164634 (x6.026),  467.8 MB/s, 1533.3 MB/s 
 3# 7770 files       : 615594476 ->  97459467 (x6.316),  373.0 MB/s, 1634.0 MB/s 

I ran it on a 45 GB dataset. Default compression, with dictionary:


real    3m28.101s
user    2m6.894s
sys     0m20.311s

Default compression, without dictionary:

real    3m37.843s
user    2m8.074s
sys     0m21.430s

These results may not be completely accurate because of I/O and system load.

Extraction with dictionary:

real    10m43.698s
user    0m49.051s
sys     0m40.257s

Extraction without dictionary:

real    6m25.549s
user    0m40.446s
sys     0m33.192s
randocolab commented 2 years ago

Re-opening: someone told me that if the block size is small, a dictionary can be of benefit. Decompression with a dictionary has a startup cost, which would be paid once for each block. These benchmarks use a huge block size, making the dictionary less effective... I believe? (Still trying to wrap my head around it.)

martinellimarco commented 2 years ago

Yes, this is correct.

The zstd tool compresses everything into a single zstd block, and within that block you will not find an advantage in using a dictionary. Dictionaries come in handy when you have multiple blocks containing similar data. XML should be a good fit, as you have repeating tags everywhere. The idea is that the first archive you create also generates a dictionary, and then you reuse that dictionary with other archives with similar content, which benefit from the previous knowledge.
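With the zstd CLI that workflow looks roughly like this (file and dictionary names are placeholders):

# Train a dictionary on a first batch of representative files
zstd --train batch1/*.xml -o xml.dict
# Reuse it to compress later archives with similar content
zstd -D xml.dict batch2.tar -o batch2.tar.zst
# The same dictionary is needed again at decompression time
zstd -D xml.dict -d batch2.tar.zst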

In t2sz things are different: t2sz specifically splits the archive into multiple blocks, so that you can seek around without having to decompress previous data. This is very useful if you plan to mount your archive with ratarmount, for example.
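Something like this (output name and mount point are placeholders):

# Compress the tar into a seekable zstd, then mount it;
# ratarmount can then read any file without decompressing everything before it
t2sz archive.tar -o archive.tar.zst
ratarmount archive.tar.zst mnt/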

In the zstd format each block is independent and knowledge is not shared between blocks. t2sz has different operation modes: when compressing tar archives it compresses 1 file per block by default, and you can change that by specifying a block size, which allows more files to be stored in one block. Raw mode works differently; see the help for all the details. The idea is that you control the tradeoff between fewer, bigger blocks (better compression) and many smaller blocks (faster seeking, worse compression), as in the sketch below.
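A sketch of the two extremes (the level and size are just examples; see t2sz -h for the exact semantics):

# Default for tar input: 1 file per block, fastest seeking, worse ratio
t2sz archive.tar -l 19 -o archive.fine.tar.zst
# Group files into blocks of at least 10M: better ratio, coarser seeking
t2sz archive.tar -l 19 -s 10M -o archive.coarse.tar.zst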

This is where dictionaries come in handy: if all the smaller blocks contain very similar data, then adding a dictionary may result in better compression while keeping the faster seeking.

When first writing the prototype of t2sz I played around with dictionaries internally, but it wasn't working for our particular dataset, so I didn't bother implementing it. I can revisit it now.

Can I ask you to split your dataset into groups, create a dictionary from the first group, and reuse that dictionary on the others? For example, divide it into 10 folders and benchmark it. I want to collect some data to see if it's worth it or if the gain is negligible.
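Something along these lines (folder names are placeholders, and the compression level is up to you):

# dataset split into folders 1..10; train only on the first group
zstd --train -r 1/ -o xml.dict
for i in $(seq 1 10); do
    tar -cf $i.tar $i/
    zstd $i.tar -o $i.nodict.tar.zst
    zstd -D xml.dict $i.tar -o $i.dict.tar.zst
done
# then compare the sizes of *.dict.tar.zst vs *.nodict.tar.zst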

randocolab commented 2 years ago

> Can I ask you to split your dataset into groups, create a dictionary from the first group, and reuse that dictionary on the others? For example, divide it into 10 folders and benchmark it. I want to collect some data to see if it's worth it or if the gain is negligible.

Filtered for XML only, I filled 10 folders with 100 MB of (non-duplicate) files each. I trained the dict on folder 1, converted each folder to tar, and applied compression to each one.
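The ratios are the uncompressed tar size divided by the compressed size; one way to compute them (filenames are placeholders):

# ratio = tar size / compressed size
echo "scale=15; $(stat -c%s 9.tar) / $(stat -c%s 9.tar.zst)" | bc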

Ratio of 9.tar
With     Dict: 5.147270983303878
Without  Dict: 5.143450892544796
Difference = 0.003820090759082184
Ratio of 1.tar -- The dict was based on this directory.
With     Dict: 4.890452961214203
Without  Dict: 4.889670514054602
Difference = 0.000782447159600963
Ratio of 8.tar
With     Dict: 5.3239773387241325
Without  Dict: 5.321755989558445
Difference = 0.0022213491656870588
Ratio of 4.tar
With     Dict: 5.372482275830157
Without  Dict: 5.368492587367826
Difference = 0.0039896884623304985
Ratio of 3.tar
With     Dict: 5.236225448164826
Without  Dict: 5.234008102451942
Difference = 0.0022173457128840113
Ratio of 5.tar
With     Dict: 4.0884161552890355
Without  Dict: 4.091272543149203
Difference = -0.0028563878601675086
Ratio of 10.tar
With     Dict: 4.735349480065813
Without  Dict: 4.732315379911547
Difference = 0.003034100154265751
Ratio of 7.tar
With     Dict: 5.032492033727829
Without  Dict: 5.025801574598314
Difference = 0.006690459129514714
Ratio of 2.tar
With     Dict: 5.16204449702911
Without  Dict: 5.160018201427454
Difference = 0.002026295601655903
Ratio of 6.tar
With     Dict: 5.1853982615005
Without  Dict: 5.180661346909717
Difference = 0.004736914590782959

Average ratio:

With dictionary, excluding directory 1:
5.031517385959476
Without dictionary, excluding directory 1:
5.028641846435471
Diff: 0.0028755395240045445
With dictionary:
5.017410943484949
Without dictionary:
5.014744713197385
Diff: 0.0026662302875637423

I'm wondering if I did anything wrong, because the difference in compression ratio with and without the dict is the worst for the 1st folder, despite the dict being based on that folder...

martinellimarco commented 2 years ago

You are seeing results that more or less replicate my initial experiments.

I was compressing binary data, so I was sure that XML would have worked better. I guess zstd is just very good even without a dictionary, or we both didn't figure out how to create a working dictionary, hahaha.

I will take some time this weekend to do some experiments with it, just to be sure we are not missing something.

martinellimarco commented 2 years ago

Tested, but it isn't worth it.