oleg-st / ZstdSharp

Port of the zstd compression library to C#

Provide compression level for training dictionary #22

Closed by pkese 11 months ago

pkese commented 1 year ago

Apparently, to get optimal performance from a dictionary, it should be trained at the same compression level at which it is going to be used.

Zstd's minimum search pattern size depends on the compression level, e.g. at low compression levels the minimum pattern size is 4 bytes or more.
At higher compression levels, Zstd will trade CPU time to search for patterns smaller than 4 bytes.

If a dictionary is trained at a low compression level, it will contain only large patterns.
If that dictionary is then used to compress actual data at a high compression level,
it won't contain any patterns shorter than 4 bytes, so Zstd will do a lot of searching in vain. Consequently, Zstd will waste energy and the dictionary won't be used as efficiently.

Please provide an option to set the compression level when training the dictionary.

https://github.com/oleg-st/ZstdSharp/blob/5bd8080e555c5efd47b5de7e9725d4be47a4c438/src/ZstdSharp/Unsafe/Zdict.cs#L482

pkese commented 1 year ago

I'm measuring about a 2.9% increase in compression rate with the patch in #23 applied.

For the test, I'm compressing a few tens of lines of short texts totaling 13206 bytes of raw text.
With a dictionary trained at the default compression level, that gets compressed down to 1560 bytes;
with a dictionary trained at the same level as the data compression, it comes down to 1515 bytes.
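
As a quick sanity check on the figures above, here is a small Python sketch that only redoes the arithmetic. It is not part of ZstdSharp or the patch, and the names `size_reduction` and `ratio_gain` are made up for illustration. It works out both readings of "increase in compression rate": the relative drop in output size and the relative gain in compression ratio.

```python
# Byte counts quoted in the comment above.
RAW_BYTES = 13206            # uncompressed test input
DEFAULT_DICT_BYTES = 1560    # output with dictionary trained at the default level
MATCHED_DICT_BYTES = 1515    # output with dictionary trained at the compression level


def size_reduction(before: int, after: int) -> float:
    """Relative drop in compressed size (hypothetical helper)."""
    return (before - after) / before


def ratio_gain(raw: int, before: int, after: int) -> float:
    """Relative gain in compression ratio raw/compressed (hypothetical helper)."""
    return (raw / after) / (raw / before) - 1


reduction = size_reduction(DEFAULT_DICT_BYTES, MATCHED_DICT_BYTES)
gain = ratio_gain(RAW_BYTES, DEFAULT_DICT_BYTES, MATCHED_DICT_BYTES)

print(f"compressed size reduced by {reduction:.2%}")
print(f"compression ratio improved by {gain:.2%}")
```

Either way the difference is a little under 3%, which matches the reported figure.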

oleg-st commented 1 year ago

You can call ZDICT_optimizeTrainFromBuffer_fastCover with any required parameters from your code and write your own safe wrapper for it. It is part of the unsafe public API of the ZstdSharp library, which is much the same as that of the original zstd library. Safe wrappers such as DictBuilder have less functionality and are subject to extension.

oleg-st commented 11 months ago

Added the TrainFromBufferFastCover method.