oleg-st / ZstdSharp

Port of the zstd compression library to C#
MIT License

Example needed: TrainFromBufferFastCover #27

Closed by minimalisticMe 11 months ago

minimalisticMe commented 11 months ago

Hi,

After the recent update there is a new function, ZstdSharp.DictBuilder.TrainFromBufferFastCover, which I think enriches this library massively and makes zstd perfect for .NET usage. I want to train a custom compression dictionary but am having trouble getting results.

Here is my code, roughly minimized. First I read all files (JSON files) and collect their contents in a list of byte[]:

// Needs: using System.Collections.Generic; using System.IO;
private static List<byte[]> TestData()
{
    var paths = TestDataList(); // list of paths to json-files

    // Each file becomes one training sample.
    var samples = new List<byte[]>();
    foreach (var path in paths)
    {
        samples.Add(File.ReadAllBytes(path));
    }

    return samples;
}

After that I want to train a dictionary on them by calling

var testData = TestData();
var bytes = ZstdSharp.DictBuilder.TrainFromBufferFastCover(testData, 5);

and write the result to a file for future use.

However, calling ZstdSharp.DictBuilder.TrainFromBufferFastCover results in the following error, and I don't know how to handle it correctly:

ZstdSharp.ZstdException : Src size is incorrect

    ThrowHelper.ThrowException(UIntPtr returnValue, String message)
    ThrowHelper.EnsureZdictSuccess(UIntPtr returnValue)
    DictBuilder.TrainFromBufferFastCover(IEnumerable`1 samples, ZDICT_fastCover_params_t params, Int32 dictCapacity)
    DictBuilder.TrainFromBufferFastCover(IEnumerable`1 samples, Int32 level, Int32 dictCapacity)
    TrainZstdDictionary.TrainDictionary() line 20
    RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
    MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)

How do I train a custom dictionary on my own files correctly?

oleg-st commented 11 months ago

Hi,

You may not have enough samples to train on. Try providing more.

https://github.com/oleg-st/ZstdSharp/blob/26c25793cad1bdd82223437f4ceb3ea31708465d/src/ZstdSharp/Unsafe/Zdict.cs#L465C33-L465C33
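
For context on what "enough samples" might mean: the zstd dictionary builder documentation recommends providing a few thousand samples, with a total sample size of roughly 100 times the target dictionary size. A minimal sketch of a pre-flight check along those lines is shown here. The class name SampleCheck, the 1000-sample threshold, and the 112640-byte capacity are assumptions made for illustration and are not part of ZstdSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ZstdSharp.
public static class SampleCheck
{
    // Assumption: 112640 bytes (110 KiB), the zstd command-line default
    // dictionary size. Pass the capacity you actually train with.
    public const int AssumedDictCapacity = 112640;

    // Illustrative thresholds based on the zstd guidance of "a few thousand
    // samples" and a total size of about 100x the dictionary size.
    public const int SuggestedMinSamples = 1000;
    public const int SuggestedSizeFactor = 100;

    public static (int Count, long TotalBytes, bool LooksSufficient) Summarize(
        IReadOnlyCollection<byte[]> samples, int dictCapacity = AssumedDictCapacity)
    {
        // Empty samples contribute nothing to training, so leave them out.
        var nonEmpty = samples.Where(s => s != null && s.Length > 0).ToList();
        long totalBytes = nonEmpty.Sum(s => (long)s.Length);

        bool sufficient = nonEmpty.Count >= SuggestedMinSamples
                          && totalBytes >= (long)dictCapacity * SuggestedSizeFactor;

        return (nonEmpty.Count, totalBytes, sufficient);
    }
}
```

Calling SampleCheck.Summarize(testData) before ZstdSharp.DictBuilder.TrainFromBufferFastCover(testData, 5) gives a quick indication of whether the training set is likely too small. It is only a heuristic; the library's own check remains authoritative.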

minimalisticMe commented 11 months ago

I apologize for the late answer; it took me some time to procure more test data. Now, with more than 12,000 JSON files, it is working just fine :)

Mrgaton commented 1 month ago

How much did the compression improve with a trained dictionary?

oleg-st commented 1 month ago

https://github.com/facebook/zstd?tab=readme-ov-file#the-case-for-small-data-compression
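
The gain depends heavily on the data, so measuring it on your own samples is probably the most reliable answer. A rough sketch follows, assuming ZstdSharp's Compressor exposes LoadDictionary(byte[]) and Wrap as in the project README. The class name DictionaryGain and the default level of 3 are chosen here for illustration only.

```csharp
using System;
using ZstdSharp;

// Hypothetical helper, not part of ZstdSharp.
public static class DictionaryGain
{
    // Compresses the same sample twice, once without and once with the
    // dictionary, and returns both compressed sizes in bytes.
    public static (int Plain, int WithDictionary) CompressedSizes(
        byte[] dictionary, byte[] sample, int level = 3)
    {
        using var plain = new Compressor(level);

        using var withDict = new Compressor(level);
        withDict.LoadDictionary(dictionary);

        return (plain.Wrap(sample).Length, withDict.Wrap(sample).Length);
    }
}
```

Summing both sizes over a set of files that were not used for training gives a fairer picture than measuring the training samples themselves.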

Mrgaton commented 1 month ago

One question: if I compress data with my custom-trained dictionary, can I then decompress it without loading my dictionary?

oleg-st commented 1 month ago

No
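
In other words, the decompressor has to load the same dictionary bytes that the compressor used. A minimal sketch of the round trip, assuming ZstdSharp's Compressor and Decompressor both expose LoadDictionary(byte[]) alongside the Wrap and Unwrap calls from the project README. The class and method names are hypothetical.

```csharp
using System;
using ZstdSharp;

// Hypothetical helper, not part of ZstdSharp.
public static class DictionaryRoundTrip
{
    public static byte[] Compress(byte[] dictionary, byte[] data, int level = 3)
    {
        using var compressor = new Compressor(level);
        compressor.LoadDictionary(dictionary);
        return compressor.Wrap(data).ToArray();
    }

    public static byte[] Decompress(byte[] dictionary, byte[] compressed)
    {
        using var decompressor = new Decompressor();
        // Must be the same dictionary bytes that were used to compress.
        decompressor.LoadDictionary(dictionary);
        return decompressor.Unwrap(compressed).ToArray();
    }

    // For data compressed with a trained dictionary this is expected to
    // return false, because the frame records a dictionary ID that a
    // decoder without the dictionary cannot match.
    public static bool UnwrapsWithoutDictionary(byte[] compressed)
    {
        try
        {
            using var decompressor = new Decompressor();
            decompressor.Unwrap(compressed);
            return true;
        }
        catch (ZstdException)
        {
            return false;
        }
    }
}
```

In practice this means the dictionary file has to be stored and shipped alongside any code that reads the compressed data, and kept unchanged for as long as that data exists.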