klauspost / compress

Optimized Go Compression Packages

BuildZStdDict Testing #950

Closed macneale4 closed 4 months ago

macneale4 commented 6 months ago

Hello!

I'm hoping to use the compress package for packing bytes in our archive format for Dolt (https://github.com/dolthub/dolt). Our data is very amenable to custom dictionaries for small sets (5-50) of binary blobs which average about 4K in size.

We built our POC using the gozstd package which has Dictionary support: https://pkg.go.dev/github.com/valyala/gozstd

Now that I'm convinced this is a good approach for us, I'm curious how we can move the dict.BuildZstdDict function out of the experimental phase, as described in dict/README.md. In particular, the README says it doesn't work well with small sets, and indeed I got a panic when using a set of 32 samples.

How far from complete do you think it is? How can I help get this tested and get the bugs ironed out? I have one local test case I can share with you if that's helpful.

FWIW, using the dictionary produced by valyala/gozstd, I can compress/decompress successfully with the compress/zstd package. So I think it's just a matter of generating a correct dict and we'll be able to move forward.

Thanks!

klauspost commented 6 months ago

I'm curious how we can move the dict.BuildZstdDict function out of the experimental phase

TBH I don't see it happening. I don't have the bandwidth for it at the moment.

However if you can make a simple reproducer for any crash I will fix it.

klauspost commented 6 months ago

Also sorry for being a bit distracted in my last replies... Let me provide some context:

dict.BuildZstdDict(input [][]byte, o Options) is the higher level API.

What it mainly does is create a common dictionary from the input samples you provide, deduplicating and truncating to opts.MaxDictSize. Once that is done it calls zstd.BuildDict to construct the actual dictionary.

It will attempt to compress all of Contents [][]byte with History []byte as a common history to generate the optimal tables for these. These tables are then all combined with the history and written in the zstd dictionary format.

There are for sure untested corner cases in this process, and it is mostly a "released PoC" - as in, it works some/most of the time and it is "ok" in terms of (compression) efficiency.

macneale4 commented 6 months ago

That's good context. I'll try to bypass the higher-level API and use zstd.BuildDict directly next. In the meantime, here is a test which is failing.

https://github.com/macneale4/compress/commit/956c229351da335b4f95bfca04006b4e29fe5e87

There is a divide-by-0 error, which I got around with a hack (see commit); I doubt that is right. But then, once the dictionary is created, the loadDict method fails with the error tableLog too large.

I'm out of my element when it comes to compression, but I will probably be pecking at this for the next few days. We'd really like to avoid using CGO in our product.

klauspost commented 6 months ago

Fix in https://github.com/klauspost/compress/pull/951

Made a few tweaks to your test.

A) You are writing to the input buffer when decompressing.
B) EncodeAll/DecodeAll appends - adding a [:0] to make that clear.
C) MaxDictSize is for the backreference size, not the total size.

package dict

import (
    "bytes"
    "io"
    "math/rand"
    "os"
    "testing"

    "github.com/klauspost/compress/zstd"
)

func TestZStdDict(t *testing.T) {
    for _, level := range []zstd.EncoderLevel{zstd.SpeedFastest, zstd.SpeedDefault, zstd.SpeedBetterCompression, zstd.SpeedBestCompression} {
        testZStdDict(t, level)
    }
}

func testZStdDict(t *testing.T, level zstd.EncoderLevel) {
    out := io.Discard
    if testing.Verbose() {
        out = os.Stdout
    }
    opts := Options{
        MaxDictSize:    2048,
        HashBytes:      4,
        Output:         out,
        ZstdDictID:     0,
        ZstdDictCompat: false,
        ZstdLevel:      level,
    }

    inBuf := make([]byte, 0, 4096)
    outBuf := make([]byte, 0, 4096)

    // This is 32K worth of data, but it's all very similar. Only fits in 4K if compressed with a dictionary.
    samples := generateSimilarByteSlices(42, 32)

    dict, err := BuildZstdDict(samples, opts)
    if err != nil {
        t.Fatal(err.Error())
    }

    totalSize := 0
    for _, blob := range samples {
        compressed, err := zCompressDict(inBuf, dict, blob)
        if err != nil {
            t.Fatal(err.Error())
        }
        totalSize += len(compressed)

        // Check round trip.
        decompressed, err := zDecompressDict(outBuf, dict, compressed)
        if err != nil {
            t.Fatal(err.Error())
        }
        if !bytes.Equal(decompressed, blob) {
            t.Fatal("Round trip failed")
        }
    }
    if totalSize > 4096 {
        t.Fatal("Total compressed size exceeds 4096 bytes")
    }
    t.Log("Total compressed size:", totalSize)
}

func zCompressDict(dst, dict, data []byte) ([]byte, error) {
    encoder, err := zstd.NewWriter(nil, zstd.WithEncoderDict(dict))
    if err != nil {
        return nil, err
    }
    defer encoder.Close()

    result := encoder.EncodeAll(data, dst[:0])
    return result, nil
}

func zDecompressDict(dst, dict, data []byte) ([]byte, error) {
    decoder, err := zstd.NewReader(nil, zstd.WithDecoderDicts(dict))
    if err != nil {
        return nil, err
    }
    defer decoder.Close()

    result, err := decoder.DecodeAll(data, dst[:0])
    if err != nil {
        return nil, err
    }

    return result, nil
}

// Creates a slice of byte slices, each generated from the same random seed, so they are very similar. The length
// of each slice is 1024 + the index of the slice.
func generateSimilarByteSlices(seed int64, count int) [][]byte {
    chks := make([][]byte, count)
    for i := 0; i < count; i++ {
        chks[i] = generateRandomByteSlice(seed, 1024+i)
        if false {
            // Generate a small diff.
            chks[i][i] = byte(seed)
        }
    }

    return chks
}

func generateRandomByteSlice(seed int64, n int) []byte {
    r := rand.NewSource(seed)

    data := make([]byte, n)
    for i := range data {
        data[i] = byte(r.Int63())
    }
    return data
}

With your permission I'd like to include that as a test.

macneale4 commented 6 months ago

I pulled your changes, and they do result in our unit tests passing. Great!

When exercising against real data, I ran into another edge case. I pushed a new test to my fork branch:

https://github.com/macneale4/compress/tree/zstd-dict-test

The tip commit has a speculative change (allowing v == threshold) which allows my testing to continue. I have no idea if that's correct, though.

You are welcome to take any of this and stick it into your tests. The data is coming from one of our public data repos.

klauspost commented 6 months ago

The threshold is mostly to filter out candidates in large sets. Though everything being discarded could indicate that there is little of actual value. I'll merge your change back into the PR.

For production use you'd probably want to set up a fuzz test to check that border cases are handled.