valyala / gozstd

go wrapper for zstd
MIT License
421 stars 60 forks source link

Add support for custom window sizes (long distance matching) #7

Closed pushshift closed 4 years ago

pushshift commented 5 years ago

The default for zstd appears to be 27 (the --long option). It would be great if this could be exposed via this module. I believe this is the relevant code from zstandard's code repository: https://github.com/facebook/zstd/blob/fbcae274a4752df2c386778808948e88fb89b295/lib/compress/zstd_ldm.c

Thanks for the awesome work!

arl commented 4 years ago

Indeed It would be great to have this possibility in gozstd!

On some datasets we're using at my company, giving a specific window size sometimes gives important savings, from which the following plots are examples, by trading memory for speed and/or compression ratio.

imp results eve results

ratio is: original/compressed speed (C): compression speed in MB/s

valyala commented 4 years ago

Wow! These are good improvements in compression ratio, @arl . Will take a look into adding support for custom window sizes soon.

arl commented 4 years ago

Hi @valyala . Would you accept a PR for this? I currently have a slightly modified version of gozstd, based on today's master, which supports providing the window size option.

arl commented 4 years ago

The code to add support to custom window size (through the windowLog zstd parameter) is relatively easy to add. The main doubt I have is about the API.

I guess the change should be backward compatible, we don't want to trigger a new major version for that.

Setting the window size is mostly useful for gozstd.Writer, which already has various constructors: NewWriter, NewWriterLevel and NewWriterLevelDict.

Now, adding a new variant for each of those is not difficult, but it's neither elegant nor future-proof.

What would happen if after window size another parameter were to be added? The number of different NewWriterXXX constructors would quickly undergo a combinatorial explosion

I dont know if this issue is the best place to discuss this but I was thinking to something like:

package gozstd

type WriterParams struct {
    CompressionLevel *int
    WindowLog *int
    Dict *CDict
}

func NewWriterParams(w io.Writer, p WriterParams) *Writer {
    ...
}

Each field of WriterParams is a pointer so that nil values indicate wether the parameter has been set. This can easily be evolved without breaking backward compatibility if new parameters are added and existing fields are never removed.

In the meantime I'm currently implementing, in a fork since I need it, the first proposition, adding a variant for each of the currently existing Writer constructors, but happy to switch path. Let me know.

valyala commented 4 years ago

The proposed API looks good in general, but I'd use non-pointer int values inside it:

type WriterParams struct {
    CompressionLevel int  // 0 means 'default compression level'
    WindowLog int  // 0 means 'default windowLog'
    Dict *CDict
}

Then this API will be easier to use. Compare:

wp := WriterParams{
    CompressionLevel: 42,
    WindowLog: 2,
}

to

compressionLevel := 42
windowLog := 2
wp := WriterParams{
    CompressionLevel: &compressionLevel,
    WindowLog: &windowLog,
}
arl commented 4 years ago

Great. Makes sense

valyala commented 4 years ago

The WriterParams API is available starting from v1.7.0.

@arl , thanks for great contribution!