Closed pushshift closed 4 years ago
Indeed It would be great to have this possibility in gozstd
!
On some datasets we're using at my company, giving a specific window size sometimes gives important savings, from which the following plots are examples, by trading memory for speed and/or compression ratio.
ratio is: original/compressed speed (C): compression speed in MB/s
Wow! These are good improvements in compression ratio, @arl . Will take a look into adding support for custom window sizes soon.
Hi @valyala .
Would you accept a PR for this?
I currently have a slightly modified version of gozstd
, based on today's master, which supports providing the window size option.
The code to add support to custom window size (through the windowLog zstd parameter) is relatively easy to add. The main doubt I have is about the API.
I guess the change should be backward compatible, we don't want to trigger a new major version for that.
Setting the window size is mostly useful for gozstd.Writer
, which already has various constructors: NewWriter
, NewWriterLevel
and NewWriterLevelDict
.
Now, adding a new variant for each of those is not difficult, but it's neither elegant nor future-proof.
What would happen if after window size another parameter were to be added? The number of different NewWriterXXX
constructors would quickly undergo a combinatorial explosion
I dont know if this issue is the best place to discuss this but I was thinking to something like:
package gozstd
type WriterParams struct {
CompressionLevel *int
WindowLog *int
Dict *CDict
}
func NewWriterParams(w io.Writer, p WriterParams) *Writer {
...
}
Each field of WriterParams
is a pointer so that nil values indicate wether the parameter has been set. This can easily be evolved without breaking backward compatibility if new parameters are added and existing fields are never removed.
In the meantime I'm currently implementing, in a fork since I need it, the first proposition, adding a variant for each of the currently existing Writer constructors, but happy to switch path. Let me know.
The proposed API looks good in general, but I'd use non-pointer int
values inside it:
type WriterParams struct {
CompressionLevel int // 0 means 'default compression level'
WindowLog int // 0 means 'default windowLog'
Dict *CDict
}
Then this API will be easier to use. Compare:
wp := WriterParams{
CompressionLevel: 42,
WindowLog: 2,
}
to
compressionLevel := 42
windowLog := 2
wp := WriterParams{
CompressionLevel: &compressionLevel,
WindowLog: &windowLog,
}
Great. Makes sense
The default for zstd appears to be 27 (the --long option). It would be great if this could be exposed via this module. I believe this is the relevant code from zstandard's code repository: https://github.com/facebook/zstd/blob/fbcae274a4752df2c386778808948e88fb89b295/lib/compress/zstd_ldm.c
Thanks for the awesome work!