dsnet opened this issue 2 years ago
I don't see anywhere to keep state here. Is this the correct usage? In particular, are we required to always keep passing the same destination slice to each call? Is there really no other state required?
    buf := make([]byte, 4096)
    var dst []byte
    for {
        n, err := r.Read(buf)
        if err != nil && err != io.EOF {
            return nil, err
        }
        if n == 0 {
            break
        }
        dst, err = gzip.AppendEncoded(dst, buf[:n], 9)
        if err != nil {
            return nil, err
        }
    }
    return dst, nil
The intention of the proposed API is for use when you already have a []byte in its entirety on hand and you want to continue dealing with it as a []byte. Anytime you start with or end with an io.Reader or io.Writer, the current API is superior.
For your example, you could do:
    b, err := io.ReadAll(r) // converts io.Reader to []byte
    if err != nil {
        return nil, err
    }
    return gzip.AppendEncoded(nil, b, 9) // stays in the []byte world
but I'm not sure if that's any faster or slower than doing:
    var bb bytes.Buffer
    zw := gzip.NewWriter(&bb) // stays in the io.Writer world
    if _, err := io.Copy(zw, r); err != nil { // connects io.Writer and io.Reader together
        return nil, err
    }
    if err := zw.Close(); err != nil { // Close flushes the final block, so check its error
        return nil, err
    }
    return bb.Bytes(), nil // converts io.Writer into a []byte
Calling gzip.AppendEncoded within each read-loop would be a fairly odd use of the API. It is technically correct, since each call to gzip.AppendEncoded would append the compressed form of buf[:n] as an individual gzip file to the destination, and a sequence of gzip files concatenated together is still a valid gzip file per RFC 1952. However, since the chunk size is 4KiB, I would not expect the compression ratio in this case to be particularly great, since we're effectively limiting the compression window to 4KiB.
This API would remove the allocation of the window. How much can we remove of the other allocated state? Can these be implemented with no allocation other than the append? (That is, can we get everything to be stack-allocated? Or were you planning on a sync.Pool?)
I was planning on using a sync.Pool. With work, I'm fairly certain the Huffman coders can be stack-allocated (since they're fairly constant in size); I'm not sure about the LZ77 search state.
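A hypothetical sketch of what pooling the compressor state might look like; the encoderState type, its field, and the appendEncoded helper are illustrative stand-ins, not the real implementation (the "compression" here is just a copy):

```go
package main

import (
	"fmt"
	"sync"
)

// encoderState stands in for the Huffman encoder and LZ77 search state
// that would otherwise be allocated on every call.
type encoderState struct {
	hashTable [1 << 15]uint32 // illustrative LZ77 hash-chain table
}

var encPool = sync.Pool{
	New: func() any { return new(encoderState) },
}

// appendEncoded mimics the shape of the proposed API: it borrows scratch
// state from the pool so steady-state calls allocate little beyond the
// appended bytes themselves.
func appendEncoded(dst, src []byte) []byte {
	e := encPool.Get().(*encoderState)
	defer encPool.Put(e)
	_ = e.hashTable // real code would run LZ77+Huffman using e
	return append(dst, src...)
}

func main() {
	out := appendEncoded(nil, []byte("data"))
	fmt.Println(len(out))
}
```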
This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. – rsc for the proposal review group
Is Encoded really the right word here? Should these be named AppendCompressed and AppendDecompressed?
I'm fine with either AppendEncoded/AppendDecoded or AppendCompressed/AppendDecompressed. I see "compression" as a particular type of "encoding". Decoded is shorter than Decompressed, and the fact that we're dealing with compression is implied by the package name: gzip.AppendDecoded.
strconv.AppendQuote is not AppendQuoted (no d). So it seems like we can use AppendCompress and AppendDecompress, which will be very clear. Quoting is a kind of encoding too, of course, but it's probably clearer to say the encoding than just 'Encoded'.
Does anyone object to AppendCompress and AppendDecompress?
Based on the discussion above, this proposal seems like a likely accept. – rsc for the proposal review group
No change in consensus, so accepted. This issue now tracks the work of implementing the proposal. – rsc for the proposal review group
Change https://go.dev/cl/429736 mentions this issue: compress/gzip: add AppendEncoded
There are situations where we have the entire input on hand and want to compress/decompress it into a buffer we already have on hand. I propose the addition of the following functions to gzip, flate, and zlib:

At face value, this appears to be yet another proposal for convenience functions (which has been proposed and rejected before: #16504). However, there are CPU and memory performance benefits to this API that are impossible to achieve otherwise.

Compression is the combination of LZ77 and Huffman encoding. The memory used by LZ77 is essentially the last 32KiB of uncompressed data. When operating with an io.Reader or io.Writer, the compressor/decompressor is responsible for maintaining a copy of the LZ77 window. This is quite a bit of copying and extra memory usage. The advantage of the APIs above is that there is zero work needed to maintain the LZ77 window separately, since AppendEncoded.src and AppendDecoded.dst are themselves the LZ77 windows. When decompressing, the only memory required is the memory for the Huffman decoder, which is relatively small. When compressing, the only memory required is the memory for the LZ77 search structure and Huffman encoder, which is still sizeable, but smaller than the entire LZ77 window.

Other considerations: the flate and zlib APIs provide the ability to specify a dictionary, and the proposed APIs have no means of specifying one. It appears that dictionaries are rarely used, such that adding them as part of the API is probably not worth it. Usage of dictionaries can still go through the io.Reader and io.Writer based APIs.