
compress: add AppendCompress and AppendDecompress #54078


dsnet commented 2 years ago

There are situations where we have the entire input in hand and want to compress or decompress it into a buffer we already have. I propose adding the following functions to gzip, flate, and zlib:

// AppendEncoded appends the compressed form of src to the end of dst
// using the specified compression level.
// It returns an error only if the specified level is invalid.
func AppendEncoded(dst, src []byte, level int) ([]byte, error)

// AppendDecoded appends the decompressed form of src to the end of dst.
// If the decompressed output exceeds maxSize, it returns an error.
// If maxSize is negative, then there is no limit on the output size.
func AppendDecoded(dst, src []byte, maxSize int) ([]byte, error)
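
For illustration, a round trip with the proposed functions might look like this (a hypothetical sketch, since the functions do not exist yet; input is assumed to hold the data to compress):

compressed, err := gzip.AppendEncoded(nil, input, gzip.BestCompression)
if err != nil {
    return nil, err // only possible if the level is invalid
}
// Cap the decompressed output at 1MiB to guard against decompression bombs.
plain, err := gzip.AppendDecoded(nil, compressed, 1<<20)
if err != nil {
    return nil, err
}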

At face value, this appears to be yet another proposal for convenience functions (which has been proposed and rejected before: #16504). However, this API has CPU and memory performance benefits that are impossible to achieve otherwise.

Compression is the combination of LZ77 and Huffman encoding. The memory used by LZ77 is essentially the last 32KiB of uncompressed data. When operating with an io.Reader or io.Writer, the compressor/decompressor is responsible for maintaining its own copy of the LZ77 window, which means quite a bit of copying and extra memory usage. The advantage of the APIs above is that no work is needed to maintain the LZ77 window separately, since AppendEncoded.src and AppendDecoded.dst are the LZ77 windows. When decompressing, the only memory required is for the Huffman decoder, which is relatively small. When compressing, the only memory required is for the LZ77 search structure and the Huffman encoder, which is sizeable, but still smaller than the entire LZ77 window.

Other considerations: the flate and zlib APIs provide the ability to specify a dictionary, and the proposed APIs have no means of doing so. Dictionaries appear to be rarely used, so supporting them here is probably not worth it. Code that needs dictionaries can still use the io.Reader- and io.Writer-based APIs, as sketched below.
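
For reference, here is a minimal sketch of supplying a dictionary through the existing streaming API, using flate (data and the dictionary contents are arbitrary placeholders):

dict := []byte("common substrings expected in the input")

var bb bytes.Buffer
zw, err := flate.NewWriterDict(&bb, flate.BestCompression, dict)
if err != nil {
    return nil, err
}
if _, err := zw.Write(data); err != nil {
    return nil, err
}
if err := zw.Close(); err != nil {
    return nil, err
}

// The reader must be primed with the same dictionary.
zr := flate.NewReaderDict(&bb, dict)
defer zr.Close()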

ianlancetaylor commented 2 years ago

I don't see anywhere to keep state here. Is this the correct usage? In particular, are we required to always keep passing the same destination slice to each call? Is there really no other state required?

    buf := make([]byte, 4096)
    var dst []byte
    for {
        n, err := r.Read(buf)
        if err != nil && err != io.EOF {
            return nil, err
        }
        if n == 0 {
            break
        }
        dst, err = gzip.AppendEncoded(dst, buf[:n], 9)
        if err != nil {
            return nil, err
        }
    }
    return dst, nil
dsnet commented 2 years ago

The intention of the proposed API is for use when you already have the entire input as a []byte and want to continue dealing with it as a []byte. Any time you start or end with an io.Reader or io.Writer, the current API is superior.

For your example, you could do:

b, err := io.ReadAll(r) // converts io.Reader to []byte
if err != nil {
    return nil, err
}
return gzip.AppendEncoded(nil, b, 9) // stays in the []byte world

but I'm not sure if that's any faster or slower than doing:

var bb bytes.Buffer
zw := gzip.NewWriter(&bb) // stays in the io.Writer world
if _, err := io.Copy(zw, r); err != nil { // connects io.Writer and io.Reader together
    return nil, err
}
if err := zw.Close(); err != nil {
    return nil, err
}
return bb.Bytes(), nil // converts io.Writer into a []byte

Calling gzip.AppendEncoded within each read loop would be a fairly odd use of the API. It is technically correct, since each call would append the compressed form of buf[:n] as an individual gzip stream to the destination, and a sequence of gzip streams concatenated together is still a valid gzip file per RFC 1952. But since the chunk size is 4K, I would not expect the compression ratio to be particularly great, since we are effectively limiting the compression window to 4K.
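
That concatenation property is easy to see with the existing API (a small sketch; the chunk contents are arbitrary):

var bb bytes.Buffer
for _, chunk := range [][]byte{[]byte("hello, "), []byte("world")} {
    zw := gzip.NewWriter(&bb) // each iteration emits a complete gzip stream
    zw.Write(chunk)          // writes to a bytes.Buffer cannot fail
    zw.Close()
}

// gzip.Reader decodes concatenated streams as one by default (multistream mode).
zr, err := gzip.NewReader(&bb)
if err != nil {
    return nil, err
}
defer zr.Close()
out, err := io.ReadAll(zr) // out is "hello, world"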

rsc commented 2 years ago

This API would remove the allocation of the window. How much of the other allocated state can we remove? Can these be implemented with no allocation other than the append? (That is, can we get everything to be stack-allocated? Or were you planning on a sync.Pool?)

dsnet commented 2 years ago

I was planning on using a sync.Pool. With some work, I'm fairly certain the Huffman coders can be stack-allocated (since they're of fairly constant size); I'm not sure about the LZ77 search state.
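
For concreteness, a pooled-writer approach can even be approximated today on top of the streaming API (appendCompress and gzWriters are hypothetical names for this sketch, not the planned implementation):

var gzWriters = sync.Pool{
    New: func() any { return gzip.NewWriter(io.Discard) },
}

// appendCompress appends the gzip-compressed form of src to dst,
// reusing encoder state across calls via the pool.
func appendCompress(dst, src []byte) ([]byte, error) {
    buf := bytes.NewBuffer(dst) // further writes append to dst's contents
    zw := gzWriters.Get().(*gzip.Writer)
    defer gzWriters.Put(zw)
    zw.Reset(buf)
    if _, err := zw.Write(src); err != nil {
        return nil, err
    }
    if err := zw.Close(); err != nil {
        return nil, err
    }
    return buf.Bytes(), nil
}

Note that this still pays for the window copies inside the streaming implementation; the proposal's point is that a native append-based implementation would not.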

rsc commented 2 years ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

ianlancetaylor commented 2 years ago

Is Encoded really the right word here? Should these be named AppendCompressed and AppendDecompressed?

dsnet commented 2 years ago

I'm fine with either AppendEncoded/AppendDecoded or AppendCompressed/AppendDecompressed. I see "compression" as a particular type of "encoding". Decoded is shorter than Decompressed, and the fact that we're dealing with compression is implied by the package name: gzip.AppendDecoded.

rsc commented 2 years ago

strconv.AppendQuote is not AppendQuoted (no d). So it seems like we can use AppendCompress and AppendDecompress, which will be very clear. Quoting is a kind of encoding too, of course, but it's probably clearer to say the encoding than just 'Encoded'.

Does anyone object to AppendCompress and AppendDecompress?

rsc commented 2 years ago

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

rsc commented 2 years ago

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

gopherbot commented 2 years ago

Change https://go.dev/cl/429736 mentions this issue: compress/gzip: add AppendEncoded