compress/flate: can't reuse inflate state for split data packets

gudvinr commented 3 years ago

What version of Go are you using (`go version`)?

$ go version
go version go1.17.1 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="~/.cache/go-build"
GOENV="~/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="~/tools/go/path/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="~/tools/go/path"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/lib/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.1"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1090573846=/tmp/go-build -gno-record-gcc-switches"

What did you do?

WebSocket permessage-deflate compression extension (RFC7692) can reuse LZ77 sliding window for future messages which requires maintaining inflate state so you could feed incoming messages to it.

Also Discord uses zlib streaming compression in a similar manner.

In both cases sender uses single context for all consecutive messages that end with Z_SYNC_FLUSH = 0x0000FFFF.

If you decompress these messages without resetting inflate state it wouldn't be possible but quite expected.

If you try to maintain inflate state, it's still doesn't seem possible to decompress separate messages within same context.
io.Reader object returned from flate.NewReader doesn't provide information about stream state. Read calls do not indicate when flate encounters Z_SYNC_FLUSH.

Neither inflate nor WebSocket protocol have information about size of decompressed payload (and doesn't need to correctly work) so you can't allocate correctly-sized buffer for reading without going out of message boundaries.

If you try to call Read on io.Reader from flate.NewReader, you'll get io.UnexpectedEOF as soon as first compressed message ends. You can't also feed data to that reader anymore even if you ignore io.UnexpectedEOF (which is good). It can be easily reproduced by using io.ReadAll.
It is understandable why it works like that and not an issue by itself.

var flate io.ReadCloser

data := receiveCompressedMessageFromWebSocket()
buf := bytes.NewBuffer(data)

flate = flate.NewReader(buf)
msg, err := io.ReadAll(flate)
// err will have io.UnexpectedEOF

seankhliao commented 3 years ago

cc @dsnet

dsnet commented 3 years ago

WebSocket permessage-deflate compression extension (RFC7692) can reuse LZ77 sliding window for future messages which requires maintaining inflate state so you could feed incoming messages to it.

Have you tried using NewReaderDict and NewWriterDict? If so, why are these insufficient? The "dictionary" is the LZ77 sliding window prior to the current decompression offset.

gudvinr commented 3 years ago

How could you extract said dictionary so you can feed it to NewReaderDict?

flate.Writer also doesn't have methods for exporting its dictionary to reuse for consecutive writes.

I am by no means a specialist in compression algorithms. From what I know, inflate can back-reference occurrences of same data blocks within sliding window, which is at most 32KiB IIRC.
So, I suppose you should maintain circular buffer by yourself by overwriting content with last read data. Is that correct?

dsnet commented 3 years ago

So, I suppose you should maintain circular buffer by yourself by overwriting content with last read data. Is that correct?

Correct. The dictionary is literally the last 32KiB of the most recently compressed input or decompressed output. At the very least, you could use a io.MultiWriter to siphon off the last parts of the buffer.

We could add methods to directly retrieve the buffer with something like:

// ExportDictionary dictionary copies the internal dictionary into b
// and reports the number of bytes copied.
// A buffer of at least 32KiB should be provided. 
func (*Writer) ExportDictionary(b []byte) int

It's highly unfortunate that NewReader and NewWriter are asymmetrical. NewReader returns an io.ReadCloser, so it would be difficult to add new methods to the underlying decompressor type.

gudvinr commented 3 years ago

OK, dictionary itself isn't an issue then, thanks. Adding new method to stdlib isn't necessary in that case. At least not right now.

But there is still an issue with reader (writer is probably fine) because it's impossible to know whether it finished to read all decompressed data from provided chunk or not. I don't think that's a good idea to rely on io.UnexpectedEOF and use Resetter.
There's no way to know if io.UnexpectedEOF is indeed an unexpected EOF that caused by e.g. corrupted data or it's an end of current chunk.

Edit: I tried using fixed buffer for incoming stream and then passing that to NewReaderDict and it seems to be working as expected.
However, there should be a way to reliably detect when reader consumed all input.

How about some kind of NewFlushReaderWithDict(r io.Reader, dict []byte, flush int)? flush int is a flag to indicate whether we expecting to read up until Z_SYNC_FLUSH or Z_FINISH (as it is now). It is similar to what inflate in zlib does.
This reader could return io.EOF when it consumed all of the reader content and encountered Z_SYNC_FLUSH at the end of data block.
Old NewReaderDict could just call NewFlushReaderWithDict with ZFinish argument.

gudvinr commented 3 years ago

@dsnet can you please take another look and give some feedback?
I don't think that shipping production code that uses io.UnexpectedEOF as a way to detect expected EOF is a good idea. So I still would appreciate ability to reliably detect boundaries inside zlib streams.

edaniels commented 2 years ago

In addition to relying on io.UnexpectedEOF, it doesn't seem like you can "just" set the window to the last 32KiB read. It seems like the dictionary must also take into account not appending data into the circular buffer that was decoded via dictionary back references. I could be very wrong about this assumption though, but trying to create my own reader that resets the underlying each time a z sync flush happens, the dictionary eventually becomes corrupted when introducing data (I think) that makes back references. For now I'm using https://github.com/DataDog/czlib, but ideally other users wouldn't hit this unexpected behavior.

golang / go