Open gudvinr opened 3 years ago
cc @dsnet
WebSocket permessage-deflate compression extension (RFC7692) can reuse LZ77 sliding window for future messages which requires maintaining inflate state so you could feed incoming messages to it.
Have you tried using NewReaderDict
and NewWriterDict
? If so, why are these insufficient? The "dictionary" is the LZ77 sliding window prior to the current decompression offset.
How could you extract said dictionary so you can feed it to NewReaderDict
?
flate.Writer
also doesn't have methods for exporting its dictionary to reuse for consecutive writes.
I am by no means a specialist in compression algorithms.
From what I know, inflate can back-reference occurrences of same data blocks within sliding window, which is at most 32KiB IIRC.
So, I suppose you should maintain circular buffer by yourself by overwriting content with last read data. Is that correct?
So, I suppose you should maintain circular buffer by yourself by overwriting content with last read data. Is that correct?
Correct. The dictionary is literally the last 32KiB of the most recently compressed input or decompressed output. At the very least, you could use a io.MultiWriter
to siphon off the last parts of the buffer.
We could add methods to directly retrieve the buffer with something like:
// ExportDictionary dictionary copies the internal dictionary into b
// and reports the number of bytes copied.
// A buffer of at least 32KiB should be provided.
func (*Writer) ExportDictionary(b []byte) int
It's highly unfortunate that NewReader
and NewWriter
are asymmetrical. NewReader
returns an io.ReadCloser
, so it would be difficult to add new methods to the underlying decompressor
type.
OK, dictionary itself isn't an issue then, thanks. Adding new method to stdlib isn't necessary in that case. At least not right now.
But there is still an issue with reader (writer is probably fine) because it's impossible to know whether it finished to read all decompressed data from provided chunk or not. I don't think that's a good idea to rely on io.UnexpectedEOF
and use Resetter
.
There's no way to know if io.UnexpectedEOF
is indeed an unexpected EOF that caused by e.g. corrupted data or it's an end of current chunk.
Edit: I tried using fixed buffer for incoming stream and then passing that to NewReaderDict
and it seems to be working as expected.
However, there should be a way to reliably detect when reader consumed all input.
How about some kind of NewFlushReaderWithDict(r io.Reader, dict []byte, flush int)
? flush int
is a flag to indicate whether we expecting to read up until Z_SYNC_FLUSH
or Z_FINISH
(as it is now). It is similar to what inflate
in zlib does.
This reader could return io.EOF
when it consumed all of the reader content and encountered Z_SYNC_FLUSH
at the end of data block.
Old NewReaderDict
could just call NewFlushReaderWithDict
with ZFinish
argument.
@dsnet can you please take another look and give some feedback?
I don't think that shipping production code that uses io.UnexpectedEOF
as a way to detect expected EOF is a good idea. So I still would appreciate ability to reliably detect boundaries inside zlib streams.
In addition to relying on io.UnexpectedEOF
, it doesn't seem like you can "just" set the window to the last 32KiB read. It seems like the dictionary must also take into account not appending data into the circular buffer that was decoded via dictionary back references. I could be very wrong about this assumption though, but trying to create my own reader that resets the underlying each time a z sync flush happens, the dictionary eventually becomes corrupted when introducing data (I think) that makes back references. For now I'm using https://github.com/DataDog/czlib, but ideally other users wouldn't hit this unexpected behavior.
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
WebSocket
permessage-deflate
compression extension (RFC7692) can reuse LZ77 sliding window for future messages which requires maintaining inflate state so you could feed incoming messages to it.Also Discord uses zlib streaming compression in a similar manner.
In both cases sender uses single context for all consecutive messages that end with
Z_SYNC_FLUSH = 0x0000FFFF
.If you decompress these messages without resetting inflate state it wouldn't be possible but quite expected.
If you try to maintain inflate state, it's still doesn't seem possible to decompress separate messages within same context.
io.Reader
object returned fromflate.NewReader
doesn't provide information about stream state.Read
calls do not indicate when flate encountersZ_SYNC_FLUSH
.Neither inflate nor WebSocket protocol have information about size of decompressed payload (and doesn't need to correctly work) so you can't allocate correctly-sized buffer for reading without going out of message boundaries.
If you try to call
Read
onio.Reader
fromflate.NewReader
, you'll getio.UnexpectedEOF
as soon as first compressed message ends. You can't also feed data to that reader anymore even if you ignoreio.UnexpectedEOF
(which is good). It can be easily reproduced by usingio.ReadAll
.It is understandable why it works like that and not an issue by itself.