nayuki / DEFLATE-library-Java

Efficient DEFLATE compressor and decompressor in pure Java.
https://www.nayuki.io/page/deflate-library-java

Suggestion: multi-member GZIP and unmarkable streams #2

Open ajohnson1 opened 3 years ago

ajohnson1 commented 3 years ago

The GZIP RFC 1952 allows multiple members, where each member has a header, a deflated section and a trailer.
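To make the multi-member layout concrete, here is a minimal sketch that builds a two-member file by concatenating two complete GZIP members and reads it back. It uses only the JDK's java.util.zip classes, not DEFLATE-library-Java; the class and method names are made up for illustration, and readAllBytes() assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of DEFLATE-library-Java.
final class MultiMemberGzipDemo {

    // Compresses one string into a single complete GZIP member
    // (header, DEFLATE data, trailer).
    static byte[] gzipMember(String text) throws IOException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bout)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bout.toByteArray();
    }

    // A multi-member file is just the members written back to back.
    static byte[] concat(byte[]... members) {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        for (byte[] m : members)
            bout.write(m, 0, m.length);
        return bout.toByteArray();
    }

    // Decompresses every member with the JDK's GZIPInputStream, which
    // silently joins the members into one output.
    static String gunzipAll(byte[] file) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(file))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] file = concat(gzipMember("first member, "), gzipMember("second member"));
        System.out.println(gunzipAll(file));
    }
}
```

Each member starts with its own 0x1F 0x8B magic bytes, so a reader that stops after the first trailer sees only the first member's data.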

It's possible to handle these with DEFLATE-library-Java by using a markable underlying stream: read the header from the underlying stream, run the inflater, call detach to reset the underlying stream, read the trailer from the underlying stream, read the next header (if present) from the underlying stream, then create a new inflater. This works, and normally it is possible to make a stream markable by wrapping it in a BufferedInputStream.
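The mark/reset idea behind this workflow can be sketched roughly as follows. This is not the library's code: it uses the JDK's java.util.zip.Inflater as a stand-in, and the class name, helper names and chunk size are assumptions. The stream is marked before each chunk is read; once the DEFLATE stream ends, the stream is reset and only the bytes that were actually consumed are skipped, leaving it positioned at the first byte after the compressed data.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;

// Hypothetical demo class; java.util.zip.Inflater stands in for the library's inflater.
final class MarkResetInflateDemo {
    static final int CHUNK = 256;  // Arbitrary read size, also used as the mark read limit

    // Inflates one raw DEFLATE stream and leaves `in` positioned just after it.
    static byte[] inflateOne(InputStream in) throws IOException {
        if (!in.markSupported())
            throw new IllegalArgumentException("Stream must support mark/reset");
        Inflater inf = new Inflater(true);  // true = raw DEFLATE, no zlib wrapper
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] inBuf = new byte[CHUNK];
        byte[] outBuf = new byte[CHUNK];
        int lastLen = 0;
        try {
            while (!inf.finished()) {
                if (inf.needsInput()) {
                    in.mark(CHUNK);  // Remember where this chunk starts
                    lastLen = in.read(inBuf);
                    if (lastLen == -1)
                        throw new EOFException("Truncated DEFLATE stream");
                    inf.setInput(inBuf, 0, lastLen);
                }
                out.write(outBuf, 0, inf.inflate(outBuf));
            }
            // Rewind to the start of the last chunk, then skip only the consumed part.
            int consumed = lastLen - inf.getRemaining();
            in.reset();
            for (int left = consumed; left > 0; ) {
                long n = in.skip(left);
                if (n <= 0)
                    throw new IOException("Unable to skip consumed bytes");
                left -= (int) n;
            }
        } catch (DataFormatException e) {
            throw new IOException(e);
        } finally {
            inf.end();
        }
        return out.toByteArray();
    }

    // Builds test input: a raw DEFLATE stream of `payload` followed by uncompressed `tail`.
    static byte[] buildInput(String payload, String tail) throws IOException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        try (DeflaterOutputStream dout = new DeflaterOutputStream(bout, def)) {
            dout.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        def.end();
        byte[] t = tail.getBytes(StandardCharsets.UTF_8);
        bout.write(t, 0, t.length);
        return bout.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new BufferedInputStream(
            new ByteArrayInputStream(buildInput("compressed payload", "TRAILER")));
        System.out.println(new String(inflateOne(in), StandardCharsets.UTF_8));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```

The sketch depends on the mark position living in the stream object itself, which is why sharing one underlying file among several independently positioned readers gets awkward.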

Sometimes this isn't so easy. For example, Eclipse Memory Analyzer creates a random access view of a GZIP file by having multiple GZIPInputStreams all based on the same underlying RandomAccessFile, and switching the underlying stream to the correct position for a GZIPInputStream before using that stream. This goes wrong with multiple members as it isn't so easy to switch the mark positions as well, though it could be done.

An idea I had was to have a different detach mode where, after the end of a member (and a -1 return), the caller could detach the inflater and then read from the inflater: first the unprocessed data in the input buffer, then the underlying stream. Once the caller had read the new header, the inflater could be restarted on a new member with an attach() call. This could be hidden inside a GZIPInputStream so that its caller never needs to be aware of the multiple members.

I've made the change to a private version of InflaterInputStream used by Memory Analyzer but perhaps people have some better ideas as to how this could be accomplished.

nayuki commented 3 years ago

> The GZIP RFC 1952 allows multiple members, where each member has a header, a deflated section and a trailer.

You're right, wow! I didn't consciously notice that part when I first read the GZIP specification ~10 years ago. Looks like what you're talking about is section 2.2. I apologize for missing this detail and not designing my library to accommodate this legitimate (though uncommon) use case.

I don't know what to say right now because I haven't loaded this project into human memory. I'll take your feature suggestion into consideration the next time I have the opportunity to revisit this codebase.

nayuki commented 1 year ago

After studying the behavior of the standard gunzip program on concatenated gzip files, I was disappointed to see that it just concatenates all their decompressed data to produce a single output file. There was no attempt to signal that the file has multiple parts or to handle each part's embedded file name. So at least on this front, I can't envision making behavior changes to gunzip.java.

As for your situation, the behavior you described in Eclipse Memory Analyzer sounds sketchy - the notion of basing multiple GZIPInputStream objects on one RandomAccessFile object. As for how this relates to my library...

First off, I apologize for the delay, and also for "innovating" with the new concept of detach(), which differs from close(). I realize there's some sketchy stuff in Java I/O conventions, like how when you have new FilterInputStream(new FileInputStream()), the wrapper is expected to close the inner stream - but there are many cases where this is not the desired behavior. There are also questions about how much the wrapper stream is allowed to "over-read" the inner stream.

One thing I want to make clear: My DEFLATE library will only allocate memory (whether pure Java memory or even native memory); it will not allocate sensitive resources like file handles, sockets, etc. that require timely clean up. Therefore, it is always valid to discard my InflaterInputStream and continue to use the underlying stream. This doesn't seem to be a convention that Java I/O libraries promote. You can find examples in my other work, where I create and discard DataInputStream and CheckedInputStream over a substring of the full input sequence... and Eclipse warns me with "Resource leak: 'in' is never closed".
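As a small illustration of the "create a wrapper, use it briefly, discard it" pattern, the sketch below reads a length prefix through a temporary DataInputStream and then carries on with the underlying stream. The length-prefixed format and all names are invented for the example. It works because DataInputStream.readInt() reads exactly four bytes and does not buffer ahead; a wrapper that over-reads would need something like detach() or push-back instead.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class showing a wrapper that is discarded without being closed.
final class DiscardWrapperDemo {

    // Reads a big-endian 32-bit length through a throwaway DataInputStream,
    // then reads that many payload bytes directly from the underlying stream.
    static byte[] readLengthPrefixed(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);  // Intentionally never closed
        int len = din.readInt();  // Consumes exactly 4 bytes of `in`
        byte[] payload = new byte[len];
        int off = 0;
        while (off < len) {
            int n = in.read(payload, off, len - off);
            if (n == -1)
                throw new EOFException("Truncated payload");
            off += n;
        }
        return payload;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0, 0, 0, 3, 'a', 'b', 'c', 'Z'};
        InputStream in = new ByteArrayInputStream(data);
        System.out.println(new String(readLengthPrefixed(in), "US-ASCII"));
        System.out.println((char) in.read());  // The underlying stream is still usable
    }
}
```

Closing `din` here would also close `in`, which is exactly the behavior the caller does not want in this situation.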

Regarding your idea of "start reading from the inflater the unprocessed data from the input buffer and then the underlying stream", my worry is that now you either have to engage in a dance to read the correct number of bytes before discarding the inflater so that you can directly read the underlying stream, or you end up with an increasing number of layers of inflaters after the end of each DEFLATE stream.

One alternate idea is that instead of using markable streams, the underlying stream could be a PushbackInputStream, so that upon detaching, my inflater would "push back" all the unused bytes to the underlying stream. This avoids manually splicing between the inflater and the underlying stream, and also avoids an unbounded number of wrapper streams.
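A rough sketch of the push-back idea, again using the JDK's java.util.zip.Inflater rather than this library's inflater: after a member's DEFLATE stream ends, the bytes the inflater did not consume are unread into a PushbackInputStream, so the trailer and the next header are read from the same stream object. The class and method names are hypothetical, the header parser assumes a bare 10-byte header with no optional fields, and the CRC-32 and size in the trailer are skipped rather than checked.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

// Hypothetical demo class; java.util.zip.Inflater stands in for the library's inflater.
final class PushbackGzipMembers {
    static final int CHUNK = 512;  // Read size; the push-back buffer must be at least this big

    // Inflates one raw DEFLATE stream, then pushes the unused input bytes back.
    static void inflateOne(PushbackInputStream in, OutputStream out) throws IOException {
        Inflater inf = new Inflater(true);  // true = raw DEFLATE, no zlib wrapper
        byte[] inBuf = new byte[CHUNK];
        byte[] outBuf = new byte[CHUNK];
        int lastLen = 0;
        try {
            while (!inf.finished()) {
                if (inf.needsInput()) {
                    lastLen = in.read(inBuf);
                    if (lastLen == -1)
                        throw new EOFException("Truncated DEFLATE stream");
                    inf.setInput(inBuf, 0, lastLen);
                }
                out.write(outBuf, 0, inf.inflate(outBuf));
            }
            int surplus = inf.getRemaining();  // Over-read bytes at the tail of inBuf
            in.unread(inBuf, lastLen - surplus, surplus);
        } catch (DataFormatException e) {
            throw new IOException(e);
        } finally {
            inf.end();
        }
    }

    static byte[] readExactly(InputStream in, int len) throws IOException {
        byte[] b = new byte[len];
        for (int off = 0; off < len; ) {
            int n = in.read(b, off, len - off);
            if (n == -1)
                throw new EOFException("Unexpected end of stream");
            off += n;
        }
        return b;
    }

    // Returns the decompressed text of each member separately.
    static List<String> readMembers(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, CHUNK);
        List<String> result = new ArrayList<>();
        int first;
        while ((first = in.read()) != -1) {
            in.unread(first);
            byte[] h = readExactly(in, 10);
            // Assumption: deflate method (8) and FLG == 0, i.e. no optional header fields.
            if ((h[0] & 0xFF) != 0x1F || (h[1] & 0xFF) != 0x8B || h[2] != 8 || h[3] != 0)
                throw new IOException("Unsupported GZIP header");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            inflateOne(in, out);
            readExactly(in, 8);  // Trailer (CRC-32 and ISIZE), not verified in this sketch
            result.add(new String(out.toByteArray(), StandardCharsets.UTF_8));
        }
        return result;
    }

    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bout)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bout.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        file.write(gzip("first member"));
        file.write(gzip("second member"));
        System.out.println(readMembers(new ByteArrayInputStream(file.toByteArray())));
    }
}
```

Only one wrapper exists for the whole file, and the caller never has to splice leftover buffer contents by hand; the cost is that the push-back buffer must be sized to hold a full read chunk.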

I do appreciate that you took care of the problem yourself, as there is no way I can study and cater to each application/usage individually. Anyway, let me know what you think of my analysis.