linkedin / migz

Multithreaded, gzip-compatible compression and decompression, available as a platform-independent Java library and command-line utilities.
BSD 2-Clause "Simplified" License
77 stars 12 forks source link

munzip fails on gzipped files #5

Open WadeWalker opened 3 years ago

WadeWalker commented 3 years ago

Hi guys,

Thanks for the great library! It's made some of my Java programs that were I/O bound much faster :)

However, it's currently hard to use MiGz unless you can guarantee that it will never try to unzip a file zipped by vanilla gzip. If you do, it crashes:

% gzip junk.csv
% munzip < junk.csv.gz > junk.csv
Decompressing stdin using 40 threads
Exception in thread "main" java.lang.RuntimeException: java.util.zip.DataFormatException: invalid code lengths set
    at com.linkedin.migz.MiGzInputStream.decompressorThreadWithInflater(MiGzInputStream.java:208)
    at com.linkedin.migz.MiGzInputStream.decompressorThread(MiGzInputStream.java:114)
    at com.concurrentli.Interrupted.lambda$ignored$1(Interrupted.java:48)
    at java.base/java.lang.Thread.run(Thread.java:830)
Caused by: java.util.zip.DataFormatException: invalid code lengths set
    at java.base/java.util.zip.Inflater.inflateBytesBytes(Native Method)
    at java.base/java.util.zip.Inflater.inflate(Inflater.java:378)
    at java.base/java.util.zip.Inflater.inflate(Inflater.java:464)
    at com.linkedin.migz.MiGzInputStream.decompressorThreadWithInflater(MiGzInputStream.java:192)
    ... 3 more

Would it be possible for MiGz to just decompress serially in this case? The opposite seems to work fine; I can decompress MiGz files with gunzip with no problem.

This is MiGz 1.0.1 running under Oracle OpenJDK 13.0.2 on RedHat 7.4.

jeffpasternack commented 3 years ago

Hi Wade,

Glad it's been useful for you, and thank you for filing this issue. Not decompressing non-MiGz gzip files is a known limitation of MiGz, but apparently not well-known enough, which suggests that, at minimum, there's an opportunity for us to improve the documentation here.

Could you help us understand why you'd want to decompress non-MiGz file data with MiGz? It seems like non-MiGz and MiGz data can be readily distinguished by filename extensions or (more generally) other context, and you can then choose to use munzip or gzip as appropriate (or, in code, the MiGzInputStream or GZIPInputStream classes).

Detecting and then processing non-MiGz gzip data within the MiGzInputStream class would actually be a relatively complex endeavor: we'd need to check the header, and, if it's not MiGz data, somehow push the data back onto the stream (this would require writing a new stream wrapper class), give that stream to GZipInputStream, shut down any worker threads and then delegate all calls to that stream.

On the other hand, we could more feasibly add explicit detection of non-MiGz headers and throw a more informative exception. Except for very small block sizes the computation cost for this check would be trivial--please let us know if this would be useful.

Thanks, Jeff

WadeWalker commented 3 years ago

Hi Jeff,

In my particular use-case, I support a flow that consists of two processes that can either be invoked together or separately. The first process writes out .gz files, and the second process reads them in. If the user invokes the whole flow together, I can do both sides with MiGz and get that sweet performance boost. But if the user invokes only the second process, and they feed in their own .gz input files that they created with gzip, MiGz throws the error I showed.

As you say, I could name MiGz files as *.migz and gzip files as *.gz to make it explicit which program created which file. But this abandons what I think is the coolest part of MiGz, which is that it's (almost) able to be used interchangeably with GZIP{Input|Output}Stream. MiGz already writes completely legal .gz files that can be unzipped by gunzip. If it could also read .gz files produced by gzip and just decode them serially, it would be gloriously symmetrical, and it would be invisible to the user what file was created by who (except for the performance benefits of MiGz). Programmers could just replace all occurrences of GZIP{Input|Output}Stream in their code with MiGz{Input|Output}Stream, and performance would transparently improve when possible, and it would still be functional in all other cases.

You know the code base, so maybe there's a reason why this would be horrible to implement. I was assuming that MiGz reads the header before spawning its worker threads. Perhaps if the header doesn't contain the MiGz-specific EXTRA fields, you could just spawn one worker, since the whole file is one big block, and decode as normal?

jeffpasternack commented 3 years ago

Hi Wade,

You have a good point with respect to being able to blindly replace GZip* with MiGz*. MiGz does sort-of read the header when decompressing, but since it's expecting a MiGz-compressed file it only reads the variable fields at (mostly) known, static locations. Being able to correctly decompress any valid GZip data is, unfortunately, not as easy as it appears: we'd have to be ready to switch to single-threaded behavior when we first hit a non-MiGz block (a legal GZip file could be a concatenation of a MiGz file and a non-MiGz GZip file) and then delegate all remaining data reads to an internal GZipInputStream.

It's definitely doable, but would add substantial complexity and, more importantly, technical risk (it's very hard to exhaustively test multithreaded code--this is why our current release is still a "beta" :) ). Unless there was widespread extant demand I don't think we'd want to implement this since it's a large investment for a small-to-moderate convenience for applicable clients (you can already replace GZipInputStream with something like fileName.endsWith(".mz") ? MiGzInputStream : GZipInputStream).

In the meantime, I do think we should improve the documentation to make this limitation more clear as well as more informative exceptions.

WadeWalker commented 3 years ago

Hi Jeff,

It's all good -- you're the ultimate arbiter of what makes sense for your codebase :) In my case, I could cheat if I needed to, and pre-read the file header to see if it's a MiGz file or not, then return the appropriate stream type (since I start from Path instead of Stream elsewhere in my code, so I wouldn't have to re-stuff the stream, I could just re-read).

Just to make sure I fully understand MiGz' intended behavior now, could you confirm:

If this is the contract, I can work with this :) I just want to make sure I haven't still got it twisted somehow.

Also, for your future consideration in case you ever revisit this issue, it would be fine for me if MiGz didn't handle the case of a concatenated MiGz/non-MiGz file. I wouldn't mind making the assumption that if a file contains any MiGz data, it will all be MiGz data, and erroring out if that's not the case. You're totally correct that to be 100% gzip compatible you'd have to handle this, but I think just handling the simplest cases of all-MiGz and all-gzip would get us 99% of the benefits.