snapframework / io-streams

Simple, composable, and easy-to-use stream I/O for Haskell

gunzip truncates concatenated gzip streams #56

Open lpsmith opened 8 years ago

lpsmith commented 8 years ago

I ran across this issue when working on my ftp-monitor project.

Anyway, here's a fairly minimal reimplementation of zcat, which is sufficient to demonstrate the problem:

module Main (main) where

import qualified System.IO as IO
import qualified System.IO.Streams as Streams
import           System.Environment (getArgs)

main :: IO ()
main = do
    (filename:_) <- getArgs
    IO.withFile filename IO.ReadMode $ \h -> do
        -- decompress the file and copy the result to stdout
        inS <- Streams.handleToInputStream h >>= Streams.gunzip
        Streams.connect inS Streams.stdout

Now, if I run this on a big file, it truncates the output. (Curiously, on this file and several others, the truncation point falls on a newline.)

lpsmith@church:~/src/ftp-monitor/zcat
$ wc <(cat ~/logs/cas/d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346.gz)
  28291  161049 7303701 /dev/fd/63
$ wc <(dist/build/zcat/zcat ~/logs/cas/d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346.gz)
  14345  313055 4402405 /dev/fd/63
$ wc <(zcat ~/logs/cas/d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346.gz)
 454378 9861255 139101003 /dev/fd/63
$ sha256sum <(dist/build/zcat/zcat ~/logs/cas/d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346.gz)
ed347ba497088e0aaeb7a69d5bd694218c93a361b5683edbda73c22ff45dedd8  /dev/fd/63
$ sha256sum <(zcat ~/logs/cas/d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346.gz)
d5941d7e89c0d6306fc7eaa003c6389ad9e1dd7ba8b429fd0a9038344443e346  /dev/fd/63

I'll be investigating this more deeply to see if I can track down the problem, but it would appear that the bug is in io-streams.

lpsmith commented 8 years ago

Actually, I'm wondering if io-streams' zlib integration is incomplete, because the output also appears to be getting truncated on the first day of the month.

lpsmith commented 8 years ago

Ok, it turns out that it's entirely legitimate to concatenate gzip streams, and gunzip and zcat will decompress each stream and concatenate the results.

So for example:

$ echo Hello | gzip >> foo.gz
$ echo World | gzip >> foo.gz
$ ./dist/build/zcat/zcat foo.gz 
Hello
$ zcat foo.gz 
Hello
World
$ gunzip foo.gz
$ cat foo
Hello
World

So my "minimal reimplementation" of zcat isn't. One could argue that this isn't a bug in io-streams, however, the semantics of Streams.gunzip is rather different than gunzip, which is horrible UX.

At the very least, this needs to be mentioned in the documentation, and we probably need to implement a proper gunzip analog. Ideally, I think Streams.gunzip should be renamed (perhaps to gunzipOne?) and the proper analog should get the name Streams.gunzip.
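For concreteness, the proposed split would look something like this at the type level (gunzipOne is a hypothetical name, not part of the current API; both operate on InputStream ByteString as Streams.gunzip does today):

-- Hypothetical API sketch; gunzipOne does not exist in io-streams today.
-- Decompress exactly one gzip member (the current Streams.gunzip behavior):
gunzipOne :: InputStream ByteString -> IO (InputStream ByteString)

-- Decompress every concatenated member, matching zcat/gunzip semantics:
gunzip :: InputStream ByteString -> IO (InputStream ByteString)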

lpsmith commented 8 years ago

And, as it turns out, this doesn't work:

import           Data.ByteString (ByteString)
import           Data.IORef (newIORef, readIORef, writeIORef)
import           System.IO.Streams (InputStream)
import qualified System.IO.Streams as Streams

gunzips :: InputStream ByteString -> IO (InputStream ByteString)
gunzips = multiTransformInputStream Streams.gunzip

multiTransformInputStream :: (InputStream a -> IO (InputStream b))
                          -> (InputStream a -> IO (InputStream b))
multiTransformInputStream trans input = do
    ref <- newIORef =<< trans input
    Streams.makeInputStream (readS ref)
  where
    readS ref = do
      mb <- Streams.read =<< readIORef ref
      case mb of
        Just _  -> return mb
        Nothing -> do
          -- the current inner stream is exhausted; restart the transform
          -- on the rest of the outer stream, unless it too is exhausted
          isAtEOF <- Streams.atEOF input
          if isAtEOF
            then return Nothing
            else do
              writeIORef ref =<< trans input
              readS ref

The issue is that the first time multiTransformInputStream reaches the end of a decompressed Streams.gunzip input stream, Streams.atEOF returns True on the compressed stream. In other words, Streams.gunzip consumed the whole compressed stream (7 MB in my first example) but only produced the first decompressed stream (4 of 140 MB in my first example). This means I can't simply work around the issue; I have to bypass io-streams' zlib integration altogether. I no longer think there's an argument that Streams.gunzip is correct.

lpsmith commented 8 years ago

Alright, I inspected zlib-bindings, and it's clear that the interface it currently exposes fundamentally does not support this use case. So I jaunted over to the world of conduit to see what the story is there; the relevant issues are fpco/streaming-commons#20 and snoyberg/conduit#254.

So at the very least, we are going to have to port the patches from streaming-commons to zlib-bindings, or possibly switch to streaming-commons.

lpsmith commented 8 years ago

Ok, as far as streaming-commons's dependencies are concerned: eyeballing it, the only transitive dependencies it would add to io-streams (beyond what already ships with GHC) are async, blaze-builder, random, and stm. These are all smallish libraries with minimal dependencies of their own.

So, I'd be willing to make the changes necessary to zlib-bindings, but I wanted to hear your opinion on moving to streaming-commons.

hvr commented 8 years ago

What would we need to add to zlib to keep io-streams free of streaming-commons?

lpsmith commented 8 years ago

zlib-bindings needs the ability to signal the end of a gzip stream, and to return any leftover input that was fed to it but isn't part of that stream.

This appears to be the relevant patch to streaming-commons: fpco/streaming-commons@b4666864bba4e93bcc30b15352c40def5d29b8ea
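Very roughly, each decompression step would need to be able to report something like the following (purely illustrative names; this is not the actual zlib-bindings or streaming-commons API):

import Data.ByteString (ByteString)

-- Illustrative only: what the inflater must be able to tell its caller
data InflateStep
    = InflateOutput ByteString   -- a chunk of decompressed output
    | InflateDone ByteString     -- this gzip member is finished; any input
                                 -- consumed past its boundary is returned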

hvr commented 8 years ago

@lpsmith btw, I really meant zlib; I seem to recall that there were some issues/limitations that kept io-streams from switching from zlib-bindings (which was declared deprecated iirc) to zlib.

Not too long ago I imported zlib into GitHub at https://github.com/haskell/zlib in the hope of getting zlib to the point where it can pick up former users of zlib-bindings.

lpsmith commented 8 years ago

@hvr, ahh, thank you for the clarification.

Eyeballing it, it would appear that the DecompressStream from the Codec.Compression.Zlib.Internal module from zlib-0.6 onwards should be capable of covering this use case. However, at first glance, I don't see anything in zlib versions < 0.6 that would work.
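For instance, here is a rough, untested sketch of a zcat-style gunzip for io-streams on top of that incremental interface, using decompressIO, gzipFormat, and defaultDecompressParams from zlib-0.6's Codec.Compression.Zlib.Internal (errors are simply rethrown as userErrors, and trailing garbage that isn't a valid gzip header is not handled):

module GunzipMulti (gunzipMulti) where

import qualified Codec.Compression.Zlib.Internal as Z
import           Data.ByteString (ByteString)
import qualified Data.ByteString as B
import           Data.IORef (IORef, newIORef, readIORef, writeIORef)
import           Data.Maybe (fromMaybe)
import           System.IO.Streams (InputStream)
import qualified System.IO.Streams as Streams

-- Decompress every gzip member in the input, like zcat, not just the first.
gunzipMulti :: InputStream ByteString -> IO (InputStream ByteString)
gunzipMulti input = do
    ref <- newIORef newDecoder
    Streams.makeInputStream (step ref)
  where
    newDecoder = Z.decompressIO Z.gzipFormat Z.defaultDecompressParams

    step :: IORef (Z.DecompressStream IO) -> IO (Maybe ByteString)
    step ref = do
        s <- readIORef ref
        case s of
          Z.DecompressInputRequired supply -> do
              -- feed the next compressed chunk; an empty chunk signals EOF
              chunk <- fromMaybe B.empty <$> Streams.read input
              supply chunk >>= writeIORef ref
              step ref
          Z.DecompressOutputAvailable out next -> do
              next >>= writeIORef ref
              return (Just out)
          Z.DecompressStreamEnd leftover -> do
              -- one member finished: if any input remains (leftover bytes or
              -- more in the underlying stream), push it back and start a
              -- fresh decoder; otherwise we are at the true end of stream
              more <- if B.null leftover
                        then Streams.read input
                        else return (Just leftover)
              case more of
                Nothing -> return Nothing
                Just bs -> do
                    Streams.unRead bs input
                    writeIORef ref newDecoder
                    step ref
          Z.DecompressStreamError err ->
              ioError (userError (show err))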

So it looks like, as of last year, there is nothing preventing io-streams from depending on zlib alone. However, how stable is that interface going to be?

hvr commented 8 years ago

@lpsmith ah, now I think I remember what the missing thing was: flushing support in the compression-stream. That's where the API I imitated for lzma diverges from zlib's state-machine data-type:

http://hackage.haskell.org/package/lzma-0.0.0.2/docs/Codec-Compression-Lzma.html#t:CompressStream

Note the additional field in CompressInputRequired; I had to add this to implement lzma-streams properly.

I'll talk to @dcoutts, but I think after adding support for flushing (https://github.com/haskell/zlib/issues/6) we should promote the incremental API into a non-.Internal module.
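For reference, the lzma type is roughly the following (paraphrased from memory; the linked Hackage docs are authoritative, and the field order here may differ):

-- Paraphrased from lzma-0.0.0.2's Codec.Compression.Lzma.  The extra first
-- field of CompressInputRequired is the flush action that zlib's incremental
-- CompressStream currently has no counterpart for:
data CompressStream m
    = CompressInputRequired
        (m (CompressStream m))                 -- flush: force out pending output
        (ByteString -> m (CompressStream m))   -- supply the next input chunk
    | CompressOutputAvailable ByteString (m (CompressStream m))
    | CompressStreamEnd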

lpsmith commented 8 years ago

Well, I'm looking at just decompression at the moment, and I don't anticipate needing compression anytime soon on my current project. I should probably code up a proof of concept, though.

lpsmith commented 8 years ago

Hmm, actually, your lzma package looks very nice. I might adopt it in some of my projects; the last time I looked at lzma bindings on Hackage, I decided to stick with piping my data through xz subprocesses instead.

lpsmith commented 8 years ago

Ok, I have a proof of concept done for the decompression side. Take a look at lpsmith/io-streams@f39178b14c29903265e3b18b435e7cd81f4dcbe4.