Open lpsmith opened 8 years ago
I ran across this issue when working on my ftp-monitor project. Anyway, here's a fairly minimal reimplementation of `zcat`, which is sufficient to demonstrate the problem:
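A sketch of the idea, using `io-streams`' `withFileAsInput`, `gunzip`, and `connect` (the original snippet may have differed in its details):

```haskell
module Main (main) where

import System.Environment (getArgs)
import qualified System.IO.Streams as Streams

-- For each file named on the command line: open it as an InputStream,
-- wrap it with Streams.gunzip, and pipe the decompressed bytes to stdout.
main :: IO ()
main = do
    files <- getArgs
    mapM_ (\file ->
             Streams.withFileAsInput file $ \input -> do
                 unzipped <- Streams.gunzip input
                 Streams.connect unzipped Streams.stdout)
          files
```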
Now, if I run this on a big file, it truncates output. (Curiously, it seems to be truncating the output on a newline, on this and several other files.) I'll be investigating this more deeply and see if I can't track down the problem, but it would appear that the bug is with `io-streams`.
Actually, I'm wondering if the `io-streams` zlib integration is incomplete, because it also appears the output is getting truncated on the first day of the month.
Ok, it turns out that it's entirely legitimate to concatenate gz streams, and `gunzip` and `zcat` will unzip each stream and concatenate the results.
So for example:
```
$ echo Hello | gzip >> foo.gz
$ echo World | gzip >> foo.gz
$ ./dist/build/zcat/zcat foo.gz
Hello
$ zcat foo.gz
Hello
World
$ gunzip foo.gz
$ cat foo
Hello
World
```
So my "minimal reimplementation" of `zcat` isn't. One could argue that this isn't a bug in `io-streams`; however, the semantics of `Streams.gunzip` are rather different from those of `gunzip`, which is horrible UX. At the very least, this needs to be mentioned in the documentation, and we probably need to implement a proper `gunzip` analog. Ideally, I think `Streams.gunzip` should be renamed (perhaps to `gunzipOne`?) and the proper analog should be named `Streams.gunzip`.
And, as it turns out, this doesn't work:
```haskell
import Data.ByteString (ByteString)
import Data.IORef (newIORef, readIORef, writeIORef)
import qualified System.IO.Streams as Streams
import System.IO.Streams (InputStream)

gunzips :: InputStream ByteString -> IO (InputStream ByteString)
gunzips = multiTransformInputStream Streams.gunzip

-- Apply a stream transformation repeatedly: whenever the transformed
-- stream is exhausted, re-apply the transformation to the remaining input.
multiTransformInputStream :: (InputStream a -> IO (InputStream b))
                          -> (InputStream a -> IO (InputStream b))
multiTransformInputStream trans input = do
    ref <- newIORef =<< trans input
    Streams.makeInputStream (readS ref)
  where
    readS ref = do
        mb <- Streams.read =<< readIORef ref
        case mb of
            Just _b -> return mb
            Nothing -> do
                isAtEOF <- Streams.atEOF input
                if isAtEOF
                  then return Nothing
                  else do
                      writeIORef ref =<< trans input
                      readS ref
```
The issue is that the first time `multiTransformInputStream` reaches the end of a decompressed `Streams.gunzip` input stream, `Streams.atEOF` returns `True` on the compressed stream, meaning that `Streams.gunzip` consumed the whole stream (7 MB in my first example) but only produced the first decompressed stream (4 of 140 MB in my first example). Which means I can't simply work around this issue; I have to bypass the `io-streams` zlib integration altogether. I no longer think there's an argument that `Streams.gunzip` is correct.
Alright, I inspected `zlib-bindings`, and it was clear that the interface it currently exposes fundamentally does not support this use case. So I jaunted over to the world of `conduit` to see what the story is there; here are the relevant issues: fpco/streaming-commons#20 and snoyberg/conduit#254.
So at the very least, we are going to have to port the patches from `streaming-commons` to `zlib-bindings`, or possibly switch to `streaming-commons`.
Ok, as far as `streaming-commons`'s dependencies are concerned, eyeballing it, it looks like the only transitive dependencies it would add to `io-streams` (that aren't already packaged with GHC) are `async`, `blaze-builder`, `random`, and `stm`. These are all smallish libraries with minimal dependencies of their own.
So, I'd be willing to make the changes necessary to `zlib-bindings`, but I wanted to hear your opinion on moving to `streaming-commons`.
What would we need to add to `zlib` to keep `io-streams` free of `streaming-commons`?
`zlib-bindings` needs the ability to signal the end of a gz stream, and to return any leftover input that was fed to it but isn't part of that stream.
This appears to be the relevant patch to `streaming-commons`: fpco/streaming-commons@b4666864bba4e93bcc30b15352c40def5d29b8ea
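For the sake of discussion, the missing capability might look something like this hypothetical result type (not the actual `zlib-bindings` or `streaming-commons` API):

```haskell
import Data.ByteString (ByteString)

-- Hypothetical sketch of one inflate step's result. The key addition is
-- the StreamEnd case, which both signals that the current gz member has
-- finished and hands back the compressed bytes that were fed in but
-- belong to the next member.
data InflateResult
    = OutputChunk ByteString    -- decompressed output; call again for more
    | InputRequired             -- feed the next compressed chunk
    | StreamEnd ByteString      -- member finished; leftover unconsumed input
```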
@lpsmith btw, I really meant `zlib`; I seem to recall that there were some issues/limitations that kept `io-streams` from switching from `zlib-bindings` (which was declared deprecated, IIRC) to `zlib`.
Not too long ago I imported `zlib` into GitHub at https://github.com/haskell/zlib in the hopes of getting `zlib` to the point where it can pick up former users of `zlib-bindings`.
@hvr, ahh, thank you for the clarification.
Eyeballing it, it would appear that the `DecompressStream` type from the `Codec.Compression.Zlib.Internal` module should be capable of covering this use case from `zlib-0.6` onwards. However, at first glance, I don't see anything in `zlib` versions < 0.6 that would work.
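For example, here's an untested sketch against the `zlib`-0.6 incremental interface, assuming the compressed input arrives as a list of non-empty strict chunks:

```haskell
import qualified Codec.Compression.Zlib.Internal as Z
import qualified Data.ByteString as B

-- Decompress chunks that may contain several concatenated gzip members,
-- restarting the decoder on the leftover input that DecompressStreamEnd
-- hands back.
gunzipConcat :: [B.ByteString] -> IO [B.ByteString]
gunzipConcat = go (Z.decompressIO Z.gzipFormat Z.defaultDecompressParams)
  where
    go (Z.DecompressInputRequired supply) (c:cs) = do
        state <- supply c
        go state cs
    go (Z.DecompressInputRequired supply) [] = do
        state <- supply B.empty          -- an empty chunk signals end of input
        go state []
    go (Z.DecompressOutputAvailable out next) cs = do
        state <- next
        rest  <- go state cs
        return (out : rest)
    go (Z.DecompressStreamEnd leftover) cs =
        -- one member just ended; if any input remains, start a fresh decoder
        case (if B.null leftover then cs else leftover : cs) of
            []   -> return []
            rest -> go (Z.decompressIO Z.gzipFormat Z.defaultDecompressParams) rest
    go (Z.DecompressStreamError err) _ = ioError (userError (show err))
```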
So it looks like, as of last year, there is nothing preventing `io-streams` from depending on `zlib` alone. However, how stable is that interface going to be?
@lpsmith ah, now I think I remember what the missing thing was: flushing support in the compression stream. That's where the API I imitated for `lzma` diverges from `zlib`'s state-machine data type:
http://hackage.haskell.org/package/lzma-0.0.0.2/docs/Codec-Compression-Lzma.html#t:CompressStream
Note the additional field in `CompressInputRequired`; I had to add this to implement `lzma-streams` properly.
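From memory, the shape is roughly the following (a paraphrase; see the linked documentation for the authoritative definition):

```haskell
import Data.ByteString (ByteString)

-- Approximate shape of lzma's CompressStream. The extra flush action
-- alongside the supply-input continuation is what makes it possible to
-- force out pending compressed output mid-stream.
data CompressStream m
    = CompressInputRequired
          (m (CompressStream m))                -- flush pending output now
          (ByteString -> m (CompressStream m))  -- supply the next input chunk
    | CompressOutputAvailable ByteString (m (CompressStream m))
    | CompressStreamEnd
```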
I'll talk to @dcoutts, but I think after adding support for flushing (https://github.com/haskell/zlib/issues/6) we should promote the incremental API into a non-`.Internal` module.
Well, I'm looking at just decompression at the moment, and don't anticipate needing compression anytime soon on my current project. I should probably code up a proof of concept though.
Hmm, actually, your `lzma` package looks very nice. I might adopt it in some of my projects; the last time I looked at lzma bindings on Hackage, I decided to stick to piping my data through `xz` subprocesses instead.
Ok, I have a proof of concept done for the decompression side. Take a look at lpsmith/io-streams@f39178b14c29903265e3b18b435e7cd81f4dcbe4