ruby / zlib

Ruby interface for the zlib compression/decompression library
Other
49 stars 35 forks source link

Zlib::GzipReader#readpartial until eof? causes EOFError #56

Closed martinemde closed 6 months ago

martinemde commented 1 year ago

I'm using the data.tar.gz from ruby-progressbar-1.13.0.gem, which seems to trigger this error. If I take exactly the same data.tar.gz, gunzip, and re-gzip it, then this bug doesn't happen. However, the gem unpacks fine, it's just readpartial that breaks.

$ gem fetch ruby-progressbar -v 1.13.0
$ tar zxf ruby-progressbar-1.13.0.gem data.tar.gz
irb> io = File.open('data.tar.gz')
=> #<File:data.tar.gz>
irb> io.size
=> 10250
irb> Zlib::GzipReader.wrap(io) { |gzio| gzio.readpartial(16_384) until gzio.eof? } # rubygems does this, but with #read
Traceback (most recent call last):
        7: from /usr/bin/irb:23:in `<main>'
        6: from /usr/bin/irb:23:in `load'
        5: from /Library/Ruby/Gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
        4: from (irb):9
        3: from (irb):9:in `wrap'
        2: from (irb):9:in `block in irb_binding'
        1: from (irb):9:in `readpartial'
EOFError (end of file reached)

I have done some digging to verify I didn't cause this. It seems to require that you use this data.tar.gz that has a conflict with how zlib uses readpartial. Other data.tar.gz files work, and this same data.tar file re-gzipped works.

I think the EOF is raised here because it reaches the end of the file during gzfile_read_more(gz, outbuf); and then checks if it's at the end before finalizing the unzip. https://github.com/ruby/zlib/blob/master/ext/zlib/zlib.c#L2933

This does seem like it could be fixed. Using gzio.read(16_384) instead of gzio.readpartial works.

Edit2: The error is real but my understanding of it and why it happened was wrong. Leaving this open with changed title until the fix is merged.

This bug blocks rubygems from using readpartial to read gems.

martinemde commented 1 year ago

I'm running into a similar issue on jruby. It seems that readpartial on a gzipreader wrapped io does not always read as many bytes as you ask it to. This can cause the io.pos to be shifted, starting at the wrong position to line up with the tar header. This happens only if I use readpartial to move through the gzio, but not if I use gzio.read.

segiddins commented 11 months ago

I think that readpartial isnt guaranteed to read exactly len bytes.

From the IO docs: Reads up to maxlen bytes from the stream

also:

When blocked, the method waits for either more data or EOF on the stream:

If more data is read, the method returns the data.

If EOF is reached, the method raises [EOFError](dfile:///Users/segiddins/Library/Application%20Support/Dash/DocSets/Ruby_3/Ruby.docset/Contents/Resources/Documents/EOFError.html).

When not blocked, the method responds immediately:

Returns data from the buffer if there is any.

Otherwise returns data from the stream if there is any.

Otherwise raises [EOFError](dfile:///Users/segiddins/Library/Application%20Support/Dash/DocSets/Ruby_3/Ruby.docset/Contents/Resources/Documents/EOFError.html) if the stream has reached EOF.

So it seems like an EOFError should only be raise if no data was read, which I think is the opposite of the implementation today.

I think I have a fix though, and cutting down on the number of calls needed in a readpartial until eof? loop also seems like a good thing?

segiddins commented 8 months ago

I spent more time looking into this.

I ran gunzip -c test/zlib/fixtures/ruby-progressbar-1.13.0.gem.data.tar.gz | gzip -9 - >test/zlib/fixtures/ruby-progressbar-1.13.0.rezip.gem.data.tar.gz, then diffed the hexdumps of the two files:

❯ diff -U5 --label orig <(hexdump -C test/zlib/fixtures/ruby-progressbar-1.13.0.gem.data.tar.gz) --label new <(hexdump -C test/zlib/fixtures/ruby-progressbar-1.13.0.rezip.gem.data.tar.gz)
--- orig
+++ new
@@ -1,7 +1,7 @@
-00000000  1f 8b 08 00 c7 e8 02 64  02 03 ec 3d 7d 7f da 38  |.......d...=}..8|
+00000000  1f 8b 08 00 47 b5 83 65  02 03 ed 3d 7d 7f da 38  |....G..e...=}..8|
 00000010  d2 fb 37 9f 42 4d 9e 2e  90 25 06 f2 d6 3d b6 34  |..7.BM...%...=.4|
 00000020  4b 12 da 72 9b 84 fc 80  6c af 4f 36 3f d7 80 49  |K..r....l.O6?..I|
 00000030  7c 35 36 67 9b a4 d9 a6  cf 67 7f 66 f4 62 4b 7e  ||56g.....g.f.bK~|
 00000040  83 74 d3 f6 ae 07 db 4d  82 ad 19 8d 46 d2 68 66  |.t.....M....F.hf|
 00000050  34 1a 1d 77 0e db a7 fd  b6 16 7c 08 7e f8 52 9f  |4..w......|.~.R.|
@@ -636,8 +636,8 @@
 000027a0  4c 00 36 75 c7 a5 ed bd  1a cf af 99 72 09 5a ac  |L.6u........r.Z.|
 000027b0  ec 5e 8d fb 22 97 b8 d7  6c e9 d5 75 65 48 7d 4f  |.^.."...l..ueH}O|
 000027c0  eb ff 8d e9 61 bc ec 63  69 00 0b f7 7f f7 b6 e2  |....a..ci.......|
 000027d0  eb ff 4e 7d 95 ff e5 2b  fa ff e4 c5 9d 90 df db  |..N}...+........|
 000027e0  bd 3e 0b ce 2a d6 b5 fa  b6 56 0b b7 59 56 53 7d  |.>..*....V..YVS}|
-000027f0  f5 59 7d 56 9f d5 e7 7b  f9 fc 3f 00 00 00 ff ff  |.Y}V...{..?.....|
-00002800  03 00 19 b6 3b 4c 00 ea  00 00                    |....;L....|
-0000280a
+000027f0  f5 59 7d 56 9f d5 e7 7b  f9 fc 3f 19 b6 3b 4c 00  |.Y}V...{..?..;L.|
+00002800  ea 00 00                                          |...|
+00002803

What's interesting here is, other than one byte at the start of the deflate stream, the only diff is at the end.

So then I ran https://github.com/madler/infgen, and diffed that:

❯ diff -U5 --label orig <(~/Development/github.com/madler/infgen/a.out test/zlib/fixtures/ruby-progressbar-1.13.0.gem.data.tar.gz) --label new <(~/Development/github.com/madler/infgen/a.out test/zlib/fixtures/ruby-progressbar-1.13.0.rezip.gem.data.tar.gz)
--- orig
+++ new
@@ -1,9 +1,10 @@
 ! infgen 3.2 output
 !
 gzip
 !
+last
 dynamic
 litlen 0 10
 litlen 10 7
 litlen 32 5
 litlen 33 9
@@ -4275,16 +4276,9 @@
 match 258 1
 match 258 1
 match 258 1
 match 258 1
 match 200 1
-end
-!
-stored
-end
-!
-last
-fixed
 end
 !
 crc
 length

and this shows us what's happening here.

The original file has 3 different deflate blocks in the stream: the main block, followed by an empty block, followed by a final empty block.

Because our Zlib implementation keeps reading until the output buffer has something in it, it will read through those final 2 empty blocks, try to read again, and then EOFError because finally the stream is (actually) at its end.

I don't know what the fix here should be, other than changing gzfile_read_more to break unconditionally (even when ZSTREAM_BUF_FILLED(&gz->z) == 0), and removing the ZSTREAM_BUF_FILLED check from readpartial as well.

segiddins commented 8 months ago

Minimized test case:

#!/usr/bin/env ruby -w

require 'stringio'
require 'zlib'

contents = "a" * 2030
zd = Zlib::Deflate.new(Zlib::NO_COMPRESSION)
header = "\x1f\x8b\x08\x00\xc7\xe8\x02\x64\x02\x03".b
s = "".b
s << zd.deflate(contents)
s << zd.flush(Zlib::SYNC_FLUSH)
s << zd.finish
zd.close

s = header  << s[2..-5] << [Zlib.crc32(contents)].pack('l<*') << [contents.size].pack('l<*')

read = "".b
Zlib::GzipReader.wrap(StringIO.new(s)) do |gzio| 
  until gzio.eof?
    part = gzio.readpartial(1024) # RAISES
    read << part
  end
end
pp read == contents