python / cpython

The Python programming language
https://www.python.org
Other
62.75k stars 30.07k forks source link

Handle corrupted gzip files with unexpected EOF #41668

Open f80d741b-0877-4790-8e92-d489c0c6cfd4 opened 19 years ago

f80d741b-0877-4790-8e92-d489c0c6cfd4 commented 19 years ago
BPO 1159051
Nosy @loewis, @birkenfeld, @doko42, @larryhastings, @devdanzin, @benjaminp, @serhiy-storchaka
Files
  • t.gz: test gzip file with missing/corrupted CRC checksum
  • t.py: test script
  • gzip_eof.diff: patch
  • test_gzip_error.py
  • gzip_eof-3.4.patch: Patch for 3.4
  • gzip_eof-3.4_2.patch
  • lzma_deferred_error.patch: Defer the error if it is possible to return some data
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/serhiy-storchaka' closed_at = None created_at = labels = ['type-bug', 'library'] title = 'Handle corrupted gzip files with unexpected EOF' updated_at = user = 'https://bugs.python.org/calvin' ``` bugs.python.org fields: ```python activity = actor = 'python-dev' assignee = 'serhiy.storchaka' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'calvin' dependencies = [] files = ['6528', '6529', '6530', '8610', '28727', '28795', '28809'] hgrepos = [] issue_num = 1159051 keywords = ['patch'] message_count = 20.0 messages = ['47892', '47893', '56754', '82185', '143719', '179969', '180265', '180288', '180289', '180399', '180513', '181237', '184259', '184345', '184688', '185555', '188313', '188946', '189002', '189010'] nosy_count = 11.0 nosy_names = ['loewis', 'georg.brandl', 'doko', 'calvin', 'larry', 'ajaksu2', 'nadeem.vawda', 'benjamin.peterson', 'python-dev', 'yeeeev', 'serhiy.storchaka'] pr_nums = [] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue1159051' versions = ['Python 3.4'] ```

    f80d741b-0877-4790-8e92-d489c0c6cfd4 commented 19 years ago
    The GzipFile algorithm crashes when reading a corrupted
    .gz file (attached as t.gz) with a missing CRC checksum
    at the end.
    Tested with python2.3, python2.4 and CVS python on a
    Debian Linux system.
    $ python2.3 t.py
    Traceback (most recent call last):
      File "t.py", line 4, in ?
        print gzip.GzipFile('', 'rb', 9, fileobj).read()
      File "/usr/lib/python2.3/gzip.py", line 217, in read
        self._read(readsize)
      File "/usr/lib/python2.3/gzip.py", line 289, in _read
        self._read_eof()
      File "/usr/lib/python2.3/gzip.py", line 305, in _read_eof
        crc32 = read32(self.fileobj)
      File "/usr/lib/python2.3/gzip.py", line 40, in read32
        return struct.unpack("<l", input.read(4))[0]
    struct.error: unpack str size does not match format

    The attached patch (against current CVS) tries to cope with this situation by a) detecting the missing data by examining the rewind value and b) assuming that EOF is reached and returning the buffered uncompressed data (by raising EOFError)

    For history I encountered this kind of bug when downloading HTML pages with Content-Encoding: gzip. It seems some versions of the mod_gzip Apache module are producing corrupted gzip data.

    61337411-43fc-4a9c-b8d5-4060aede66d0 commented 17 years ago

    I don't think it is right to treat this as EOF. I haven't fully understood the problem (can you explain the format error encountered: what is expected, what is sent instead?), however, I notice that gzip itself also complains about an "unexpected EOF" error, so I don't think we should silently decompress the file. Producing an exception indicating the proper problem and including the data received so far in it would be more appropriate.

    Also, can you provide an example file for the test suite that isn't copyrighted by somebody else?

    f80d741b-0877-4790-8e92-d489c0c6cfd4 commented 16 years ago

    Here is a new test script that works with simple strings and no file objects. It reproduces the error by cutting off the last two bytes of the GZIP data.

    The resulting struct error is due to the read() methods missing a check that the requested amount of data is actually returned. In this case read(4) returned 2 bytes instead of 4, and the struct raises an error.

    I think the easiest way to handle this is to introduce a read_save(fileobj, size) method that checks that the read() data is of the requested size, else raise an error (perhaps an IOError?).

    btw: you can remove the t.{gz,py} files, the test_gzip_error.py replaces them.

    90baf024-6604-450d-8341-d796fe6858f3 commented 15 years ago

    Confirmed on trunk with test_gzip_error.py:

    struct.error: unpack requires a string argument of length 4

    c0702023-e086-42cb-aaca-31d5bbd19e1d commented 13 years ago

    What is the reason that the currently submitted patch is not good enough and current stage is "needs patch"? The current patch seem to solve this issue, which is a very common one when dealing with gzip files coming from the Internet. In any case, an indication on *why* the current patch is not good enough will help create a better patch that may be good enough.

    serhiy-storchaka commented 11 years ago

    At the moment gzip can raise two errors on unexpected EOF: struct.error from struct.unpack() or TypeError from ord(). Both bz2 and lzma raise EOFError in such cases.

    The proposed patch converts both truncated gzip errors to EOFError as for bz2 and lzma. Added similar tests for gzip, bz2 and lzma.

    I doubt about backward compatibility. It's obvious that struct.error and TypeError are unintentional, and EOFError is purposed for this case. However users can catch undocumented but de facto exceptions and doesn't expect EOFError.

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 11 years ago

    I've reviewed the patch and posted some comments on Rietveld.

    I doubt about backward compatibility. It's obvious that struct.error and TypeError are unintentional, and EOFError is purposed for this case. However users can catch undocumented but de facto exceptions and doesn't expect EOFError.

    I think it's fine for us to change it to raise EOFError in these cases.

    serhiy-storchaka commented 11 years ago

    Here is an updated patch addressing Nadeem Vawda's comments. Thank you.

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 11 years ago

    The updated patch looks good to me.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 11 years ago

    New changeset 174332b89a0d by Serhiy Storchaka in branch '3.2': Issue bpo-1159051: GzipFile now raises EOFError when reading a corrupted file http://hg.python.org/cpython/rev/174332b89a0d

    New changeset 87171e88847b by Serhiy Storchaka in branch '3.3': Issue bpo-1159051: GzipFile now raises EOFError when reading a corrupted file http://hg.python.org/cpython/rev/87171e88847b

    New changeset f2f947cdc5fe by Serhiy Storchaka in branch 'default': Issue bpo-1159051: GzipFile now raises EOFError when reading a corrupted file http://hg.python.org/cpython/rev/f2f947cdc5fe

    New changeset 214d8909513d by Serhiy Storchaka in branch '2.7': Issue bpo-1159051: GzipFile now raises EOFError when reading a corrupted file http://hg.python.org/cpython/rev/214d8909513d

    serhiy-storchaka commented 11 years ago

    Actually previous patch doesn't fix original problem, it only ensure that GzipFile consistent with BZ2File and LZMAFile.

    To fix original problem we need other patch, and this patch looks as new feature for 3.4. Here is a sample patch for LZMAFile. BZ2File patch will be similar, and GzipFile patch will be more different and complex.

    Now error doesn't raised immediately when read the file unexpectedly ended if some data can be read. Instead maximal possible part of read data returned and exception raising deferred to next read (see tests).

    Perhaps we need a new flag for constructor or for read() which enables a new behavior (what will be a good name for this?). Or we can use a special value for size argument which means "read to the end as much as possible" (we can differentiate the behavior for size\<0 and size=None). Unconditional enabling a new behavior for size >=0 is safe.

    252699e1-f617-4a8b-9fb0-ae90945f6292 commented 11 years ago

    I think the new behavior should be controlled by a constructor flag, maybe named "defer_errors". I don't like the idea of adding the flag to read(), since that makes us diverge from the standard file interface. Making a distinction between size\<0 and size=None seems confusing and error-prone, not to mention that we (again) would have read() work differently from most other file classes.

    I'd prefer it if the new behavior is not enabled by default for size>=0, even if this wouldn't break well-behaved code. Having a flag that only controls the size\<0 case is inelegant, and I don't think we should change the default behavior unless there is a clear benefit to doing so.

    doko42 commented 11 years ago

    this change breaks a test case in the bzr testsuite; will try to get to it next week. See https://launchpad.net/bugs/1116079

    serhiy-storchaka commented 11 years ago

    tuned_gzip does dangerous things, it overloads private methods of GzipFile.

    From Bazaar 2.3 Release Notes:

    Current version is 2.6b2. bzrlib.tuned_gzip.GzipFile should be removed two releases ago.

    serhiy-storchaka commented 11 years ago

    I will be offline some time. Feel free to revert these changes in 2.7-3.3 if it is necessary.

    doko42 commented 11 years ago

    another test case failure with this patch: https://launchpad.net/ubuntu/+archive/test-rebuild-20130329/+build/4416983

    reproducible with feedparser 5.1.3 from pypi, on x86 (but not x86_64).

    ERROR: test_gzip_struct_error (main.TestCompression) ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "./feedparsertest.py", line 433, in test_gzip_struct_error
        f = feedparser.parse('http://localhost:8097/tests/compression/gzip-struct-error.gz')
      File "/build/buildd/feedparser-5.1.2/feedparser/feedparser.py", line 3836, in parse
        data = gzip.GzipFile(fileobj=_StringIO(data)).read()
      File "/usr/lib/python2.7/gzip.py", line 253, in read
        while self._read(readsize):
      File "/usr/lib/python2.7/gzip.py", line 323, in _read
        self._read_eof()
      File "/usr/lib/python2.7/gzip.py", line 340, in _read_eof
        crc32, isize = struct.unpack("<II", self._read_exact(8))
      File "/usr/lib/python2.7/gzip.py", line 189, in _read_exact
        raise EOFError("Compressed file ended before the "
    EOFError: Compressed file ended before the end-of-stream marker was reached

    Ran 4237 tests in 5.190s

    FAILED (errors=1) ---------------------------------------- Exception happened during processing of request from ('127.0.0.1', 43939)

    serhiy-storchaka commented 11 years ago

    In both cases broken applications use the undocumented implementation details. But such changes are extremely strong for bugfix release and they should not be done without a special need. I propose to revert these changes in 2.7, 3.2 and 3.3 (possibly leaving in the default branch). Unfortunately, I was not online right before the release of the latest bugfix release and failed to do this. Fortunately, it is now possible to fix this in regression fix releases.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 11 years ago

    New changeset abc780332b60 by Benjamin Peterson in branch '2.7': backout 214d8909513d for regressions (bpo-1159051) http://hg.python.org/cpython/rev/abc780332b60

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 11 years ago

    New changeset 854ba6f414a8 by Georg Brandl in branch '3.2': Issue bpo-1159051: Back out a fix for handling corrupted gzip files that http://hg.python.org/cpython/rev/854ba6f414a8

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 11 years ago

    New changeset 9c2831fe84e9 by Georg Brandl in branch '3.3': Back out patch for bpo-1159051, which caused backwards compatibility problems. http://hg.python.org/cpython/rev/9c2831fe84e9

    New changeset 5400e8fbc1de by Georg Brandl in branch 'default': null-merge reversion of bpo-1159051 patch from 3.3 http://hg.python.org/cpython/rev/5400e8fbc1de