mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0
364 stars 7 forks

"Next block offset index is out of sync!" with 0.13.0 on a raw deflate stream #37

Closed pdjstone closed 6 months ago

pdjstone commented 6 months ago

I get the following exception when using rapidgzip 0.13.0 to decompress a specific raw deflate stream: RuntimeError: Next block offset index is out of sync! Requested offset to index 11070 and got 1481048062 but expected 1481048069. If I decompress the same data with the gzip header and footer intact, I don't get the exception. rapidgzip 0.12.1 works fine.

Currently I can only reproduce this with a specific ~300 MB file, and I've not been able to create a reduced test case. I can provide the file if you need it (it's part of a publicly available firmware image), but the following code triggers the issue for me:

import io
import shutil
import sys
import zlib
from tempfile import TemporaryFile
from traceback import print_exc

import rapidgzip

def make_gzip(in_fd, out_fd):
    # wbits=31 makes zlib emit a gzip container: header + deflate stream + footer.
    c = zlib.compressobj(wbits=31)
    while data := in_fd.read(1024*1024):
        out_fd.write(c.compress(data))
    out_fd.write(c.flush())
    in_fd.seek(0)
    out_fd.seek(0)

def strip_gz_header_footer(in_fd, out_fd):
    # Drop the fixed 10-byte gzip header and the 8-byte CRC32/ISIZE footer
    # so that only the raw deflate stream remains.
    in_fd.seek(10)
    shutil.copyfileobj(in_fd, out_fd)
    out_fd.seek(-8, io.SEEK_END)
    out_fd.truncate()

def do_inflate(fd, wbits):
    # wbits=31 expects a gzip container, wbits=-15 a raw deflate stream.
    fd.seek(0)
    d = zlib.decompressobj(wbits)
    last_data = None
    while data := fd.read(1024*1024):
        decomp = d.decompress(data)
        if decomp:
            last_data = decomp
    decomp = d.flush()
    if decomp:
        last_data = decomp
    fd.seek(0)
    print(last_data[-64:])

def do_rapidgzip_inflate(fd):
    try:
        with rapidgzip.open(fd, verbose=True, parallelization=4) as gz_fd:
            gz_fd.seek(-64, io.SEEK_END)
            print(gz_fd.read(64))
    except Exception:
        print_exc()

if __name__ == '__main__':
    with open(sys.argv[1], 'rb') as fd:
        with TemporaryFile('wb+') as temp_gz:
            print('deflating file with gzip headers...')
            make_gzip(fd, temp_gz)

            print('inflating gz with Python zlib..')
            do_inflate(temp_gz, 31)

            print('inflating gz with rapidgzip...')
            do_rapidgzip_inflate(temp_gz) # this works on 0.12.1 and 0.13.0

            with TemporaryFile('wb+') as temp_deflate:
                print('stripping gz header and footer...')
                strip_gz_header_footer(temp_gz, temp_deflate)

                print('inflating raw deflate stream with Python zlib..')
                do_inflate(temp_deflate, -15)

                print('inflating raw deflate stream with rapidgzip...') 
                do_rapidgzip_inflate(temp_deflate) # exception with 0.13.0, but not 0.12.1

mxmlnkn commented 6 months ago

I can provide the file if you need it (it's part of a publicly available firmware image)

Yes, could you please provide the file? It's nice to know that it works in 0.12.1; that means I can do a bisection to find the problematic commit.

pdjstone commented 6 months ago

The file is here - https://drive.google.com/file/d/1XDy3UZkBa0M8SeCriY8_xV_sFuuQIB75/view?usp=sharing

mxmlnkn commented 6 months ago

Thank you. Commit 0df5d753 introduced the bug, or at least made it much more likely to happen: it accidentally requests a chunk size of 1 B for verbose=True (and 0 B for verbose=False), which gets clamped to the allowed minimum of 8 KiB instead of the default of 4 MiB. I have pushed a fix and will do a bugfix release today.

I still want to give the underlying issue a closer look, but I think I already know what happens: a full false positive. It occurs because there is "only" 8 KiB of data that can be checked. If, for example, a match for a non-compressed block is seemingly found, which isn't that unlikely (roughly once every 256 KiB), then the next up to 64 KiB of data cannot deliver anything to check against. Only the next deflate block header could then be checked, and it would already lie outside of the chunk. I guess a minimum chunk size of ~128 KiB would make that far less likely. And to not have to worry about probabilities at all, I could simply implement a fallback that decompresses on demand instead of throwing when such an error is detected.
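
This is not rapidgzip's actual block finder, just a minimal sketch of the kind of plausibility check involved, assuming the standard LSB-first deflate bit packing from RFC 1951: for a non-compressed (stored) block candidate there is essentially only the BTYPE bits and the LEN/NLEN one's-complement field to test, which random data passes from time to time.

def looks_like_stored_block(data: bytes, bit_offset: int) -> bool:
    """Heuristic: could a non-compressed (BTYPE=00) deflate block start at
    bit_offset? Only the BTYPE bits and the LEN/NLEN complement are checked."""
    byte, bit = divmod(bit_offset, 8)
    if byte + 6 > len(data):               # need the header bits plus LEN/NLEN
        return False
    # Deflate packs bits LSB-first: bit 0 is BFINAL, bits 1-2 are BTYPE.
    header = data[byte] >> bit
    if bit > 5:                            # BTYPE spills into the next byte
        header |= data[byte + 1] << (8 - bit)
    if (header >> 1) & 0b11 != 0b00:       # BTYPE must be 00 (stored block)
        return False
    # A stored block is padded to the next byte boundary, then LEN and NLEN follow.
    start = byte + 1 if bit <= 5 else byte + 2
    length = int.from_bytes(data[start:start + 2], 'little')
    nlen = int.from_bytes(data[start + 2:start + 4], 'little')
    return length == (~nlen & 0xFFFF)

If such a candidate is accepted near the end of an 8 KiB chunk, the up to 64 KiB of claimed stored payload reaches past the chunk, so nothing within the chunk is left to verify it against, which is exactly the failure mode described above.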

If you want a quick and dirty workaround without installing from the updated source, try rapidgzip.open(fd, verbose=4 * 1024 * 1024, parallelization=4). It fixes the issue in your reproducer for me.

pdjstone commented 6 months ago

Thanks, that does fix it for me - I hadn't realised I was passing the parameters in incorrectly. However, I also see the issue with 0.13.0 when just doing rapidgzip.open(fd), and I was also seeing generally poor performance on other files even when not hitting that specific exception. Is it choosing an 8 KiB chunk size by default?

mxmlnkn commented 6 months ago

Thanks, that does fix it for me - I hadn't realised I was passing in the parameters incorrectly.

You did everything correctly. The bug was in the rapidgzip Python layer: I added new arguments for the chunk size and the I/O read method, and this broke some calls that only used positional arguments.
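
To make the failure mode concrete, here is a generic sketch; the function names, signatures, and defaults below are made up for illustration and are not rapidgzip's actual Python API.

# Hypothetical old wrapper signature: some calls passed these positionally.
def open_v012(filename, verbose=False, parallelization=0):
    return {'verbose': verbose, 'parallelization': parallelization}

# Hypothetical new wrapper signature: chunk_size was inserted before verbose,
# so an unchanged positional call now feeds its verbose flag into chunk_size.
def open_v013(filename, chunk_size=0, verbose=False, parallelization=0):
    return {'chunk_size': chunk_size, 'verbose': verbose, 'parallelization': parallelization}

# An old-style positional call: True was meant as verbose, 4 as parallelization.
print(open_v013('firmware.bin', True, 4))
# -> {'chunk_size': True, 'verbose': 4, 'parallelization': 0}
# i.e. a 1-byte chunk size, which the backend then clamps up to its minimum.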

Is it choosing an 8 KiB chunk size by default?

The C++ backend does std::max( 8_Ki, chunkSize ), and because of the argument bug, chunkSize on the Python side defaulted to 0. This should be fixed now, and performance should be restored as well. The default should be 4 MiB.
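
In Python terms, the clamp described above behaves like this (the constant names are illustrative; the actual fix was on the argument-passing side, not in the clamp itself):

MIN_CHUNK_SIZE = 8 * 1024              # hard lower bound enforced by the backend
DEFAULT_CHUNK_SIZE = 4 * 1024 * 1024   # the intended default

def clamped_chunk_size(requested):
    # The mis-passed value (0 or 1) was clamped up to the 8 KiB minimum
    # instead of ever reaching the intended 4 MiB default.
    return max(MIN_CHUNK_SIZE, requested)

assert clamped_chunk_size(0) == 8 * 1024                             # verbose=False case
assert clamped_chunk_size(1) == 8 * 1024                             # verbose=True case
assert clamped_chunk_size(DEFAULT_CHUNK_SIZE) == DEFAULT_CHUNK_SIZE  # explicit workaround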