mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0
345 stars 7 forks source link

Hang/deadlock when seeking in gzip streams in version 0.11 #27

Closed pdjstone closed 6 months ago

pdjstone commented 6 months ago

After the fix for issue #26, I get a reproducible hang when rapidly seeking forwards and backwards in the gzip file. This doesn't happen with the previous release.

The following code reliably reproduces the issue for me:

import os
from gzip import GzipFile
from random import randbytes
from tempfile import NamedTemporaryFile

import rapidgzip

def create_temp_gzip_file(size_mb=2):
    with NamedTemporaryFile('wb', suffix='.gz', delete=False) as tf:
        with GzipFile(mode='wb', fileobj=tf, compresslevel=2) as gz:
            for _ in range(size_mb):
                gz.write(randbytes(1024*1024))
    return tf.name

class myfile:
    def __init__(self, fd):
        self.fd = fd

    def seek(self, offset, whence=0) -> int:
        return self.fd.seek(offset, whence)

    def read(self, amt=-1) -> bytes:
        return self.fd.read(amt)

    def tell(self) -> int:
        return self.fd.tell()

    def seekable(self):
        return self.fd.seekable()

    def readable(self):
        return self.fd.readable()

    def writable(self):
        return self.fd.writable()

    def write(self, data):
        raise NotImplemented()

    def close(self):
        self.fd.close()

if __name__ == '__main__':
    print('creating random gz file...')
    # hang doesn't seem to happen for smaller sizes (e.g. 2MB)
    temp_gz = create_temp_gzip_file(size_mb=100)

    print('opening gz...')
    with open(temp_gz, 'rb') as fp:
        gzip_fd = myfile(fp) # hang only seems to happen with a Python wrapper
        with rapidgzip.open(gzip_fd, verbose=True) as fd:
            for i in range(100): # hang normally happens for me before the 10th iteration
                print(i)
                fd.seek(0)
                fd.seek(1000)

    print('done')
    os.unlink(temp_gz)
mxmlnkn commented 6 months ago

Thanks for reporting and the reproducer.

This is a common situation and C++ would have support for locking multiple mutexes, but the GIL doesn't fit this scheme and, furthermore, because of the abstraction layers, no code point knows that two locks have to be locked at once.

I don't see why this should have changed with 0.11.0 from 0.10.4, but it doesn't matter. I'm so close to simply reverting all those pains that the "fix" for #24 introduced...

mxmlnkn commented 6 months ago

I have tried to fix that deadlock by requiring to have the GIL before trying to lock SharedFileReader but now it deadlocks in some other way:

The problem seems that Thread 3 at some point inside PyObject_Call releases the GIL and Thread 2 snags it right up. I accidentally found a quote that might explain why it happens:

In order to emulate concurrency of execution, the interpreter regularly tries to switch threads (see sys.setswitchinterval()). The lock is also released around potentially blocking I/O operations like reading or writing a file, so that other Python threads can run in the meantime.

This is a nightmare.

The reversed order also does not work in case the GIL has already been acquired outside, which happens when the main Python thread calls a relevant function. To avoid that, we need to unlock the GIL first in order to force the correct lock ordering.

With all that and lots of removed comments, the atrocity to avoid the deadlock looks like this:

struct FileLock
{
    explicit
    FileLock( std::mutex& mutex ) :
        m_fileLock( mutex )
    {}

private:
#ifdef WITH_PYTHON_SUPPORT
    const ScopedGILUnlock m_globalInterpreterUnlock;
#endif
    const std::unique_lock<std::mutex> m_fileLock;
#ifdef WITH_PYTHON_SUPPORT
    const ScopedGILLock m_globalInterpreterLock;
#endif
};

I have pushed the fix to https://github.com/mxmlnkn/indexed_bzip2/tree/zlib-support

Can I copy-paste your reproducer into the CI Python tests (src/tests/testPythonWrappers.py) or do you want to open a PR for proper author attributions?

pdjstone commented 6 months ago

Happy for you to just copy-paste the code. I don't envy you tracking down and fixing these hairy Python threading/GIL edge cases. Thanks for the project though, it's been fantastically useful for me.

pdjstone commented 6 months ago

Thanks, the fix works for me with both my minimised test case and my more complex original code

mxmlnkn commented 6 months ago

Thank you for testing. It will be released as 0.11.1 shortly.