mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

Sigabort() on BitBuffer::peekUnsafe on master with packaged ZLIB #41

Status: Open. esahione opened this issue 4 months ago

esahione commented 4 months ago

Hi,

I'm iterating over a few thousand gzipped files and I randomly get the following SIGABRT at line 716 of BitReader:

BitBuffer BitReader<MOST_SIGNIFICANT_BITS_FIRST, BitBuffer>::peekUnsafe(bit_count_t) const [with bool MOST_SIGNIFICANT_BITS_FIRST = false; BitBuffer = long unsigned int; bit_count_t = unsigned int]: Assertion bitsWanted > 0 failed.

I'm using the latest master version and calling it from C++. It happens at random, without a particular pattern, but consistently enough that it aborts after about 130-150 files.

Here's how I'm calling it:

bool extract_date_file(std::string date) {
    std::string input_filename = get_input_filename(date);
    std::string output_filename = get_output_filename(date);
    UniqueFileReader file_reader = std::make_unique<StandardFileReader>(input_filename);

    const auto processor_count = 8; // 8 threads per file
    const size_t n_threads = processor_count; // use a lot of concurrency
    const size_t reader_chunk_size = (1ull << 20); // 1 MiB chunks; the 4 MiB default might be too large...

    rapidgzip::ParallelGzipReader<> reader(std::move(file_reader), n_threads, reader_chunk_size);
    std::unique_ptr<OutputFile> output_file = std::make_unique<OutputFile>(output_filename);

    const auto output_file_descriptor = output_file ? output_file->fd() : -1;
    if (output_file_descriptor == -1) {
        std::cout << "[FILE_EXTRACTION] Error in getting output file descriptor..." << std::endl;
        return false;
    }

    std::function<void( const std::shared_ptr<rapidgzip::ChunkData>&, size_t, size_t )> write_functor =
        [output_file_descriptor]
        ( const std::shared_ptr<rapidgzip::ChunkData>& chunkData,
            size_t const                               offsetInBlock,
            size_t const                               dataToWriteSize )
        {
            const auto errorCode = rapidgzip::writeAll(chunkData, output_file_descriptor, offsetInBlock, dataToWriteSize);
            if ( errorCode == EPIPE ) {
                // A bare `throw;` without an active exception calls std::terminate,
                // so throw a concrete exception instead (requires <stdexcept>).
                throw std::runtime_error( "Got EPIPE while writing the decompressed chunk" );
            }
        };

    const size_t total_bytes_read = reader.read(write_functor);
    if (total_bytes_read == 0) {
        std::cout << "[FILE_EXTRACTION] No bytes read in decompression - failure in loading data..." << std::endl;
        return false;
    }

    if (output_file) {
        output_file->truncate(total_bytes_read);
        output_file.reset();
    }
    std::cout << "[FILE_EXTRACTION] Successful extraction..." << std::endl;
    return true;
}

I'm on Linux and using the C++ library (as per the code above). The zlib version is the one packaged with rapidgzip (I switched away from my system zlib to see whether that would stop the error, to no avail). Each gzipped file is anywhere from 300 MB to 5 GB.

Any help would be appreciated - I have no idea where to begin since the backtrace is quite large and I'm not familiar with the internals of the library.

Thanks!

mxmlnkn commented 3 months ago

I'm sorry for your trouble.

First off, I'm surprised that you are running in debug mode with asserts enabled in the first place, as it is multiple times slower than release mode in my experience.
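
For context, here is a minimal sketch (not taken from rapidgzip) of why a release build would not hit this at all: assert() expands to nothing once NDEBUG is defined, which is what typical release configurations do, so the check neither costs time nor aborts.

#include <cassert>

// Illustrative only: with NDEBUG defined (the usual release configuration),
// assert() expands to nothing, so this check can neither slow the code down
// nor abort the process.
unsigned requireNonZero( unsigned bitsWanted )
{
    assert( bitsWanted > 0 );  // only active in debug builds
    return bitsWanted;
}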

As for how you can help me debug this:

esahione commented 3 months ago

Thanks! Will do the above. By the way, if I set the chunk size to (1ull<<24) I don't get the issue as frequently. At (1ull<<16) I get it as frequently or maybe slightly more. Maybe there are empty chunks being sent to the pool?

I'll let you know what I can do re the above and put it here.

mxmlnkn commented 3 months ago

> Thanks! Will do the above. By the way, if I set the chunk size to (1ull<<24) I don't get the issue as frequently. At (1ull<<16) I get it as frequently or maybe slightly more. Maybe there are empty chunks being sent to the pool?

1 << 16 is 64 KiB. I think the chunk size will be forced to a minimum of 128 KiB, i.e., setting it to anything lower will always result in a chunk size of 128 KiB. If the chunk size becomes too small, the number of false positives increases because there is an insufficient amount of data in a single chunk to check for redundancy. That's why I limited the chunk size like this. Furthermore, the overhead of searching for deflate block starts also becomes relatively larger for small chunk sizes. This is also the reason for the relatively large default of 4 MiB: it simply yields the best performance in my benchmarks. I don't think empty chunks have much to do with this, although I'm not quite sure what you mean by "empty".
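
To illustrate the clamping I mean, here is a sketch with hypothetical names, not the actual library code: a requested chunk size below the floor is simply raised to 128 KiB, and 4 MiB is used when nothing is specified.

#include <algorithm>
#include <cstddef>

// Sketch only: hypothetical constants and helper illustrating the clamping
// described above; the real logic lives inside ParallelGzipReader.
constexpr std::size_t MINIMUM_CHUNK_SIZE = 128ULL * 1024;        // 128 KiB floor
constexpr std::size_t DEFAULT_CHUNK_SIZE = 4ULL * 1024 * 1024;   // 4 MiB default

std::size_t effectiveChunkSize( std::size_t requestedChunkSize )
{
    // Treat 0 as "use the default"; anything smaller than the minimum is raised to it.
    if ( requestedChunkSize == 0 ) {
        return DEFAULT_CHUNK_SIZE;
    }
    return std::max( requestedChunkSize, MINIMUM_CHUNK_SIZE );
}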

The triggered assert is of questionable importance. It checks that the caller requests 1 or more bits; there should be no code trying to read 0 bits. That is good and required behavior because I removed a branch for performance reasons. That branch would be necessary because of the undefined behavior of a shift such as uint64_t(1) << ( 64 - 0 ), i.e., a shift as large as the bit width or larger. Normally, I would expect the result to be 0, but thanks to undefined behavior, the result will probably be arbitrary / the call may be optimized out completely. Then again, if 0 bits are requested, it may not even matter much to the caller what kind of value is returned... Therefore, it would be really good to know which code actually triggers this. It might even expose a bug at some other place. Otherwise, I could simply add an if ( bitsWanted == 0 ) { return 0; } check, remove the assert, and be done with it.
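
As a sketch of the trade-off (illustrative only, not the exact code in BitReader): shifting a 64-bit value by 64 - bitsWanted is undefined behavior when bitsWanted is 0, which is exactly why the branchless version has to assert bitsWanted > 0, while the defensive variant pays one extra branch.

#include <cassert>
#include <cstdint>

// Branchless sketch: requires bitsWanted >= 1 because a shift by 64 bits
// on a 64-bit value is undefined behavior.
uint64_t peekLowestBitsUnsafe( uint64_t buffer, unsigned bitsWanted )
{
    assert( bitsWanted > 0 );
    return buffer & ( ~uint64_t( 0 ) >> ( 64U - bitsWanted ) );
}

// Defensive sketch: one extra branch, but well-defined for bitsWanted == 0.
uint64_t peekLowestBits( uint64_t buffer, unsigned bitsWanted )
{
    if ( bitsWanted == 0 ) {
        return 0;
    }
    return buffer & ( ~uint64_t( 0 ) >> ( 64U - bitsWanted ) );
}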

mxmlnkn commented 3 months ago

Did you find time to generate a backtrace? That would help me the most. The other two debugging suggestions are only if the backtrace doesn't clear up anything. I may even be able to reconstruct a synthetic reproducer from the backtrace.

esahione commented 3 months ago

Apologies, busy with a newborn. Will get to it soon.