mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

Support for zlib file format (RFC1950) and raw deflate streams #9

Closed mxmlnkn closed 9 months ago

mxmlnkn commented 1 year ago

Deflate streams are already supported because they are what gzip files contain; to support files consisting of raw deflate streams, I would only have to skip the header and footer reading. The zlib format is just another container around deflate and would require different readHeader and readFooter methods. Maybe I could abstract readHeader and readFooter to cover all three cases.
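For illustration only (this is not rapidgzip's implementation), the relationship between the three containers can be sketched with Python's stdlib zlib, which already handles all of them via the wbits parameter; the helper names below are made up, and the header check follows RFC 1950/1952:

import zlib

def detect_container(data: bytes) -> str:
    """Guess whether the data starts a gzip member, a zlib stream, or raw deflate."""
    if data[:2] == b"\x1f\x8b":
        return "gzip"
    # A zlib header (RFC 1950) has compression method 8 in the low nibble of the
    # first byte, and its first two bytes, read as a big-endian number, are a multiple of 31.
    if len(data) >= 2 and (data[0] & 0x0F) == 8 and (data[0] * 256 + data[1]) % 31 == 0:
        return "zlib"
    return "deflate"  # raw deflate (RFC 1951) has no header to check against

def decompress_any(data: bytes) -> bytes:
    """Decompress by choosing the wbits value that matches the detected container."""
    wbits = {"gzip": 16 + 15, "zlib": 15, "deflate": -15}[detect_container(data)]
    return zlib.decompressobj(wbits=wbits).decompress(data)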

ap-- commented 1 year ago

Hi @mxmlnkn,

I think I have a use case for this and would greatly appreciate your opinion on the issue I am currently facing.

I am reading data from bigwig files (https://genome.ucsc.edu/goldenPath/help/bigWig.html), which contain genome information stored in many zlib-compressed blocks. Each block in the bigwig file stores up to 1024 (rows) * 12 (uint32, uint32, float32) + 24 (header) = 12312 bytes (uncompressed) and is compressed via zlib (default compression, header: 0x78 0x9c). The compressed size per block varies between 4 and 6 kilobytes. Some blocks contain fewer rows, which is apparent from the decompressed header and, of course, the smaller uncompressed result data.

I now need to quickly iterate over uniformly sampled batches (batch_size ~1000) of these blocks. One batch ends up being 1000 zlib-compressed chunks, concatenated to ~5 MB compressed and ~12 MB uncompressed. It currently takes ~30 milliseconds to decompress one batch, and ideally I would need to shave off a factor of 10.

I have implemented some comparison code here https://github.com/ap--/concatenated-zlib to compare Python's stdlib zlib against a Cython wrapper around zlib and zlib-ng, but they run equally fast, which I somehow didn't expect... A very crude benchmark is here.

Do you have any suggestions on how this could be done in the fastest way?

Thank you for your time! Cheers, Andreas 😃

mxmlnkn commented 1 year ago

I have implemented some comparison code here https://github.com/ap--/concatenated-zlib

The mentioned cd bench; python time_load_chunks.py does not work out of the box ;). I had to do something more akin to:

git clone --recursive https://github.com/zlib-ng/zlib-ng.git
cd zlib-ng/ && mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/opt/zlib-ng ..
make -j $( nproc ) install
export CPATH=/opt/zlib-ng/include
export LD_LIBRARY_PATH="/opt/zlib-ng/lib:$LD_LIBRARY_PATH"
export LIBRARY_PATH="/opt/zlib-ng/lib:$LIBRARY_PATH"

git clone --recursive https://github.com/ap--/concatenated-zlib.git
cd concatenated-zlib/
python3 -m pip install --user .
cd bench; python time_load_chunks.py 

Results:

True
True
overhead took 0.0001824248197954148 seconds
stdlib.zlib took 0.03630780723062344 seconds
zlib_concat_decode took 0.03545635042944923 seconds
zlibng_concat_decode took 0.02819499616045505 seconds

I don't find it surprising that stdlib.zlib and zlib_concat_decode are similarly fast because CPython's zlib module probably also just wraps zlib, or am I missing something? It seems, however, that contrary to your results, zlib-ng is visibly faster than zlib on my machine.

You could also try libdeflate, which is supposedly quite fast, but it might be harder to use because its API differs from zlib's. The fastest gzip decompressor known to me, igzip, "only" achieves ~800 MB/s for files with a compression ratio of 3, which is comparable to your use case.

If all those 1000 zlib chunks per batch are compressed independently, then parallel decompression should also be possible with less complexity than rapidgzip. You would only need to know the positions at which these chunks begin. If you don't know them, the problem becomes more difficult because you would have to search for them, which is what rapidgzip does...
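As a rough sketch of that idea (not rapidgzip and not the benchmark code from the repository; the function and parameter names are made up), a plain thread pool over stdlib zlib already parallelizes well when the chunk offsets are known, because CPython's zlib releases the GIL while inflating:

import zlib
from concurrent.futures import ThreadPoolExecutor

def decompress_known_chunks(buffer, chunk_offsets_and_sizes, workers=8):
    """Decompress independently compressed zlib chunks in parallel.

    chunk_offsets_and_sizes: iterable of (offset, compressed_size) pairs, assumed
    to be known in advance, e.g., from the bigwig index.
    """
    def inflate(entry):
        offset, size = entry
        return zlib.decompress(buffer[offset:offset + size])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(inflate, chunk_offsets_and_sizes))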

Aside from parallelization, I don't think a speedup of 10 is possible, especially as the 30 ms per 12 MB already translate to 400 MB/s, which, in my experience, is already fast. However, chunks of 5 MB are too small to parallelize. At this point, you could only parallelize the decompression of multiple batches, and that would be embarrassingly parallel, so it probably is not the solution you want to hear...

One avenue for optimization with lots of short compressed data parts would be to reduce setup overhead, but it seems like you are already calling zng_inflateInit only once per stream instead of once per zlib chunk.
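For reference, a stdlib-only way to walk a concatenation of zlib streams with one decompressor per stream looks roughly like this (a sketch, not the code from the concatenated-zlib repository):

import zlib

def iter_concatenated_zlib(data: bytes):
    """Yield the decompressed payload of each zlib stream in a concatenation."""
    while data:
        decompressor = zlib.decompressobj(wbits=15)
        yield decompressor.decompress(data)
        # unused_data holds everything after the end of the current stream.
        data = decompressor.unused_data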

ap-- commented 1 year ago

Thank you so much for the pointers! :heart:

libdeflate is indeed faster by quite a margin. Even when I allocate the libdeflate decompressor for every chunk, I see a significant speedup:

stdlib.zlib took 0.029090170830022542 seconds
zlib_concat_decode took 0.029393986669892912 seconds
zlibng_concat_decode took 0.028708101249940228 seconds
libdeflate_zlib_decompress took 0.01889676667022286 seconds

I'll play around with this a bit more and will check if I can parallelize better.

Have a great week 😃 Andreas

mxmlnkn commented 1 year ago

Glad I could help somewhat. On my system (AMD Ryzen 9 3900X), the results for libdeflate are roughly the same as for zlib-ng:

stdlib.zlib took 0.03618937029037625 seconds
zlib_concat_decode took 0.035477809429867196 seconds
zlibng_concat_decode took 0.028111176129896193 seconds
zlib_multi_decompress took 0.06991837109671906 seconds
libdeflate_zlib_decompress took 0.0272582576808054 seconds

mxmlnkn commented 1 year ago

Some further thoughts before I give this issue another try:

ap-- commented 1 year ago

Hi @mxmlnkn

FYI: I ended up doing this as you described in your third bullet point, by stripping the zlib headers from each chunk. After a bit of experimenting, I opted for using Nvidia's nvcomp library through kvikio to decode large batches of chunks directly on the GPU. We use the data downstream for model training, so the data doesn't need to be copied back to the host.

I have a fork of kvikio with the pieces added to support deflate decompression here: https://github.com/ap--/kvikio/tree/with-deflate (I need to find some time to clean up the small changes and upstream them with tests).

pdjstone commented 9 months ago

Having support for raw deflate streams would be very useful for me. With my use case, I already know the compressed and decompressed size of the stream beforehand.

pdjstone commented 9 months ago

I managed to do this in Python without requiring any modification to rapidgzip. I made a class that wraps a raw deflate stream to provide a file-like object that includes the gzip header and footer that rapidgzip needs. Here's a gist that uses this to extract an item from a ZIP file. The gzip footer includes the CRC32 of the uncompressed file, which the ZIP format handily includes.

https://gist.github.com/pdjstone/29b7ea3455d05e637573c0f3c1bdffdf

It would still be useful to have a way to pass a raw deflate stream into rapidgzip, along with the uncompressed size and CRC (if known).
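The framing such a wrapper has to synthesize is small. The gist builds a seekable file-like object; the sketch below only shows the byte layout involved, assuming the CRC32 and uncompressed size are known (as they are in a ZIP central directory), and the function name is hypothetical:

import struct

def wrap_deflate_as_gzip(raw_deflate: bytes, crc32: int, uncompressed_size: int) -> bytes:
    """Frame a raw deflate stream as a single-member gzip file (RFC 1952)."""
    header = (
        b"\x1f\x8b"             # gzip magic bytes
        b"\x08"                 # CM = 8 (deflate)
        b"\x00"                 # FLG = 0 (no extra fields, no name, no comment)
        b"\x00\x00\x00\x00"     # MTIME = 0 (unknown)
        b"\x00"                 # XFL
        b"\xff"                 # OS = 255 (unknown)
    )
    footer = struct.pack("<II", crc32 & 0xFFFFFFFF, uncompressed_size % (1 << 32))
    return header + raw_deflate + footer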

mxmlnkn commented 9 months ago

Passing the checksum seems reasonable, but why pass the uncompressed size? Because of the parallelization structure, it couldn't be used for optimization anyway, unfortunately. Aside from that, I'm working on this issue. Theoretically, it only requires checking for more magic bytes, but depending on the detected file type there are also some choices to make, e.g., how to handle multiple concatenated gzip streams, how to read the footer, and how to compute the correct checksum.

pdjstone commented 9 months ago

If the uncompressed size isn't needed, great. I assume the checksum could be ignored if we don't have it?

mxmlnkn commented 9 months ago

I assume the checksum could be ignored if we don't have it?

Yes. A getter probably makes sense so that it can be verified outside if necessary.

mxmlnkn commented 9 months ago

I have implemented support for zlib and raw deflate files / inputs in the https://github.com/mxmlnkn/indexed_bzip2/tree/zlib-support branch. You can try it out with:

python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@zlib-support#egg=rapidgzip&subdirectory=python/rapidgzip'

Please report back any problems.
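For a quick smoke test of the branch, opening a zlib or raw deflate file should not require any special flags because the container format is detected automatically (the file name below is a placeholder):

import rapidgzip

# The container (gzip, zlib, or raw deflate) is detected automatically.
with rapidgzip.open("data.raw-deflate") as file:
    decompressed = file.read()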

I'm still working on support for arbitrary concatenations of deflate and zlib streams. Such concatenations are not allowed per se, but I would find them useful for implementing zip support via a single RapidgzipFile object, and therefore with only one shared cache to reduce memory overhead.

mxmlnkn commented 9 months ago

I managed to do this in Python without requiring any modification to rapidgzip. I made a class that wraps a raw deflate stream to provide a file-like object that includes the gzip header and footer that rapidgzip needs. Here's a gist that uses this to extract an item from a ZIP file. The gzip footer includes the CRC32 of the uncompressed file, which the ZIP format handily includes.

https://gist.github.com/pdjstone/29b7ea3455d05e637573c0f3c1bdffdf

It would still be useful to have a way to pass a raw deflate stream into rapidgzip, along with the uncompressed size and CRC (if known)

Btw, ratarmount has some classes similar to what you are doing: https://github.com/mxmlnkn/ratarmount/blob/master/core/ratarmountcore/StenciledFile.py

pdjstone commented 9 months ago

I have implemented support for zlib and raw deflate files / inputs in the https://github.com/mxmlnkn/indexed_bzip2/tree/zlib-support branch.

This works well for me. It seems to automatically detect that the stream is raw deflate rather than gzip and just works. There aren't (yet) any extra methods exposed to the Python side to tell it what the stream type is or to provide the CRC; is this correct?

mxmlnkn commented 9 months ago

I have implemented support for zlib and raw deflate files / inputs in the https://github.com/mxmlnkn/indexed_bzip2/tree/zlib-support branch.

This works well for me - it seems to automatically detect that the stream is raw deflate rather than gzip and just works. There aren't (yet) any extra methods exposed to the Python side to tell it what the stream type is, or provide the CRC, is this correct?

Yes, that is correct. I'm struggling a bit with how to do this correctly, especially in the case that there are multiple gzip/zlib/deflate streams in the same file (while multiple zlib/deflate streams are not allowed by the standards, multiple gzip streams are allowed and often used). I would have to propagate a checksum for each stream somehow, either accepting them via the API or returning them for the caller to verify. Also, zlib would require Adler-32 checksums, which are not supported yet.

The stream type should be easier to expose because currently mixed stream types in the same file are not allowed.

There is also the whole index issue. If an index is loaded, then checksum verification becomes more difficult because not all preceding data may be read when requesting some later data, which makes checksum computation impossible. I wanted to add a checksum for each chunk / index seek point to still enable some kind of verification in that case, but this does not help with the API, which should return checksums for each stream.

What I'm currently imagining is an API like this: setDeflateStreamCRC32( size_t endOfStreamOffset, uint32_t checksum ). The CRC32 verification code could then check the database of configured checksums when the stream end has been encountered. I would associate the checksum with the stream end instead of the stream start because it is easier that way for the code, which might not know the stream start offset anymore at verification time. I could even integrate this into a dummy footer reading routine, which would only know the end-of-stream offset, like this.

pdjstone commented 9 months ago

For my use case, it works well as is - I don't really care about the CRC as I know my input data is correct. I wasn't sure with the algorithm that rapidgzip uses whether the output is always guaranteed to be correct, or if there are edge cases where the CRC catches invalid decompression.

mxmlnkn commented 9 months ago

For my use case, it works well as is - I don't really care about the CRC as I know my input data is correct. I wasn't sure with the algorithm that rapidgzip uses whether the output is always guaranteed to be correct, or if there are edge cases where the CRC catches invalid decompression.

In general, for any gzip decompression, CRC verification can catch errors that would otherwise go unnoticed. In practice, I think it is rare, though. The CRC is computed on the output, not the input. Assuming a correct decoder, backed by many unit and regression tests, the CRC should only mismatch when the input has become corrupted. A corrupted input will, in my experience, often have avalanche effects that lead to error detection based solely on the assumed input or output length; the same can be said about decoder bugs. For example, if there is a single bit flip, then the Huffman decoding might stop sooner than it should, which would be detected. CRC32 might still help to detect bit flips in non-compressible parts, though.

mxmlnkn commented 9 months ago

@pdjstone I have fixed the last remaining issues and implemented a getter for RapidgzipFile: file_type, and two setters: set_deflate_stream_crc32s(self, crc32s) and add_deflate_stream_crc32(self, end_of_stream_offset_in_bytes, crc32). Your gist now works for me like this with the zlib-support branch:

import os
import shutil
import sys
from zipfile import ZipFile, ZIP_DEFLATED, ZipExtFile

from ratarmountcore.StenciledFile import StenciledFile
import rapidgzip

class RapidZip(ZipFile):
    def __init__(self, file):
        super().__init__(file, 'r')

    def open(self, item):
        if isinstance(item, str):
            item = super().getinfo(item)
        zef : ZipExtFile = super().open(item)

        if item.compress_type == ZIP_DEFLATED:
            # Expose only the raw deflate bytes of this ZIP member as a file-like object.
            deflate_fd = StenciledFile([(self.fp, zef._orig_compress_start, zef._orig_compress_size)])
            wrapped = rapidgzip.open(deflate_fd, verbose=True)
            # Register the CRC32 from the ZIP metadata so rapidgzip can verify the output.
            wrapped.add_deflate_stream_crc32(zef._orig_compress_size, item.CRC)
            return wrapped
        print("Warning - not a deflate stream")
        return zef

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print(f'Usage: {sys.argv[0]} zipfile itemname [outfile]')
        sys.exit(-1)

    zip = RapidZip(sys.argv[1])

    if len(sys.argv) < 3:
        for n in zip.namelist():
            print(n)
        print('Please specify a file to extract')
        sys.exit(-1)

    out_name = item_name = sys.argv[2]
    if len(sys.argv) >= 4:
        out_name = sys.argv[3]

    out_dir = os.path.dirname(out_name)
    if out_dir:  # avoid os.makedirs('') when the output path has no directory part
        os.makedirs(out_dir, exist_ok=True)
    with open(out_name, 'wb') as out_fd:
        with zip.open(item_name) as item_fd:
            shutil.copyfileobj(item_fd, out_fd)

The CRC32 is verified, as can be tested by specifying a wrong CRC32.

mxmlnkn commented 9 months ago

@ap-- I have revisited your problem with many small concatenated zlib streams. It didn't work performantly right after adding zlib support because those zlib streams are so small that they contain only a single (final) deflate block, and final blocks were not searched for by the block finder. After adding support for finding those final blocks, which almost halves the block finder performance, it now works surprisingly well. The results for my notebook are below.

I have increased the problem size from 4.9 MB to 196 MB to get some parallelization going with rapidgzip and its (adjustable) 4 MiB chunk size:

Concatenated size: 196.92 MB
overhead took 0.0 seconds
stdlib.zlib took 1.97 seconds
stdlib.zlib without joining took 1.422 seconds
rapidgzip_decompress from BytesIO read and discard 1.0 MiB chunks parallelization=1 took 1.406 seconds
rapidgzip_decompress from BytesIO read and discard 1.0 MiB chunks parallelization=2 took 0.775 seconds
rapidgzip_decompress from BytesIO read and discard 1.0 MiB chunks parallelization=4 took 0.439 seconds
rapidgzip_decompress from BytesIO read and discard 1.0 MiB chunks parallelization=8 took 0.365 seconds
rapidgzip_decompress from BytesIO read and discard 1.0 MiB chunks parallelization=16 took 0.361 seconds
rapidgzip_decompress from BytesIO parallelization=1 took 2.098 seconds
rapidgzip_decompress from BytesIO parallelization=2 took 1.603 seconds
rapidgzip_decompress from BytesIO parallelization=4 took 1.048 seconds
rapidgzip_decompress from BytesIO parallelization=8 took 1.166 seconds
rapidgzip_decompress from BytesIO parallelization=16 took 1.312 seconds
rapidgzip_decompress from file parallelization=1 took 2.088 seconds
rapidgzip_decompress from file parallelization=2 took 1.669 seconds
rapidgzip_decompress from file parallelization=4 took 0.988 seconds
rapidgzip_decompress from file parallelization=8 took 1.133 seconds
rapidgzip_decompress from file parallelization=16 took 1.27 seconds
zlib_concat_decode took 1.954 seconds
zlibng_concat_decode took 1.635 seconds
zlib_multi_decompress took 2.018 seconds
libdeflate_zlib_decompress took 1.374 seconds

Here, rapidgzip is 3-5x faster than the alternatives on 8 physical cores, which is faster in absolute terms but less efficient. These are the results for the original problem size:

Concatenated size: 4.923 MB
overhead took 0.0 seconds
stdlib.zlib took 0.052 seconds
stdlib.zlib without joining took 0.037 seconds
rapidgzip_decompress from BytesIO read and discard 1.0 MiB chunks parallelization=1 took 0.043 seconds
rapidgzip_decompress from BytesIO read and discard 1.0 MiB chunks parallelization=2 took 0.06 seconds
rapidgzip_decompress from BytesIO read and discard 1.0 MiB chunks parallelization=4 took 0.065 seconds
rapidgzip_decompress from BytesIO read and discard 1.0 MiB chunks parallelization=8 took 0.068 seconds
rapidgzip_decompress from BytesIO read and discard 1.0 MiB chunks parallelization=16 took 0.066 seconds
rapidgzip_decompress from BytesIO parallelization=1 took 0.054 seconds
rapidgzip_decompress from BytesIO parallelization=2 took 0.084 seconds
rapidgzip_decompress from BytesIO parallelization=4 took 0.084 seconds
rapidgzip_decompress from BytesIO parallelization=8 took 0.051 seconds
rapidgzip_decompress from BytesIO parallelization=16 took 0.084 seconds
rapidgzip_decompress from file parallelization=1 took 0.053 seconds
rapidgzip_decompress from file parallelization=2 took 0.083 seconds
rapidgzip_decompress from file parallelization=4 took 0.051 seconds
rapidgzip_decompress from file parallelization=8 took 0.106 seconds
rapidgzip_decompress from file parallelization=16 took 0.113 seconds
zlib_concat_decode took 0.058 seconds
zlibng_concat_decode took 0.035 seconds
zlib_multi_decompress took 0.119 seconds
libdeflate_zlib_decompress took 0.03 seconds

Here, rapidgzip is slower because the test data is only 4.9 MB, which will be split into a 4 MB and a 0.9 MB chunk and therefore use at most 2 threads. I have tested different ways of calling rapidgzip because it was slower than my command-line benchmarks. As you can see, a loop of read( 1 MiB ) calls is much faster than a single read call returning everything. I'm not sure how much I can do to resolve this in the Python interface, or whether it simply cannot work any better because returning everything at once leads to a large, slow memory allocation that does not fit into any cache.
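For reference, the chunked-read pattern from the faster benchmark rows looks roughly like this (a sketch; the file name and the parallelization value are placeholders):

import rapidgzip

with rapidgzip.open("concatenated-chunks.zlib", parallelization=8) as file:
    while True:
        chunk = file.read(1024 * 1024)  # read and discard 1 MiB at a time
        if not chunk:
            break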

I'm not yet sure whether to include the performance fix for this setup (looking for final deflate blocks) in rapidgzip because it might slow down all other use cases. Maybe I can do it more smartly, e.g., only look for final blocks when I didn't find any non-final ones, or only enable this via an interface flag... Or I could increase the chunk size to keep the relative block finder overhead constant, or maybe the final performance hit isn't as bad as feared, which seems to be the case in a test with 10x.silesia.tar.gz on my notebook.

mxmlnkn commented 9 months ago

Implemented with 0.12.0.