mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

Reduce memory footprint for highly compressed data #2

Closed. mxmlnkn closed this 1 year ago.

mxmlnkn commented 1 year ago

The most important missing "feature" is any kind of memory limit control for very high compression ratios. This is what's stopping me from integrating it into ratarmount.

The most straightforward idea would be to modify the result type so that it also contains the information necessary for resuming decoding in case a limit has been reached. With this, threads could try to decode their assigned work and, in case it exceeds e.g. a decompression factor of 10, simply stop decoding and return that resume data instead. The orchestrator thread can then resume decompressing in a sequential, streaming manner, so that it works even with compression ratios up to the theoretical limit of 1032x.

It sounds pretty straightforward, but I need a weekend and some motivation, and it might require some refactoring in order to do the decompression outside of the normal thread pool. Maybe it could even be hidden inside a Result-Object interface that normally simply copies the internal buffers but can also resume decompression.
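
Very roughly, such a result type could look like the sketch below. All names here are made up for illustration and are not the actual rapidgzip interface; the only point is that a worker returns either finished data or the state needed to continue.

    // Sketch only: a chunk result that can carry resume state when a worker stops early.
    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct ResumeState
    {
        size_t encodedBitOffset{ 0 };       // bit position at which the worker stopped decoding
        std::vector<uint8_t> lastWindow;    // last 32 KiB of output, needed to continue deflate
    };

    struct ChunkResult
    {
        std::vector<uint8_t> data;          // decompressed bytes produced before hitting the limit
        std::optional<ResumeState> resume;  // set if and only if the decompression limit was hit
    };

    // Worker-side check, e.g., stop once the output exceeds 10x the encoded chunk size.
    inline bool exceedsDecompressionLimit( size_t decodedSize, size_t encodedSize )
    {
        return decodedSize > 10 * encodedSize;
    }

The orchestrator would then check resume after each chunk and, if it is set, finish that chunk sequentially in a streaming manner before handing out further work.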

As a real-world test archive, I currently have the CTU-13 dataset recompressed from bz2 to gz. This file contains quite a lot of blocks with compression ratios around 80x, and 2 even with 166x. These not only slow down parallel decoding because of work-balancing issues, they also lead to memory usage of >30 GB when using 24 threads, which on my system results in an out-of-memory exception.

Some statistics as derived with pragzip --analyze:

== Encoded Block Size Distribution ==

     0 bits |==================== (60243)
            |===                  (9993)
            |=                    (3443)
            |                     (1334)
            |=============        (39667)
            |=                    (4368)
            |                     (75)
480182 bits |==                   (8527)

== Decoded Block Size Distribution ==

          0 Bytes |==================== (105239)
                  |                     (281)
                  |                     (9)
                  |                     (5)
                  |                     (7)
                  |                     (2570)
                  |                     (667)
3.48131e+06 Bytes |===                  (18872)

== Compression Ratio Distribution ==

9.985988e-01 Bytes |==================== (96008)
                   |=                    (9338)
                   |                     (788)
                   |                     (2654)
                   |===                  (18860)
                   |
                   |
1.659447e+02 Bytes |                     (2)

== Deflate Block Compression Types ==

Uncompressed : 135
Dynamic Huffman : 127515

Of course, a simple run of zeros compressed with gz would also serve as a test case.

A much more involved implementation could gradually try to split blocks when high compression ratios are detected (as early as possible). This would also fix the problem with work distribution. However, it might require a redesign of the block index that is used for prefetching and for decompressing gzip blocks for the very first time in the initial run. It was already a nightmare to get it correct with the simple approach of:

  1. assign actual block indexes for the first n known/decoded blocks
  2. assign uniformly distributed offsets (< file size) to unknown block index ranges

Maybe splitting the block index into a pair to denote these different "phase state ranges" could help simplify prefetching in this regard, but it would become even more complicated with the dynamic partitioning for work balancing.
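
For reference, the simple two-step approach above roughly amounts to the lookup sketched below. BlockMap and its members are placeholders, not the real data structures, and the assumed fixed spacing only stands in for "uniformly distributed offsets".

    // Sketch of the simple two-phase mapping from block index to encoded offset.
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct BlockMap
    {
        std::vector<size_t> knownOffsets;         // exact offsets of the first n known/decoded blocks
        size_t fileSize{ 0 };                     // compressed file size in bytes
        size_t spacing{ 4UL * 1024UL * 1024UL };  // assumed spacing for not-yet-decoded ranges

        size_t blockIndexToOffset( size_t index ) const
        {
            // Step 1: the first n blocks have exact, already decoded offsets.
            if ( index < knownOffsets.size() ) {
                return knownOffsets[index];
            }
            // Step 2: guess uniformly spaced offsets (< file size) after the last known one.
            const auto lastKnown = knownOffsets.empty() ? size_t( 0 ) : knownOffsets.back();
            const auto guessed = lastKnown + ( index - knownOffsets.size() + 1 ) * spacing;
            return std::min( guessed, fileSize == 0 ? size_t( 0 ) : fileSize - 1 );
        }
    };

Dynamically splitting blocks would mean refining the guesses from step 2 on the fly, which is where the additional complexity for prefetching and work balancing would come from.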

mxmlnkn commented 1 year ago

There is another use case that needs to be included in the considerations: a single deflate block that is several gigabytes large. This can happen when compressing with igzip -0, e.g.:

base64 /dev/urandom | head -c $(( 4*1024*1024*1024 )) > 4GiB-base64
igzip -0 -k -c 4GiB-base64 > 4GiB-base64.igz
igzip --version
    igzip command line interface 2.30.0
pragzip --analyze 4GiB-base64.igz

The output shows a single Dynamic Huffman deflate block that contains the whole 4 GiB of data!

Gzip header:
    Gzip Stream Count   : 1
    Compressed Offset   : 0 B 0 b
    Uncompressed Offset : 0 B
    File Name           : 4GiB-base64
    Modification Time   : 1677534225
    OS                  : Unix
    Flags               : none

Deflate block:
    Final Block             : False
    Compression Type        : Dynamic Huffman
    File Statistics:
        Total Block Count   : 1
        Compressed Offset   : 22 B 0 b
        Uncompressed Offset : 0 B
    Gzip Stream Statistics:
        Block Count         : 1
        Compressed Offset   : 22 B 0 b
        Uncompressed Offset : 0 B
    Compressed Size         : 4557139977 B 4 b
    Uncompressed Size       : 4294967296 B
    Compression Ratio       : 0.94247
    Huffman Alphabets:
        Precode  : 14 CLs in [2, 7] out of 19: CL:Count, 0:5, 2:2, 3:1, 4:5, 6:2, 7:4, 
        Distance : 30 CLs in [3, 9] out of 30: CL:Count, 3:1, 4:7, 5:11, 6:4, 7:2, 8:3, 9:2, 
        Literals : 286 CLs in [2, 15] out of 286: CL:Count, 2:1, 4:2, 5:3, 6:8, 7:13, 8:21, 9:54, 10:63, 11:112, 12:2, 13:3, 15:4, 
    Symbol Types:
        Literal         : 4287250842 (99.9552 %)
        Back-References : 1921508 (0.044799 %)

Deflate block:
    Final Block             : True
    Compression Type        : Fixed Huffman
    File Statistics:
        Total Block Count   : 2
        Compressed Offset   : 4557139999 B 4 b
        Uncompressed Offset : 4294967296 B
    Gzip Stream Statistics:
        Block Count         : 2
        Compressed Offset   : 4557139999 B 4 b
        Uncompressed Offset : 4294967296 B
    Compressed Size         : 1 B 2 b
    Uncompressed Size       : 0 B
    Compression Ratio       : 0
    Symbol Types:
        Literal         : 0 (-nan %)
        Back-References : 0 (-nan %)

Validated CRC32 0x76658628 for gzip stream!
Bit reader EOF reached at 4557140009 B 0 b

== Benchmark Profile (Cumulative Times) ==

readDynamicHuffmanCoding : 0.0059761 s (0.0197675 %)
readData                 : 30.226 s (99.9802 %)
Dynamic Huffman Initialization in Detail:
    Read precode       : 3.95e-06 s (0.0660966 %)
    Create precode HC  : 0.00459211 s (76.8413 %)
    Apply precode HC   : 1.99e-06 s (0.0332993 %)
    Create distance HC : 3.17e-05 s (0.530446 %)
    Create literal HC  : 0.00134635 s (22.5289 %)

Not only does this lead to memory usage growing as large as the whole file, it also makes parallelization conceptually impossible.

mxmlnkn commented 1 year ago

The case of large compression ratios but relatively small deflate blocks has been resolved with https://github.com/mxmlnkn/indexed_bzip2/commit/a0ed0e5296b94f093b9b99fc6f9bea90d4fce5e6. However, the case of a single large deflate block might still lead to out-of-memory exceptions. The problem here is that decompression cannot simply be stopped and then resumed by the cache-prefetch hierarchy. Instead, ChunkData would have to be amended to be able to resume decompression. In some respect, we would basically have to put a whole non-parallel GzipReader object into the chunk data in that case and use it to stream out the result, or even emulate seeking by beginning decompression from the start.
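
Very roughly, such a resumable chunk could look like the sketch below. StreamingChunkData and readRange are invented names and not the actual ChunkData interface; the point is only that the chunk stores a way to re-decode the block instead of storing all of its decompressed bytes.

    // Sketch only: stream the data of an oversized deflate block out on demand instead of
    // keeping it all in memory. "Seeking" is emulated by re-decoding from the block start.
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    struct StreamingChunkData
    {
        size_t encodedOffset{ 0 };  // offset of the oversized deflate block in the compressed file
        // Re-decodes the block from its start, skips `offset` decompressed bytes, and returns
        // up to `size` bytes. In practice this would wrap a non-parallel GzipReader as
        // described above.
        std::function<std::vector<uint8_t>( size_t offset, size_t size )> readRange;
    };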

I think it might be sufficient for a production-ready version to simply throw an exception when a single deflate block larger than the maximum decompressed chunk size has been detected. This should be a sufficiently rare case that we can get away with not supporting it. But at least the exception would avoid triggering the OOM killer and doing harm to the system.

An exception is now thrown for deflate blocks decompressing to more than 256 MiB: https://github.com/mxmlnkn/indexed_bzip2/commit/1f4eac8e1467e0c64cc3bd623eff70a1609dad94. Note that simple runs of zeros created 8 kiB deflate blocks decompressing to 8 MiB each when compressed with GNU gzip at default settings.
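
For illustration, the safeguard boils down to a size check like the one sketched below; the function name, exception type, and message are invented here, and the actual check lives in the linked commit.

    // Sketch only: abort instead of running into the OOM killer when a single deflate block
    // decompresses to more than the maximum decompressed chunk size.
    #include <cstddef>
    #include <stdexcept>

    constexpr size_t MAX_DECOMPRESSED_CHUNK_SIZE = 256UL * 1024UL * 1024UL;  // 256 MiB

    void checkDeflateBlockSize( size_t decompressedSizeSoFar )
    {
        if ( decompressedSizeSoFar > MAX_DECOMPRESSED_CHUNK_SIZE ) {
            throw std::runtime_error( "A single deflate block decompresses to more than the "
                                      "maximum chunk size and cannot be decompressed with "
                                      "bounded memory." );
        }
    }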