mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0
345 stars 7 forks source link

Support for other specialized gzip formats similar to BGZF #8

Closed mxmlnkn closed 1 month ago

mxmlnkn commented 1 year ago

Rapidgzip already supports BGZF files and also indexes created by indexed_gzip. Support for more obscure similar gzip file formats would be necessary because gzip files consisting only of multiple gzip streams with each one (final) deflate block will not be found by the block finder yet. A generic solution should add a third blockfinder that looks for gzip stream header magic bytes, which should be sufficiently fast.

qatzip also adds the length of the compressed block into the extra bytes of the header. Quotes from the ReadMe:

Note that it seems that it creates 16 KiB chunks (?) when using hardware acceleration, but when it has to fall back to the software implementation, it creates [512 MiB chunks](), I think. In a test with 4 GiB Random data, it created 9 (not 8... dunno why) gzip streams for the same file.

I was not able to build any of qatzip, qatlib, or quickassist (neither v2.0 or v1.7) from source. The former because of missing quickassist and the latter because of compile errors because of Linux kernel incompatibilities to 6.x (when compiling v1.7) or other errors I did not understand when compiling v2.0. But it can be installed on AlmaLinux via yum:

docker run -it almalinux:9 bash
> yum install -y qatzip
> for (( i = 0; i < 1024; ++i )); do head -c $(( 4 * 1024 * 1024 )) /dev/urandom | base64 | qzip -O gzipext > base64.$i.gz; done
> for (( i = 0; i < 1024; ++i )); do cat base64.$i.gz; done > base64-4GiB.gz
docker cp eaaab87d9c37:/base64-4GiB.gz base64-4GiB.qgz
rapidgzip --analyze base64-4GiB.qgz | grep -i -A 10 'Gzip header:'

Partial output:

Gzip header:
    Gzip Stream Count   : 529
    Compressed Offset   : 2273194337 B 0 b
    Uncompressed Offset : 2991644304 B
    Modification Time   : 0
    OS                  : unknown
    Flags               : none
    Extra               : 12 B: QZ\x08\x00\xc9tV\x00\xd8\xb1A\x00

For some reason, appending with >> does not work with qatzip. I think it might hijack the output file and clear it somehow. That's why I used this roundabout way.

mgzip and its fork pgzip seem to have the same format as BGZF but it has a different ID IG instead of BC. This should be trivial to support.

There also is an identically named pgzip Go module, but as far as I can see it adds no metadata that might help in decompression, however, it does create independent small gzip streams, so decompression should be able to use ISA-l, if further optimized. Currently, it requires the decoded size or the exact encoded stopping offset, but if the stopping condition is implemented correctly or if we can defer to ISA-l inside DeflateBlock when there are no markers and no statistics requested, then it should also help very much. That should also benefit the base64 example.

gztool writes out an index similar but different to the one created by indexed_gzip.

pgzf seemed like a very young project but it is part of wtdbg2, which can be installed with apt. I don't see the difference between it and bgzip and why I would use it instead of bgzip. It also has its own format which seems to be a mix of a concatenated index and gzip file header metadata pointing to it.

PGZF fellows standard GZIP format (rfc1952), and is blocked compressed. It defines two TAGs in each GZIP header, ZS: block size, ZX: random access index. Program pgzf can decompress .pgzf and .gz files. When decompressing .gz files, pgzf is in fact a buffered gzip reader. Also, .pgzf files can be decompressed by program gzip.

migz is very similar to the others and stores magic bytes MZ followed by the compressed size.

Bgzip has an index file BGZI, ~which I currently do not support because parsing a bgzf is sufficiently fast~. But, loading the index might have advantages for very large files stored on HDDs as opposed to SSDs. Also, it makes random access possible without lots of seeks and reads to the whole archive.

The file contents are:

 uint64_t number_entries

followed by number_entries pairs of:

 uint64_t compressed_offset
 uint64_t uncompressed_offset

Unfortunately, it does not seem to have a magic byte header. Instead, we can check that the file size is equal to 2 8 number_entries + 8.

Bgzip index import support has been added in 5682ef167f404255684c1c46bc6f3d4928360122.

gzp generates BGZF or the mgzip format, so no additional support work is needed for this.

Dictzip. Same as the others with magic bytes 'RA' (random access).

Stalled or small projects:

mxmlnkn commented 1 year ago

A generic solution should add a third blockfinder that looks for gzip stream header magic bytes, which should be sufficiently fast.

It might be even easier to simply skip the final block bit flag. This would "only" double the number of false positives. Still, it would negatively impact overall performance, so it needs to be benchmarked. Looking for gzip headers also would negatively impact performance and needs to be benchmarked. A first clue would be given by the gzip header finder speed.

mxmlnkn commented 11 months ago

Simply allowing the final block bit to be set, almost halves the block finder performance! Because of this, looking for gzip headers might be faster then even if it requires a second pass over the data.

Requiring final block flag to be 0

m benchmarkGzipBlockFinder && src/benchmarks/benchmarkGzipBlockFinder
[findDeflateBlocksRapidgzipLUT with 13 bits, Walk Tree Compressed LUT] ( 63.1 <= 63.8 +- 0.4 <= 64.4 ) MB/s (candidates: 495)
[findDeflateBlocksRapidgzipLUT with 14 bits, Walk Tree Compressed LUT] ( 64.42 <= 65.00 +- 0.22 <= 65.14 ) MB/s (candidates: 495)
[findDeflateBlocksRapidgzipLUT with 15 bits, Walk Tree Compressed LUT] ( 60.5 <= 63.8 +- 1.6 <= 64.7 ) MB/s (candidates: 495)
[findDeflateBlocksRapidgzipLUT with 16 bits, Walk Tree Compressed LUT] ( 65.22 <= 65.54 +- 0.22 <= 66.04 ) MB/s (candidates: 495)
[findDeflateBlocksRapidgzipLUT with 17 bits, Walk Tree Compressed LUT] ( 63.2 <= 65.0 +- 0.9 <= 66.1 ) MB/s (candidates: 495)
[findDeflateBlocksRapidgzipLUT with 18 bits, Walk Tree Compressed LUT] ( 64.86 <= 65.02 +- 0.11 <= 65.27 ) MB/s (candidates: 495)
[findDeflateBlocksRapidgzipLUT with 13 bits, Walk Tree LUT] ( 66.29 <= 66.91 +- 0.29 <= 67.33 ) MB/s (candidates: 495)
[findDeflateBlocksRapidgzipLUT with 14 bits, Walk Tree LUT] ( 66.54 <= 67.04 +- 0.25 <= 67.51 ) MB/s (candidates: 495)
[findDeflateBlocksRapidgzipLUT with 15 bits, Walk Tree LUT] ( 65.8 <= 66.9 +- 1.2 <= 70.1 ) MB/s (candidates: 495)
[findDeflateBlocksRapidgzipLUT with 16 bits, Walk Tree LUT] ( 68.0 <= 69.0 +- 0.4 <= 69.4 ) MB/s (candidates: 495)
[findDeflateBlocksRapidgzipLUT with 17 bits, Walk Tree LUT] ( 66.9 <= 67.9 +- 0.5 <= 68.3 ) MB/s (candidates: 495)
[findDeflateBlocksRapidgzipLUT with 18 bits, Walk Tree LUT] ( 62.3 <= 65.4 +- 2.3 <= 67.6 ) MB/s (candidates: 495)

Ignoring final block flag

[findDeflateBlocksRapidgzipLUT with 13 bits, Walk Tree Compressed LUT] ( 34.7 <= 37.3 +- 0.9 <= 38.0 ) MB/s (candidates: 496)
[findDeflateBlocksRapidgzipLUT with 14 bits, Walk Tree Compressed LUT] ( 36.6 <= 37.2 +- 0.4 <= 37.6 ) MB/s (candidates: 496)
[findDeflateBlocksRapidgzipLUT with 15 bits, Walk Tree Compressed LUT] ( 35.9 <= 36.8 +- 0.6 <= 38.0 ) MB/s (candidates: 496)
[findDeflateBlocksRapidgzipLUT with 16 bits, Walk Tree Compressed LUT] ( 37.49 <= 37.96 +- 0.27 <= 38.31 ) MB/s (candidates: 496)
[findDeflateBlocksRapidgzipLUT with 17 bits, Walk Tree Compressed LUT] ( 36.0 <= 37.8 +- 0.7 <= 38.3 ) MB/s (candidates: 496)
[findDeflateBlocksRapidgzipLUT with 18 bits, Walk Tree Compressed LUT] ( 36.5 <= 37.0 +- 0.4 <= 37.6 ) MB/s (candidates: 496)
[findDeflateBlocksRapidgzipLUT with 13 bits, Walk Tree LUT] ( 37.7 <= 38.7 +- 0.4 <= 39.1 ) MB/s (candidates: 496)
[findDeflateBlocksRapidgzipLUT with 14 bits, Walk Tree LUT] ( 37.44 <= 38.03 +- 0.24 <= 38.30 ) MB/s (candidates: 496)
[findDeflateBlocksRapidgzipLUT with 15 bits, Walk Tree LUT] ( 35.8 <= 37.6 +- 1.1 <= 38.7 ) MB/s (candidates: 496)
[findDeflateBlocksRapidgzipLUT with 16 bits, Walk Tree LUT] ( 37.4 <= 37.8 +- 0.3 <= 38.3 ) MB/s (candidates: 496)
[findDeflateBlocksRapidgzipLUT with 17 bits, Walk Tree LUT] ( 35.3 <= 36.2 +- 0.6 <= 37.1 ) MB/s (candidates: 496)
[findDeflateBlocksRapidgzipLUT with 18 bits, Walk Tree LUT] ( 34.6 <= 35.8 +- 0.6 <= 36.6 ) MB/s (candidates: 496)
diff --git a/src/rapidgzip/blockfinder/DynamicHuffman.hpp b/src/rapidgzip/blockfinder/DynamicHuffman.hpp
index b66e0935..0c8778de 100644
--- a/src/rapidgzip/blockfinder/DynamicHuffman.hpp
+++ b/src/rapidgzip/blockfinder/DynamicHuffman.hpp
@@ -35,10 +35,9 @@ isDeflateCandidate( uint32_t bits )
     if constexpr ( bitCount == 0 ) {
         return false;
     } else {
-        /* Bit 0: final block flag */
-        const auto isLastBlock = ( bits & 1U ) != 0;
+        /* Bit 0: final block flag. May be 0 or 1. */
         bits >>= 1U;
-        bool matches = !isLastBlock;
+        bool matches = true;
         if constexpr ( bitCount <= 1U ) {
             return matches;
         }