mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

Add CRC32 calculation #5

Closed mxmlnkn closed 1 year ago

mxmlnkn commented 1 year ago

Similar to pugz, CRC32 is not yet implemented because it introduces performance and complexity overhead, and because, in my opinion, successfully reaching the end of the file already is quite a strong sanity check.

To parallelize the CRC32 check, the linearity of CRC32 might be used to combine independently computed per-chunk checksums. The index could also store checksums for each deflate block or chunk to add more fine-granular checks when an index exists.
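
For illustration, zlib already exposes this linearity via crc32_combine. A minimal sketch (the use of zlib here is just an example, not necessarily how pragzip would implement it):

#include <iostream>
#include <string>

#include <zlib.h>  // link with -lz

int main()
{
    // Checksum two chunks independently, e.g. one per chunk or deflate block in the index...
    const std::string first = "Hello, ";
    const std::string second = "World!\n";

    const auto crcFirst  = crc32( 0L, reinterpret_cast<const Bytef*>( first.data() ),
                                  static_cast<uInt>( first.size() ) );
    const auto crcSecond = crc32( 0L, reinterpret_cast<const Bytef*>( second.data() ),
                                  static_cast<uInt>( second.size() ) );

    // ...and combine them afterwards. Only the length of the second chunk is needed.
    const auto combined = crc32_combine( crcFirst, crcSecond, static_cast<z_off_t>( second.size() ) );

    // This should equal the CRC32 computed sequentially over the concatenated data.
    const auto sequential = crc32( crcFirst, reinterpret_cast<const Bytef*>( second.data() ),
                                   static_cast<uInt>( second.size() ) );
    std::cout << std::hex << combined << " == " << sequential << std::endl;
    return 0;
}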

Vadiml1024 commented 1 year ago

I've just added a naive SSE4.2-based CRC32 calculation. I see there is a lot of benchmarking code; any advice on using it to evaluate the performance of my patch?

mxmlnkn commented 1 year ago

I think something like CRC32 deserves its own benchmark file. I would generate some random data and then run CRC32 on it. Displaying the result as a bandwidth makes it comparable to all other benchmarks. Anything above 500 MB/s shouldn't impact gzip decompression (~250 MB/s) very much, except for non-compressed gzip files (> 4 GB/s).
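
A minimal sketch of such a benchmark, here using zlib's crc32 simply as a stand-in for the implementation under test (the buffer size and the use of zlib are assumptions, not pragzip's actual benchmark setup):

#include <chrono>
#include <iostream>
#include <random>
#include <vector>

#include <zlib.h>  // link with -lz

int main()
{
    // Generate random data for the benchmark.
    std::vector<unsigned char> data( 1ULL << 28 );  // 256 MiB
    std::mt19937_64 rng( 12345 );
    for ( auto& byte : data ) {
        byte = static_cast<unsigned char>( rng() );
    }

    // Time one pass over the buffer and report the result as a bandwidth,
    // which makes it comparable to the decompression benchmarks.
    const auto t0 = std::chrono::steady_clock::now();
    const auto checksum = crc32( 0L, data.data(), static_cast<uInt>( data.size() ) );
    const auto t1 = std::chrono::steady_clock::now();

    const auto seconds = std::chrono::duration<double>( t1 - t0 ).count();
    std::cout << "CRC32: 0x" << std::hex << checksum << std::dec << " -> "
              << static_cast<double>( data.size() ) / seconds / 1e6 << " MB/s\n";
    return 0;
}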

mxmlnkn commented 1 year ago

Anything above 500 MB/s shouldn't impact gzip decompression (~250 MB/s) very much, except

This is under the assumption that CRC32 itself can be computed in parallel and the separate results can simply be combined; otherwise, I'd like to see CRC32 bandwidths of 10 GB/s and more...

Also, I'm not sure whether SSE4 would be necessary. There are other algorithms like slice-by-N that I would also try. But this would be something nice for the benchmark file: comparing different CRC32 implementations against each other.

Vadiml1024 commented 1 year ago

It seems that my naive approach to accelerating CRC32 did not work. There is no significant difference in performance between the develop branch and my sse4 branch. This is my patch

mxmlnkn commented 1 year ago

How are you testing this? The CRC32 is only calculated for -P 1, I think. Try adding debug output or a breakpoint to test whether the code is even called. Or maybe you are using clang? Then -msse4.2 should also be added for clang. Maybe also try -march=native.
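
One quick way to verify that the SSE4.2 flag actually reaches the compiler is to check the predefined macro. A minimal sketch (not part of pragzip):

#include <iostream>

int main()
{
    // GCC and Clang define __SSE4_2__ when -msse4.2 (or a sufficient -march=...)
    // is in effect, so this shows whether the flag is active for this translation unit.
#ifdef __SSE4_2__
    std::cout << "SSE4.2 code path would be compiled in\n";
#else
    std::cout << "SSE4.2 code path would NOT be compiled in\n";
#endif
    return 0;
}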

Vadiml1024 commented 1 year ago

I'm using gcc for compiling. The command to test is as follows: (time pragzip -P 1 -v -o /dev/null ~/Downloads/usr1.tar.gz; time pragzip-sse4 -P 1 -v -o /dev/null ~/Downloads/usr1.tar.gz) | tee /tmp/pragzip-test.log

The result:

vadim@vadim-tp:~/work/pragzip/build$ cat /tmp/pragzip-test.log 
file path for input: /home/vadim/Downloads/usr1.tar.gz
file path for output: /dev/null
Decompress /home/vadim/Downloads/usr1.tar.gz -> /dev/null
Decompressed in total 10128721920 B in 127.137 s -> 79.6675 MB/s

real    2m7,148s
user    2m4,854s
sys     0m1,527s
file path for input: /home/vadim/Downloads/usr1.tar.gz
file path for output: /dev/null
Decompress /home/vadim/Downloads/usr1.tar.gz -> /dev/null
Decompressed in total 10128721920 B in 130.311 s -> 77.7271 MB/s

real    2m10,319s
user    2m8,273s
sys     0m1,567s
vadim@vadim-tp:~/work/pragzip/build$ 

With -P 0, the result is approximately 2x as fast on my machine (I have 4 cores).

mxmlnkn commented 1 year ago

These are not the speeds I'd want to see :/. Could you please try to run with -v? There should be a nice profiling summary when pragzip finishes. Also, could you try the develop branch (on indexed_bzip2)? You can simply add it as another remote to your existing pragzip clone.

Edit: OK, if you only have 4 (virtual) cores, i.e., 2 physical cores, then a speedup of 2 sounds realistic. A speedup of 2 might even be somewhat realistic for 4 cores, though. A direct comparison to gzip would be interesting. On my system, gzip can achieve ~200 MB/s per core.

Vadiml1024 commented 1 year ago

Actually, I branched my sse4 branch off your develop branch, so the pragzip command is built on the develop branch. I did use the -v flag; it seems there is not much output with -P 1, which I used to be sure that CRC32 is called. Attached is the output for the -P 0 case.

mxmlnkn commented 1 year ago

Yes, there is no profiling output for -P 1 because it uses the serial decompressor GzipReader directly instead of the huge code architecture required for parallel decompression (ParallelGzipReader). The log looks good on all accounts: most of the time is spent in actual gzip decompression and there are almost no deflate block cache misses. I'm not sure why the two timings vary so much. Either there are other background processes or the cores are throttled for some reason in the second run. There is no indication in the profiling data that anything algorithmic is different.

I guess you just have a "slow" CPU similar to one of the server CPUs running at 2 GHz, which I used for my scaling benchmarks up to 128 cores. A direct comparison to gzip would be interesting.

If pragzip -P 1 is slower than gzip, then it would be a long-term optimization problem because I'd have to touch the gzip decompression code again. However, when loading an index, zlib should be used anyway, which might be a bit faster for several reasons (again, this might only be true for -P != 1).

Vadiml1024 commented 1 year ago

Here are the gunzip results:

time gunzip  <~/Downloads/usr1.tar.gz >/dev/null

real    1m39,193s
user    1m37,509s
sys     0m1,435s

mxmlnkn commented 1 year ago

The intrinsic calculates CRC-32C instead of CRC-32. I'm not sure whether they can somehow be converted to each other; it doesn't seem so. Therefore, it probably cannot be used for the gzip CRC-32 check.

Vadiml1024 commented 1 year ago

In that case, why is pragzip not complaining?

mxmlnkn commented 1 year ago

See my comments in your linked patch commit. You only added the CRC32 for non-compressed blocks. There might be no non-compressed blocks in your test data.

mxmlnkn commented 1 year ago

Furthermore, I'm sorry about that, but I think I mentioned that CRC32 is not used for the parallel code, and I wasn't sure whether it is used for the serial code... Turns out it isn't :( See pragzip.cpp:

pragzip::GzipReader</* CRC32 */ false> gzipReader{ std::move( inputFile ) };

Before and after flipping that flag:

Decompressed in total 536870912 B in 2.21557 s -> 242.317 MB/s
> m pragzip && src/tools/pragzip -v -d -o /dev/null -P 1 test-files/small/small.gz
Decompressed in total 536870912 B in 2.81561 s -> 190.677 MB/s

22% slowdown caused by CRC32.

Vadiml1024 commented 1 year ago

Ahh, I see... But I suppose this slowdown is with the original CRC32 implementation. Well, given that CRC32 calculation is disabled by default, I wonder why one would need to enable it at all?

Anyway, I've recruited ChatGPT for help, which produced the following using the correct polynomial. I did not test it yet, but it seems correct at first glance. I'm not sure, however, where to add the call to it... In appendToWindow on the window overflow condition? But what to do with the last, not fully filled window?

#include <cstdint>
#include <cstddef>
#include <immintrin.h>

static constexpr uint32_t CRC_POLY = 0xEDB88320;

// Precalculate the CRC lookup table
static uint32_t CRC_TABLE[256];

static void init_crc_table() {
    for (uint32_t i = 0; i < 256; ++i) {
        uint32_t crc = i;
        for (int j = 0; j < 8; ++j) {
            if (crc & 1) {
                crc = CRC_POLY ^ (crc >> 1);
            } else {
                crc >>= 1;
            }
        }
        CRC_TABLE[i] = crc;
    }
}

// Update 8 bytes using AVX2 instructions
static inline __m256i crc32_avx2(__m256i crc, const uint8_t* data) {
    __m256i data_reg = _mm256_lddqu_si256((__m256i const*)data);

    // XOR the data with the current CRC value
    __m256i xor_data = _mm256_xor_si256(data_reg, crc);

    // Use a mask to select the correct element of the CRC table for each element of the vector
    __m256i mask = _mm256_set_epi32(0x03020100, 0x03020100, 0x03020100, 0x03020100,
                                    0x03020100, 0x03020100, 0x03020100, 0x03020100);
    __m256i lookup = _mm256_i32gather_epi32((const int*)CRC_TABLE, _mm256_and_si256(xor_data, _mm256_set1_epi32(0xFF)), 4);
    lookup = _mm256_permutevar8x32_epi32(lookup, mask);

    // Shift the CRC right by 8 bits
    crc = _mm256_srli_epi32(crc, 8);

    // XOR the lookup table with the shifted CRC value
    crc = _mm256_xor_si256(crc, lookup);

    return crc;
}

// Update the CRC value for the remaining data
uint32_t crc32_remainder_avx2(uint32_t crc, const uint8_t* data, size_t size) {
    __m256i crc_reg = _mm256_set1_epi32(crc ^ 0xFFFFFFFF);

    // Update 8 bytes at a time
    while (size >= 8) {
        crc_reg = crc32_avx2(crc_reg, data);
        data += 8;
        size -= 8;
    }

    // Extract the final CRC value from the vector
    crc_reg = _mm256_xor_si256(crc_reg, _mm256_set1_epi32(0xFFFFFFFF));
    uint32_t result = _mm256_extract_epi32(crc_reg, 0);
    return result;
}

mxmlnkn commented 1 year ago

As far as I understand, this SIMD version simply uses the AVX table lookup to speed up computation. But it would have to be benchmarked whether it is actually faster than using a larger table and no SIMD.

Btw, you can write c++ after the starting triple backticks to enable syntax highlighting for code in markdown.

In appendToWindow on the window overflow condition? But what to do with the last not fully filled window?

I'm not sure I follow. updateCRC32 is called two times in deflate.hpp. You would need to replace both. The problem I see is that currently it works byte-wise inside appendToWindow, which makes it hard to use it with SIMD.

Alternatively, you could check in which loops appendToWindow is called and do the CRC32 calculation after those loops with the correct data. This would have to be done in readInternalUncompressed and readInternalCompressed. All of this might get refactored by doing the CRC32 as a kind of post-processing step, either outside of deflate::Block or inside/after deflate::Block::read. read returns a range of newly decompressed bytes. In theory, we could therefore shim in between the read return value and do the CRC32 computation on that range. This might require another indirection: read -> readWithoutCRC32 -> readInternal... Note that the range may not be aligned, but SIMD requires aligned vector access, so not only the tail but also the head would have to be handled.
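
A rough sketch of that post-processing idea, assuming a reader whose readWithoutCRC32 returns the newly decompressed bytes (the function names follow the discussion above; the buffer-based interface and the bit-wise CRC are stand-ins, not the actual deflate::Block API):

#include <cstddef>
#include <cstdint>
#include <vector>

/* Bit-wise CRC-32 update as a stand-in for the real updateCRC32. */
uint32_t
updateCRC32( uint32_t crc, const uint8_t* data, size_t size )
{
    crc = ~crc;
    for ( size_t i = 0; i < size; ++i ) {
        crc ^= data[i];
        for ( int j = 0; j < 8; ++j ) {
            crc = ( crc & 1U ) ? ( crc >> 1 ) ^ 0xEDB88320U : ( crc >> 1 );
        }
    }
    return ~crc;
}

/* Shim: forward to the existing read implementation and checksum the returned
 * range of newly decompressed bytes as a post-processing step. */
template<typename Reader>
std::vector<uint8_t>
readWithCRC32( Reader& reader, size_t nBytesToRead, uint32_t& runningCrc32 )
{
    std::vector<uint8_t> decompressed = reader.readWithoutCRC32( nBytesToRead );
    runningCrc32 = updateCRC32( runningCrc32, decompressed.data(), decompressed.size() );
    return decompressed;
}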

Vadiml1024 commented 1 year ago

I was thinking about appendToWindow because it seems that all decompressed bytes go through this function. So the idea was: when the window is full, call updateCRC32. You don't think it will work?

mxmlnkn commented 1 year ago

Ah, OK. Well, the problem is the condition for when to trigger the CRC32 update. Only on wrap-around is insufficient, as you already pointed out. It seems to me that this can only be done from the calling sites of appendToWindow and, at that point, it would be better to wrap this around the public read call.

mxmlnkn commented 1 year ago

https://github.com/mxmlnkn/indexed_bzip2/commit/d23a0d7f88fa17e9080f967cab1a12263e6db620

This can be used as a testing ground because the hashes are printed out, and it can also be used for comparing benchmarks.

mxmlnkn commented 1 year ago

https://github.com/mxmlnkn/indexed_bzip2/commit/21af9b8a2909552361d851e6694991fe7f161878

Slice-by-N works wonders! It's mesmerizing to see that slice-by-N lookup tables can reach speeds similar to using SIMD intrinsics. This isn't even the first or second LUT used in pragzip; I make heavy use of them everywhere. Since this project, I've started to see lookup tables / CPU caches as something akin to FPGAs: you can define any arbitrary byte-to-byte or even word-to-word mapping/operation with a sufficiently fast and large L1 cache.
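
For reference, the core of the slicing idea looks roughly like this; a compact slice-by-4 sketch, not the code from the linked commit:

#include <array>
#include <cstddef>
#include <cstdint>

// Four 256-entry tables allow consuming 4 input bytes per iteration with
// independent lookups that the CPU can overlap.
static std::array<std::array<uint32_t, 256>, 4>
createCRC32SliceTables()
{
    std::array<std::array<uint32_t, 256>, 4> tables{};
    for ( uint32_t i = 0; i < 256; ++i ) {
        uint32_t crc = i;
        for ( int j = 0; j < 8; ++j ) {
            crc = ( crc & 1U ) ? ( crc >> 1 ) ^ 0xEDB88320U : ( crc >> 1 );
        }
        tables[0][i] = crc;
    }
    for ( size_t k = 1; k < 4; ++k ) {
        for ( size_t i = 0; i < 256; ++i ) {
            const auto previous = tables[k - 1][i];
            tables[k][i] = ( previous >> 8 ) ^ tables[0][previous & 0xFFU];
        }
    }
    return tables;
}

uint32_t
crc32SliceBy4( uint32_t crc, const uint8_t* data, size_t size )
{
    static const auto TABLES = createCRC32SliceTables();

    crc = ~crc;
    for ( ; size >= 4; data += 4, size -= 4 ) {
        // Endianness-independent little-endian load of the next 4 bytes.
        crc ^= static_cast<uint32_t>( data[0] )
               | ( static_cast<uint32_t>( data[1] ) << 8 )
               | ( static_cast<uint32_t>( data[2] ) << 16 )
               | ( static_cast<uint32_t>( data[3] ) << 24 );
        crc = TABLES[3][crc & 0xFFU]
              ^ TABLES[2][( crc >> 8 ) & 0xFFU]
              ^ TABLES[1][( crc >> 16 ) & 0xFFU]
              ^ TABLES[0][( crc >> 24 ) & 0xFFU];
    }
    for ( ; size > 0; ++data, --size ) {  // byte-wise tail
        crc = ( crc >> 8 ) ^ TABLES[0][( crc ^ *data ) & 0xFFU];
    }
    return ~crc;
}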

> m benchmarkCRC32 && src/benchmarks/benchmarkCRC32 

Initializing random data for benchmark... Done (1.43165 s)
[Compute CRC32 (LUT)]           ( min: 516.417, 517.3 +- 1.0, max: 519.571 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 4)]    ( min: 1402.55, 1414  +-   5, max: 1419.92 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 8)]    ( min: 2553.45, 2588  +-  19, max: 2618.4  ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 12)]   ( min: 3602.26, 3760  +-  60, max: 3808.64 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 16)]   ( min: 3869.64, 3970  +-  50, max: 4038.77 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 20)]   ( min: 2586.97, 2627  +-  23, max: 2644.93 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 24)]   ( min: 2956.9 , 2988  +-  12, max: 2997.68 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 32)]   ( min: 2736.25, 2806  +-  29, max: 2828.43 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 64)]   ( min: 2104.77, 2139  +-  13, max: 2150.09 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (_mm_crc32_u32)] ( min: 5212.63, 5280  +-  50, max: 5351.18 ) MB/s -> Result: 0xAFDBD4A7
[Compute CRC32 (_mm_crc32_u64)] ( min: 9012.49, 9700  +- 400, max: 10155.2 ) MB/s -> Result: 0xAFDBD4A7

I wonder whether some of those 32-bit operations could be implemented with 64-bit operations or even SIMD... The table lookup might even work with SIMD.

With explicit loop unrolling (and without -march=native):

Initializing random data for benchmark... Done (1.36239 s)
[Compute CRC32 (LUT)]         ( min: 525.228,  535 +-  6, max: 542.199 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 4)]  ( min: 1422.35, 1464 +- 15, max: 1478.13 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 8)]  ( min: 2644.90, 2668 +-  9, max: 2673.84 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 12)] ( min: 3926.02, 3978 +- 26, max: 4009.95 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 16)] ( min: 4546.23, 4630 +- 40, max: 4670.93 ) MB/s -> Result: 0xFBA351D8 <-
[Compute CRC32 (slice by 20)] ( min: 2628.45, 2653 +- 19, max: 2676.97 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 24)] ( min: 2967.99, 3100 +- 60, max: 3152.44 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 32)] ( min: 2791.56, 2829 +- 20, max: 2845.71 ) MB/s -> Result: 0xFBA351D8
[Compute CRC32 (slice by 64)] ( min: 2173.20, 2222 +- 21, max: 2238.27 ) MB/s -> Result: 0xFBA351D8

Vadiml1024 commented 1 year ago

Yes, impressive! By the way, the _mm_crc32_xxx results are different, why is that?

mxmlnkn commented 1 year ago

Because they compute CRC-32C (Castagnoli), which is something completely different and therefore unusable for pragzip. It uses 0x82F63B78 as the (reflected) generator polynomial, while CRC-32 uses 0xEDB88320.
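
To illustrate the mismatch, a small example comparing the SSE4.2 intrinsic against bit-wise implementations of both polynomials (the test data is arbitrary; compile with -msse4.2):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <nmmintrin.h>

// Generic bit-wise, reflected CRC with a configurable polynomial.
static uint32_t
bitwiseCRC( uint32_t polynomial, const uint8_t* data, size_t size )
{
    uint32_t crc = 0xFFFFFFFFU;
    for ( size_t i = 0; i < size; ++i ) {
        crc ^= data[i];
        for ( int j = 0; j < 8; ++j ) {
            crc = ( crc & 1U ) ? ( crc >> 1 ) ^ polynomial : ( crc >> 1 );
        }
    }
    return ~crc;
}

int main()
{
    const uint8_t data[] = { 'p', 'r', 'a', 'g', 'z', 'i', 'p' };

    uint32_t hardware = 0xFFFFFFFFU;
    for ( const auto byte : data ) {
        hardware = _mm_crc32_u8( hardware, byte );
    }
    hardware = ~hardware;

    // The first two values match (both are CRC-32C); the third differs (gzip's CRC-32).
    std::printf( "SSE4.2 intrinsic    : 0x%08X\n", hardware );
    std::printf( "CRC-32C (0x82F63B78): 0x%08X\n", bitwiseCRC( 0x82F63B78U, data, sizeof( data ) ) );
    std::printf( "CRC-32  (0xEDB88320): 0x%08X\n", bitwiseCRC( 0xEDB88320U, data, sizeof( data ) ) );
    return 0;
}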

mxmlnkn commented 1 year ago

This kind of CRC can be sped up with PCLMULQDQ, which exists in processors released after 2010.

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

Then again, this implementation in Rust "only" shows a bandwidth of 7.3 GB/s. That doesn't sound like much of a gain compared to the 4.6 GB/s I have reached with simple lookup tables. Although it might improve cache behavior by getting rid of those lookup tables. The lookup tables have a size of sizeof( uint32_t ) * 256 * slice multiple, which is 16 KiB for the fastest version. This should fit into most L1 caches, but combined with the 32 KiB buffer size for the inflated data, things are getting cramped.

mxmlnkn commented 1 year ago

Added with 08b453f. It adds ~5-6% overhead.

Further to do (might create another issue to track those):