mrhooray / crc-rs

Rust implementation of CRC(16, 32, 64) with support of various standards
Apache License 2.0
193 stars 49 forks source link

Add SIMD implementations if/when `std::simd` is stable #83

Open akhilles opened 1 year ago

akhilles commented 1 year ago

Blocked by https://github.com/rust-lang/rust/issues/86656.

snakehand commented 1 year ago

If you want to get significant speedup with SIMD, (especially on x86) you should implement the algorithm using carry-less multiplications.

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

I can help giving pointers how to do this practically.

KillingSpark commented 11 months ago

Hi @snakehand I've been thinking about this a bit, sadly the paper you linked isn't available anymore (at least not under the link and a quick google only turns up a whitepaper about carry less mult for galois counter mode) but am I right if I think the carry-less multiplication would be used somewhat like this? (Ignoring reflection etc etc for conciseness)

fn crc(poly: u64, crc: u64, bytes: &[u8]) -> u64 {
    let mut idx = 0;
    while bytes.len() - idx >= 8 {
        let next_data = load_u64(bytes, idx);
        let multiplicated: u128 = carry_less_mult(crc ^ next_data, poly);
        // Question: Are the higher bits of any significance anymore? 
        // They are equivalent to what we shift out / throw away in the "normal" implementations right?
        crc = lower_bits(multiplicated);
        idx += 8;
    }
    // deal with remainder
    crc
}
snakehand commented 11 months ago

The document is available here :

https://github.com/tpn/pdfs/blob/master/Fast%20CRC%20Computation%20for%20Generic%20Polynomials%20Using%20PCLMULQDQ%20Instruction%20-%20Intel%20(December%2C%202009).pdf

The speedup comes from using the carryless multiplication in bigger data units, and using Barett reduction to compute the final smaller CRC.

KillingSpark commented 11 months ago

Started doing preliminary work here, no simd yet just understanding the algorithm: https://github.com/KillingSpark/crc-rs/tree/clmul

Interestingly enough this is ~2x faster than the current table-less implementation even without any real thought on optimization and especially with the lack of any simd. Might be worth using this even if the simd instructions aren't available for a specific target.