mrhooray / crc-rs

Rust implementation of CRC(16, 32, 64) with support of various standards
Apache License 2.0
187 stars 49 forks source link

Add SIMD implementations if/when `std::simd` is stable #83

Open akhilles opened 1 year ago

akhilles commented 1 year ago

Blocked by https://github.com/rust-lang/rust/issues/86656.

snakehand commented 1 year ago

If you want to get significant speedup with SIMD, (especially on x86) you should implement the algorithm using carry-less multiplications.

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

I can help giving pointers how to do this practically.

KillingSpark commented 9 months ago

Hi @snakehand I've been thinking about this a bit, sadly the paper you linked isn't available anymore (at least not under the link and a quick google only turns up a whitepaper about carry less mult for galois counter mode) but am I right if I think the carry-less multiplication would be used somewhat like this? (Ignoring reflection etc etc for conciseness)

fn crc(poly: u64, crc: u64, bytes: &[u8]) -> u64 {
    let mut idx = 0;
    while bytes.len() - idx >= 8 {
        let next_data = load_u64(bytes, idx);
        let multiplicated: u128 = carry_less_mult(crc ^ next_data, poly);
        // Question: Are the higher bits of any significance anymore? 
        // They are equivalent to what we shift out / throw away in the "normal" implementations right?
        crc = lower_bits(multiplicated);
        idx += 8;
    }
    // deal with remainder
    crc
}
snakehand commented 9 months ago

The document is available here :

https://github.com/tpn/pdfs/blob/master/Fast%20CRC%20Computation%20for%20Generic%20Polynomials%20Using%20PCLMULQDQ%20Instruction%20-%20Intel%20(December%2C%202009).pdf

The speedup comes from using the carryless multiplication in bigger data units, and using Barett reduction to compute the final smaller CRC.

KillingSpark commented 8 months ago

Started doing preliminary work here, no simd yet just understanding the algorithm: https://github.com/KillingSpark/crc-rs/tree/clmul

Interestingly enough this is ~2x faster than the current table-less implementation even without any real thought on optimization and especially with the lack of any simd. Might be worth using this even if the simd instructions aren't available for a specific target.