Open akhilles opened 1 year ago
If you want a significant speedup with SIMD (especially on x86), you should implement the algorithm using carry-less multiplication.
I can give some pointers on how to do this in practice.
Hi @snakehand, I've been thinking about this a bit. Sadly, the paper you linked isn't available anymore (at least not under that link, and a quick Google search only turns up a whitepaper about carry-less multiplication for Galois/Counter Mode). Am I right in thinking the carry-less multiplication would be used somewhat like this? (Ignoring reflection etc. for conciseness.)
fn crc(poly: u64, mut crc: u64, bytes: &[u8]) -> u64 {
    let mut idx = 0;
    while bytes.len() - idx >= 8 {
        let next_data = load_u64(bytes, idx);
        let multiplicated: u128 = carry_less_mult(crc ^ next_data, poly);
        // Question: Are the higher bits of any significance anymore?
        // They are equivalent to what we shift out / throw away in the "normal" implementations, right?
        crc = lower_bits(multiplicated);
        idx += 8;
    }
    // deal with remainder
    crc
}
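For readers following along, here is a minimal sketch of what the two undefined helpers in the snippet above could look like in portable Rust. Both bodies are assumptions made for illustration: the big-endian byte order in `load_u64` is a guess, and the bit-by-bit loop in `carry_less_mult` is a slow stand-in for a hardware instruction. Note that the upper 64 bits of the product are not garbage: a CRC is a remainder modulo the polynomial, so the full 128-bit product still has to be reduced rather than truncated.

```rust
/// Hypothetical stand-in for `load_u64`: reads 8 bytes starting at `idx`.
/// Big-endian order is an assumption (it suits a non-reflected CRC).
fn load_u64(bytes: &[u8], idx: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[idx..idx + 8]);
    u64::from_be_bytes(buf)
}

/// Portable carry-less multiplication: multiplies `a` and `b` as
/// polynomials over GF(2), so partial products are combined with XOR
/// instead of addition and no carries propagate between bit positions.
fn carry_less_mult(a: u64, b: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= (a as u128) << i;
        }
    }
    acc
}
```

For example, `carry_less_mult(0b11, 0b11)` is `0b101`, because (x + 1)^2 = x^2 + 1 over GF(2); ordinary multiplication would give 9.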
The document is available here:
The speedup comes from applying carry-less multiplication to larger data units and using Barrett reduction to compute the final, smaller CRC.
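To make the Barrett step concrete, here is a rough sketch for a non-reflected CRC-32 with polynomial 0x04C11DB7. It is not the folding scheme from the linked document: it processes only one 32-bit word per step, uses a slow software `clmul`, and exists purely to show how two carry-less multiplications replace the bit-by-bit division. All function names are made up for this example, and `crc32_bitwise` is included only as a reference to compare against.

```rust
/// CRC-32 generator polynomial including the leading x^32 term (33 bits).
const POLY: u64 = 0x1_04C1_1DB7;

/// Software carry-less multiply (polynomial product over GF(2)).
fn clmul(a: u64, b: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= (a as u128) << i;
        }
    }
    acc
}

/// Barrett constant mu = floor(x^64 / poly), via polynomial long division.
fn barrett_mu(poly: u64) -> u64 {
    let mut rem: u128 = 1u128 << 64;
    let mut q: u64 = 0;
    for i in (0..=32).rev() {
        if (rem >> (i + 32)) & 1 == 1 {
            q |= 1u64 << i;
            rem ^= (poly as u128) << i;
        }
    }
    q
}

/// Bit-at-a-time reference (MSB first, no reflection, no final XOR).
fn crc32_bitwise(init: u32, bytes: &[u8]) -> u32 {
    let mut crc = init;
    for &b in bytes {
        crc ^= (b as u32) << 24;
        for _ in 0..8 {
            crc = if crc & 0x8000_0000 != 0 {
                (crc << 1) ^ (POLY as u32)
            } else {
                crc << 1
            };
        }
    }
    crc
}

/// Same CRC, one 32-bit word per step, reduced with Barrett reduction.
fn crc32_barrett(init: u32, bytes: &[u8]) -> u32 {
    let mu = barrett_mu(POLY);
    let mut crc = init;
    let mut chunks = bytes.chunks_exact(4);
    for chunk in &mut chunks {
        let word = u32::from_be_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        // The value to reduce is R = (crc ^ word) * x^32, so floor(R / x^32) = v.
        let v = (crc ^ word) as u64;
        // Estimated quotient: floor(v * mu / x^32).
        let q = (clmul(v, mu) >> 32) as u64;
        // R mod POLY = R ^ q * POLY; the low 32 bits of R are zero,
        // so the remainder is simply the low 32 bits of q * POLY.
        crc = clmul(q, POLY) as u32;
    }
    // Trailing bytes (fewer than 4) fall back to the bitwise loop.
    crc32_bitwise(crc, chunks.remainder())
}
```

Calling `crc32_barrett(init, data)` should return the same value as `crc32_bitwise(init, data)` for any input. A real implementation would replace `clmul` with a hardware instruction and fold much wider blocks before this final reduction.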
I've started preliminary work here (no SIMD yet, just understanding the algorithm): https://github.com/KillingSpark/crc-rs/tree/clmul
Interestingly, this is already ~2x faster than the current table-less implementation, even without any real optimization effort and without any SIMD. It might be worth using even when SIMD instructions aren't available for a specific target.
Blocked by https://github.com/rust-lang/rust/issues/86656.