KillingSpark commented 1 year ago

This introduces an implementation of the slice-by-16 algorithm that calculates the crc of 16 bytes in one step instead of iterating the input bytewise. As a tradeoff the necessary lookup-table is 16x bigger, so in case of a 32bit crc it's 16kB.

Should I add implementations for the other widths in this PR or do you prefer separate PRs? I'd imagine review is simpler in separate PRs but merging the benchmarks might get annoying. I don't really care, I can do both.

Benchmark on a Ryzen 7 3800X:

$ cargo bench 32

     Running benches/bench.rs (target/release/deps/bench-9c684e23a65bab14)
checksum/crc32          time:   [30.355 µs 30.357 µs 30.360 µs]
                        thrpt:  [514.66 MiB/s 514.70 MiB/s 514.75 MiB/s]

checksum/crc32_slice16  time:   [4.5740 µs 4.5812 µs 4.5893 µs]
                        thrpt:  [3.3249 GiB/s 3.3308 GiB/s 3.3360 GiB/s]

checksum/baseline       time:   [3.7496 µs 3.7515 µs 3.7537 µs]
                        thrpt:  [4.0650 GiB/s 4.0674 GiB/s 4.0694 GiB/s]

Benchmark on a Mac Studio:

checksum/crc32          time:   [41.819 µs 41.900 µs 41.966 µs]
                        thrpt:  [372.32 MiB/s 372.91 MiB/s 373.63 MiB/s]

checksum/crc32_slice16  time:   [4.6718 µs 4.6765 µs 4.6808 µs]
                        thrpt:  [3.2599 GiB/s 3.2629 GiB/s 3.2661 GiB/s]

checksum/baseline       time:   [1.1513 µs 1.1525 µs 1.1537 µs]
                        thrpt:  [13.226 GiB/s 13.240 GiB/s 13.254 GiB/s]

Open questions

Should this be the default?

I don't think a 16kB lookup table is too much and this would automatically improve speeds in all using crates. I could swap the logic pretty easily and make this the impl for Crc<u32> and provide a BytewiseU32 instead, that uses the current Crc<u32> implementation. Otherwise the improvements will probably take quite a while to actually reach end-users.

To-dos

[x] Need to remove the const generic trick used to build the tables because it's not supported in rust 1.46

meteor-lsw commented 1 year ago

what is the baseline mean? it looks like faster than crc32_slice16.

KillingSpark commented 1 year ago

Sorry should have mentioned that in the PR description too. The baseline just iterates the data and sums it up. It's meant to show how fast the machine can iterate the data and do one simple instruction per byte.

akhilles commented 1 year ago

Should I add implementations for the other widths in this PR or do you prefer separate PRs?

My preference would be to keep this PR to just one width, and do the remaining widths in a second PR.

Should this be the default?

Yeah, I think it should be default. 16KiB is fairly small in terms of binary size and memory usage. I'm not sure about u128 though, the table would be 64KiB which may exceed L1 cache size and impact throughput.

akhilles commented 1 year ago

Weirdly enough, crc32_slice16 is faster than baseline on my machine (i5-1240P):

checksum/baseline       time:   [3.7320 µs 3.7339 µs 3.7361 µs]
                        thrpt:  [4.0841 GiB/s 4.0866 GiB/s 4.0887 GiB/s]
checksum/crc32_slice16  time:   [2.9277 µs 2.9336 µs 2.9408 µs]
                        thrpt:  [5.1887 GiB/s 5.2013 GiB/s 5.2119 GiB/s]

KillingSpark commented 1 year ago

My preference would be to keep this PR to just one width, and do the remaining widths in a second PR.

Great :+1:

Weirdly enough, crc32_slice16 is faster than baseline on my machine (i5-1240P):

That's fascinating, because looking at the assembly for

data.iter().fold(0usize, |acc, v| acc.wrapping_add(*v as usize))

the compiler is unrolling the loop and everything. I'd assumed that this would surely be faster than doing the multiple xors and loading from the lookup tables.

Yeah, I think it should be default. 16KiB is fairly small in terms of binary size and memory usage. I'm not sure about u128 though, the table would be 64KiB which may exceed L1 cache size and impact throughput.

I'll do that soon then

KillingSpark commented 1 year ago

Another question: Should I also add a version with no lookup table? It wouldn't be much work now and it would resolve the original issue #57

akhilles commented 1 year ago

Should I also add a version with no lookup table?

That'd be great! But maybe in a follow-up PR.

KillingSpark commented 1 year ago

About the default algortihm: I think its better to first get all the the Slice16<W>, Bytewise<W>, and NoLookup<W> implementations and then for each width decide which implementation should be the default.

Duplicating as much code as I did in this PR should be avoidable by generating most of the impl Crc<...> {} blocks via a macro. Only the new() and update() functions differ between the slice16, bytewise, and nolookup implementations.

But for the sake of simplicity of this PR I'd only introduce that macro in the PR that adds multiple Slice16 impls if that's ok?

akhilles commented 1 year ago

About the default algortihm: I think its better to first get all the the Slice16<W>, Bytewise<W>, and NoLookup<W> implementations and then for each width decide which implementation should be the default.

Sounds good.

Duplicating as much code as I did in this PR should be avoidable by generating most of the impl Crc<...> {} blocks via a macro. Only the new() and update() functions differ between the slice16, bytewise, and nolookup implementations.

The code duplication isn't ideal, but I'm also not a big fan of macro codegen. My preference would be to wait for const fns in traits to become stable and deal with the duplicated code for now.

Would you mind squashing the commits and adding a more detailed commit message (maybe copy from PR desc.). Then I think we can merge this. The changes LGTM.

KillingSpark commented 1 year ago

I think it's easiest if you squash the commits when you merge the PR.

Introduce an implementation of the slice-by-16 algorithm that calculates the crc of 16 bytes in one step instead of iterating the input bytewise. Would be a good squash commit message.

mrhooray / crc-rs

Add slice16 #77

Open questions

Should this be the default?

To-dos