Closed KillingSpark closed 1 year ago
What does the baseline measure? It looks faster than crc32_slice16.
Sorry should have mentioned that in the PR description too. The baseline just iterates the data and sums it up. It's meant to show how fast the machine can iterate the data and do one simple instruction per byte.
Should I add implementations for the other widths in this PR or do you prefer separate PRs?
My preference would be to keep this PR to just one width, and do the remaining widths in a second PR.
Should this be the default?
Yeah, I think it should be default. 16KiB is fairly small in terms of binary size and memory usage. I'm not sure about u128 though, the table would be 64KiB which may exceed L1 cache size and impact throughput.
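For reference, the table sizes mentioned here follow from 16 sub-tables of 256 entries each, multiplied by the width of one entry. A minimal sketch of that arithmetic, where `slice16_table_bytes` is a hypothetical helper and not part of the crate:

```rust
use std::mem::size_of;

/// Hypothetical helper (not part of the crate): size in bytes of a
/// slice-by-16 lookup table, i.e. 16 sub-tables of 256 entries of width W.
const fn slice16_table_bytes<W>() -> usize {
    16 * 256 * size_of::<W>()
}

fn main() {
    println!("u32:  {} KiB", slice16_table_bytes::<u32>() / 1024);
    println!("u64:  {} KiB", slice16_table_bytes::<u64>() / 1024);
    println!("u128: {} KiB", slice16_table_bytes::<u128>() / 1024);
}
```

Whether the u128 table actually falls out of L1 depends on the CPU, so this only shows where the 16KiB and 64KiB figures come from.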
Weirdly enough, crc32_slice16 is faster than baseline on my machine (i5-1240P):
checksum/baseline time: [3.7320 µs 3.7339 µs 3.7361 µs]
thrpt: [4.0841 GiB/s 4.0866 GiB/s 4.0887 GiB/s]
checksum/crc32_slice16 time: [2.9277 µs 2.9336 µs 2.9408 µs]
thrpt: [5.1887 GiB/s 5.2013 GiB/s 5.2119 GiB/s]
My preference would be to keep this PR to just one width, and do the remaining widths in a second PR.
Great :+1:
Weirdly enough, crc32_slice16 is faster than baseline on my machine (i5-1240P):
That's fascinating, because looking at the assembly for
data.iter().fold(0usize, |acc, v| acc.wrapping_add(*v as usize))
the compiler is unrolling the loop and everything. I'd assumed that this would surely be faster than doing the multiple xors and loading from the lookup tables.
Yeah, I think it should be default. 16KiB is fairly small in terms of binary size and memory usage. I'm not sure about u128 though, the table would be 64KiB which may exceed L1 cache size and impact throughput.
I'll do that soon then
Another question: Should I also add a version with no lookup table? It wouldn't be much work now and it would resolve the original issue #57
Should I also add a version with no lookup table?
That'd be great! But maybe in a follow-up PR.
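To make the idea concrete, a version without a lookup table processes the input bit by bit. The following is only a sketch for the common reflected CRC-32 (ISO-HDLC) parameters. The name `crc32_nolookup` is made up for illustration and is not the API discussed in this PR:

```rust
/// Reflected form of the CRC-32 (ISO-HDLC) polynomial.
const POLY: u32 = 0xEDB8_8320;

/// Hypothetical no-lookup CRC-32: eight shift/xor steps per input byte.
fn crc32_nolookup(data: &[u8]) -> u32 {
    let mut crc = !0u32; // initial value 0xFFFFFFFF
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // mask is all ones if the lowest bit is set, otherwise zero
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (POLY & mask);
        }
    }
    !crc // final xor with 0xFFFFFFFF
}

fn main() {
    // 0xCBF43926 is the standard check value for this CRC.
    assert_eq!(crc32_nolookup(b"123456789"), 0xCBF4_3926);
    println!("{:#010x}", crc32_nolookup(b"123456789"));
}
```

This trades throughput for zero table memory, which is the use case from the original issue.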
About the default algorithm: I think it's better to first get all the Slice16<W>, Bytewise<W>, and NoLookup<W> implementations and then for each width decide which implementation should be the default.
Duplicating as much code as I did in this PR should be avoidable by generating most of the impl Crc<...> {} blocks via a macro. Only the new() and update() functions differ between the slice16, bytewise, and nolookup implementations.
But for the sake of simplicity of this PR I'd only introduce that macro in the PR that adds multiple Slice16 impls if that's ok?
About the default algorithm: I think it's better to first get all the Slice16<W>, Bytewise<W>, and NoLookup<W> implementations and then for each width decide which implementation should be the default.
Sounds good.
Duplicating as much code as I did in this PR should be avoidable by generating most of the impl Crc<...> {} blocks via a macro. Only the new() and update() functions differ between the slice16, bytewise, and nolookup implementations.
The code duplication isn't ideal, but I'm also not a big fan of macro codegen. My preference would be to wait for const fns in traits to become stable and deal with the duplicated code for now.
Would you mind squashing the commits and adding a more detailed commit message (maybe copied from the PR description)? Then I think we can merge this. The changes LGTM.
I think it's easiest if you squash the commits when you merge the PR.
Introduce an implementation of the slice-by-16 algorithm that calculates the crc of 16 bytes in one step instead of iterating the input bytewise.
Would be a good squash commit message.
This introduces an implementation of the slice-by-16 algorithm that calculates the crc of 16 bytes in one step instead of iterating the input bytewise. As a tradeoff the necessary lookup-table is 16x bigger, so in case of a 32bit crc it's 16kB.
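For readers unfamiliar with the technique, here is a rough, self-contained sketch of slice-by-16 for the reflected CRC-32 (ISO-HDLC) parameters. It is not the code from this PR. The names `make_table`, `crc32_bytewise`, and `crc32_slice16` are made up for illustration, and the table is built at runtime here for brevity:

```rust
const POLY: u32 = 0xEDB8_8320; // reflected CRC-32 (ISO-HDLC) polynomial

/// table[k][i] is the CRC state after byte `i` followed by `k` zero bytes.
fn make_table() -> [[u32; 256]; 16] {
    let mut table = [[0u32; 256]; 16];
    for i in 0..256 {
        let mut crc = i as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ POLY } else { crc >> 1 };
        }
        table[0][i] = crc;
    }
    for i in 0..256 {
        for k in 1..16 {
            let prev = table[k - 1][i];
            table[k][i] = (prev >> 8) ^ table[0][(prev & 0xFF) as usize];
        }
    }
    table
}

/// Classic one-byte-at-a-time update, used for the tail and as a reference.
fn crc32_bytewise(table: &[[u32; 256]; 16], data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &b in data {
        crc = (crc >> 8) ^ table[0][((crc ^ u32::from(b)) & 0xFF) as usize];
    }
    !crc
}

/// Slice-by-16: consume 16 bytes per step using 16 independent lookups.
fn crc32_slice16(table: &[[u32; 256]; 16], data: &[u8]) -> u32 {
    let mut crc = !0u32;
    let mut chunks = data.chunks_exact(16);
    for c in &mut chunks {
        // The current state only affects the first four bytes of the chunk.
        let x = crc ^ u32::from_le_bytes([c[0], c[1], c[2], c[3]]);
        crc = table[15][(x & 0xFF) as usize]
            ^ table[14][((x >> 8) & 0xFF) as usize]
            ^ table[13][((x >> 16) & 0xFF) as usize]
            ^ table[12][(x >> 24) as usize];
        // Remaining bytes c[4]..c[15] use sub-tables 11 down to 0.
        for (j, &b) in c[4..].iter().enumerate() {
            crc ^= table[11 - j][b as usize];
        }
    }
    // Fewer than 16 bytes left: fall back to the bytewise step.
    for &b in chunks.remainder() {
        crc = (crc >> 8) ^ table[0][((crc ^ u32::from(b)) & 0xFF) as usize];
    }
    !crc
}

fn main() {
    let table = make_table();
    let data: Vec<u8> = (0..1000u32).map(|i| (i * 31 % 251) as u8).collect();
    // Both variants must agree; 0xCBF43926 is the standard check value.
    assert_eq!(crc32_slice16(&table, &data), crc32_bytewise(&table, &data));
    assert_eq!(crc32_slice16(&table, b"123456789"), 0xCBF4_3926);
    println!("{:#010x}", crc32_slice16(&table, &data));
}
```

The 16 lookups per chunk do not depend on each other, which is plausibly why the CPU can overlap them. That is an assumption about the observed speedup, not something measured in this sketch.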
Should I add implementations for the other widths in this PR or do you prefer separate PRs? I'd imagine review is simpler in separate PRs but merging the benchmarks might get annoying. I don't really care, I can do both.
Benchmark on a Ryzen 7 3800X:
Benchmark on a Mac Studio:
Open questions
Should this be the default?
I don't think a 16kB lookup table is too much, and this would automatically improve speeds in all crates that depend on this one. I could swap the logic pretty easily and make this the impl for Crc<u32> and provide a BytewiseU32 instead, that uses the current Crc<u32> implementation. Otherwise the improvements will probably take quite a while to actually reach end-users.
To-dos