SIMD-based implementations

valaphee commented 10 months ago

The x86 implementation is based on Intel's paper about "Fast CRC Computation for Generic Polynomials using PCLMULQDQ Instruction"

baseline/baseline       time:   [563.00 ns 564.20 ns 565.65 ns]
                        thrpt:  [26.976 GiB/s 27.045 GiB/s 27.103 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
crc32/default           time:   [31.258 µs 31.264 µs 31.269 µs]
                        thrpt:  [499.70 MiB/s 499.78 MiB/s 499.87 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
crc32/nolookup          time:   [131.55 µs 131.57 µs 131.60 µs]
                        thrpt:  [118.73 MiB/s 118.76 MiB/s 118.78 MiB/s]
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  3 (3.00%) high severe
crc32/bytewise          time:   [31.266 µs 31.271 µs 31.276 µs]
                        thrpt:  [499.59 MiB/s 499.67 MiB/s 499.74 MiB/s]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
crc32/slice16           time:   [4.6168 µs 4.6195 µs 4.6223 µs]
                        thrpt:  [3.3011 GiB/s 3.3031 GiB/s 3.3050 GiB/s]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
crc32/simd              time:   [983.88 ns 985.53 ns 987.60 ns]
                        thrpt:  [15.450 GiB/s 15.483 GiB/s 15.509 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

It's about 4 times faster then Slice16, it's implementable for all algorithms, and theoretically requires no table (the remaining bytes could use nolookup). I know that crc32fast exists, but its not configurable, manually calculating the constants is annoying, and this implementation is about 1GB/s faster (crc32fast uses unaligned memory access)

SIMD will only be used when Crc<Simd<W>> is used, and supported by the target-features specified when compiling.

TODO

[ ] 8 Normal domain
[ ] 16 Normal domain
[ ] 32 Normal domain
[ ] 64 Normal domain
[x] 8 Bit-reflected domain
[x] 16 Bit-reflected domain
[x] 32 Bit-reflected domain
[ ] 64 Bit-reflected domain
[x] x86
[ ] arm (NEON)

valaphee commented 5 months ago

I also combined the tests, to check all variants directly in correctness, especially for different widths, as they all use the same SIMD implementation.

I would postpone the other implementations to a later PR.

akhilles commented 5 months ago

Sorry for the churn, but I just merged the new Implementation API (https://github.com/mrhooray/crc-rs/pull/115). There shouldn't bee too many conflicts with this PR, but I can help rebase it if you'd like.

valaphee commented 5 months ago

Ah nice, no problem, I'll take a look today, with the feedback given

valaphee commented 5 months ago

I thought about renaming Simd to Clmul (carry-less multiplication variant), or should I stay with Simd?

And I would recommend against using Simd as default, as the problem with Simd is, that it's not possible at the moment, to do the crc calculation in a const fn.

valaphee commented 5 months ago

Should Simd/Clmul always be available even if the platform/target-features doesn't support the required features and then just be handled as a type alias for the default impl?

akhilles commented 5 months ago

I thought about renaming Simd to Clmul (carry-less multiplication variant), or should I stay with Simd?

I prefer Simd.

And I would recommend against using Simd as default, as the problem with Simd is, that it's not possible at the moment, to do the crc calculation in a const fn.

Ah, good point. Then, I think not touching the default behavior is fine.

akhilles commented 5 months ago

Should Simd/Clmul always be available even if the platform/target-features doesn't support the required features and then just be handled as a type alias for the default impl?

I think we should gate the Simd impl behind the platform/target-features and leave the default as is.

Once runtime detection is added, this can be revisited.

valaphee commented 5 months ago

Yep, Simd is probably easier to understand.

Open for future PRs would be:

ARM (pmull2) support
CRC1..32 non-reflex support
CRC33..128 support
Runtime feature detection (requires std or https://github.com/RustCrypto/utils/tree/master/cpufeatures)

can all be done in a non-breaking way.

mrhooray / crc-rs

SIMD-based implementations #109