mrhooray / crc-rs

Rust implementation of CRC(16, 32, 64) with support of various standards
Apache License 2.0
187 stars 49 forks source link

SIMD-based implementations #109

Open valaphee opened 10 months ago

valaphee commented 10 months ago

The x86 implementation is based on Intel's paper about "Fast CRC Computation for Generic Polynomials using PCLMULQDQ Instruction"

baseline/baseline       time:   [563.00 ns 564.20 ns 565.65 ns]
                        thrpt:  [26.976 GiB/s 27.045 GiB/s 27.103 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
crc32/default           time:   [31.258 µs 31.264 µs 31.269 µs]
                        thrpt:  [499.70 MiB/s 499.78 MiB/s 499.87 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
crc32/nolookup          time:   [131.55 µs 131.57 µs 131.60 µs]
                        thrpt:  [118.73 MiB/s 118.76 MiB/s 118.78 MiB/s]
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  3 (3.00%) high severe
crc32/bytewise          time:   [31.266 µs 31.271 µs 31.276 µs]
                        thrpt:  [499.59 MiB/s 499.67 MiB/s 499.74 MiB/s]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
crc32/slice16           time:   [4.6168 µs 4.6195 µs 4.6223 µs]
                        thrpt:  [3.3011 GiB/s 3.3031 GiB/s 3.3050 GiB/s]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
crc32/simd              time:   [983.88 ns 985.53 ns 987.60 ns]
                        thrpt:  [15.450 GiB/s 15.483 GiB/s 15.509 GiB/s]
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) high mild
  6 (6.00%) high severe

It's about 4 times faster then Slice16, it's implementable for all algorithms, and theoretically requires no table (the remaining bytes could use nolookup). I know that crc32fast exists, but its not configurable, manually calculating the constants is annoying, and this implementation is about 1GB/s faster (crc32fast uses unaligned memory access)

SIMD will only be used when Crc<Simd<W>> is used, and supported by the target-features specified when compiling.

TODO

valaphee commented 5 months ago

I also combined the tests, to check all variants directly in correctness, especially for different widths, as they all use the same SIMD implementation.

I would postpone the other implementations to a later PR.

akhilles commented 5 months ago

Sorry for the churn, but I just merged the new Implementation API (https://github.com/mrhooray/crc-rs/pull/115). There shouldn't bee too many conflicts with this PR, but I can help rebase it if you'd like.

valaphee commented 5 months ago

Ah nice, no problem, I'll take a look today, with the feedback given

valaphee commented 5 months ago

I thought about renaming Simd to Clmul (carry-less multiplication variant), or should I stay with Simd?

And I would recommend against using Simd as default, as the problem with Simd is, that it's not possible at the moment, to do the crc calculation in a const fn.

valaphee commented 5 months ago

Should Simd/Clmul always be available even if the platform/target-features doesn't support the required features and then just be handled as a type alias for the default impl?

akhilles commented 5 months ago

I thought about renaming Simd to Clmul (carry-less multiplication variant), or should I stay with Simd?

I prefer Simd.

And I would recommend against using Simd as default, as the problem with Simd is, that it's not possible at the moment, to do the crc calculation in a const fn.

Ah, good point. Then, I think not touching the default behavior is fine.

akhilles commented 5 months ago

Should Simd/Clmul always be available even if the platform/target-features doesn't support the required features and then just be handled as a type alias for the default impl?

I think we should gate the Simd impl behind the platform/target-features and leave the default as is.

Once runtime detection is added, this can be revisited.

valaphee commented 5 months ago

Yep, Simd is probably easier to understand.

Open for future PRs would be:

can all be done in a non-breaking way.