intel / isa-l

Intelligent Storage Acceleration Library

riscv64: Implement optimised crc using zbc and zbb extensions #299

Open daniel-gregory opened 1 month ago

daniel-gregory commented 1 month ago

The RISC-V carryless-multiplication extension, Zbc, provides instructions that can be used to optimise the calculation of Cyclic Redundancy Checks (CRCs). This pull request creates a new RISC-V target for isa-l and provides optimised implementations of all the CRC16, CRC32 and CRC64 algorithms using these instructions, based on the approach described in Intel's whitepaper on the topic, "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction". The core loop, which folds four 128-bit chunks in parallel, is shared between all the algorithms.
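To illustrate the folding idea, here is a minimal software model in Python. It is not the code from this pull request: `clmul`, `polymod` and `fold` are hypothetical helper names, the CRC-32 polynomial is used only as an example, and the model ignores initial values, final XORs and bit reflection. It only shows the identity that folding relies on: the upper part of a message can be carry-less multiplied by a precomputed constant (x^n mod P) and XORed into the lower part without changing the remainder.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) product of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition in GF(2) is XOR, so no carries propagate
        a <<= 1
        b >>= 1
    return result


def polymod(value: int, poly: int) -> int:
    """Remainder of value divided by poly, both read as GF(2) polynomials."""
    width = poly.bit_length() - 1
    while value.bit_length() > width:
        # Align the top term of poly with the top term of value and cancel it.
        value ^= poly << (value.bit_length() - 1 - width)
    return value


# CRC-32 generator polynomial with the x^32 term included (non-reflected form).
POLY = 0x104C11DB7


def fold(high: int, low: int, low_bits: int) -> int:
    """Fold `high`, which sits `low_bits` bits above `low`, into `low`.

    Relies on (high * x^n + low) mod P == (high * (x^n mod P) + low) mod P,
    where * is carry-less multiplication and + is XOR.
    """
    k = polymod(1 << low_bits, POLY)  # a real kernel would precompute this
    return clmul(high, k) ^ low


# Split a 256-bit message into two 128-bit halves and fold the upper half down.
message = int.from_bytes(bytes(range(1, 33)), "big")
high, low = message >> 128, message & ((1 << 128) - 1)
folded = fold(high, low, 128)
```

The folded value is shorter than the original message but leaves the same remainder modulo the polynomial, which is why a loop can keep folding fixed-size chunks and do a single reduction at the end.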

This patch also requires the target to have the Zbb bit-manipulation extension, which provides a hardware byte-swap instruction (rev8). Byte swaps make up a fair part of the core folding loop for non-reflected CRCs.
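As a rough illustration of why the swap is needed, the Python sketch below models a 64-bit load on a little-endian machine. The function name `rev8` mirrors the Zbb instruction, but this is only a model and not code from the patch. A non-reflected CRC treats the first message byte as the highest-order coefficients, whereas a little-endian load places that byte in the lowest-order position.

```python
def rev8(value: int) -> int:
    """Model of a 64-bit byte reversal, as Zbb's rev8 performs on RV64."""
    return int.from_bytes(value.to_bytes(8, "little"), "big")


chunk = bytes([0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08])

# What a 64-bit load of `chunk` yields on a little-endian machine:
# the first byte in memory ends up in the least significant position.
loaded = int.from_bytes(chunk, "little")

# After the byte swap, the first message byte is most significant,
# which is the order a non-reflected CRC expects.
swapped = rev8(loaded)
```

Reflected CRCs already match the little-endian layout, so they can skip this step.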

On a MuseBook (1.6 GHz Spacemit X60), I gathered the following performance numbers. Throughput increased by around 20x for reflected algorithms and around 17x for normal (non-reflected) algorithms; the smaller gain for the latter is likely due to the extra endianness swap instructions needed.

| Algorithm | Throughput (MB/s) |
| --- | --- |
| Table (Base) | 206 |
| CRC16_t10dif_copy | 463 |
| CRC16_t10dif | 3855 |
| CRC32_gzip_refl | 4530 |
| CRC32_IEEE | 3855 |
| CRC32_iscsi | 4530 |
| CRC64_norm | 3856 |
| CRC64_refl | 4538 |

This patch doesn't currently have functionality for picking which version to use at runtime, as the CRC implementations for aarch64 and x86_64 do. The approach they use (reading either cpuid or hwcap) doesn't immediately translate to RISC-V. I have some ideas for alternative routes: either using the Linux RISC-V hwprobe interface, which would require an up-to-date kernel (v6.4+), or detecting support at build time with compiler flags (gcc/clang only, and it doesn't help with detection at runtime). It would be great to get your opinion on which approach would be preferred.
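For reference, the general shape of runtime selection can be modelled as below. This Python sketch is not isa-l's dispatcher and does not call hwprobe: `make_dispatcher` and `probe` are hypothetical names, `zlib.crc32` merely stands in for an optimised routine, and the probe is a plain callable that a real implementation would replace with an actual feature query.

```python
import zlib


def crc32_base(data: bytes) -> int:
    """Bit-at-a-time reflected CRC-32 (gzip polynomial), the portable fallback."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def crc32_fast(data: bytes) -> int:
    """Stand-in for an optimised implementation (assumption: zlib as a proxy)."""
    return zlib.crc32(data)


def make_dispatcher(probe):
    """Return a crc32 function that picks an implementation on its first call.

    `probe` is a hypothetical zero-argument callable that reports whether the
    required CPU extensions are available.
    """
    chosen = None

    def crc32(data: bytes) -> int:
        nonlocal chosen
        if chosen is None:
            # Query the CPU features once, then reuse the decision.
            chosen = crc32_fast if probe() else crc32_base
        return chosen(data)

    return crc32


crc32 = make_dispatcher(lambda: False)  # pretend the extensions are missing
checksum = crc32(b"123456789")
```

Both paths must return identical checksums, so the only thing the probe changes is speed. A build-time check would instead fix the choice when the library is compiled.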

pablodelara commented 1 day ago

Thanks @daniel-gregory! We decide the implementation to use at runtime, so it would be great to do the same here too, thanks!