riscv64: Implement optimised crc using zbc and zbb extensions

The RISC-V carryless-multiplication extension, Zbc, provides instructions that can be used to optimise the calculation of Cyclic Redundancy Checks (CRCs). This pull request creates a new RISC-V target for isa-l and provides optimised implementations of all the CRC16, CRC32 and CRC64 algorithms using these instructions. The implementations follow the approach described in Intel's whitepaper on the topic, "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction". The core loop, which folds four 128-bit chunks in parallel, is shared between all the algorithms.
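To illustrate the folding idea, here is a minimal C sketch. It is not the code in this patch: `clmul` and `clmulh` are software models of the Zbc instructions of the same name, `mod128` and `fold_then_reduce` are hypothetical helpers, and the example folds a single 64-bit word rather than four 128-bit chunks. It shows that multiplying the upper word by a precomputed constant (x^64 mod P) and XORing it into the lower word leaves the remainder unchanged, which is the property the folding loop relies on. CRC details such as initial values and final XORs are left out.

```c
#include <stdint.h>

/* Software model of Zbc clmul: low 64 bits of the carry-less
 * (GF(2) polynomial) product of a and b. */
static uint64_t clmul(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/* Software model of Zbc clmulh: high 64 bits of the same product. */
static uint64_t clmulh(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 1; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a >> (64 - i);
    return r;
}

/* Reference only: bit-at-a-time remainder of the 128-bit polynomial
 * hi:lo modulo P = x^32 + poly (non-reflected bit order). */
static uint32_t mod128(uint64_t hi, uint64_t lo, uint32_t poly)
{
    uint32_t rem = 0;
    for (int i = 127; i >= 0; i--) {
        uint32_t in = (uint32_t)((i >= 64) ? ((hi >> (i - 64)) & 1)
                                           : ((lo >> i) & 1));
        uint32_t carry = rem >> 31;   /* coefficient that becomes x^32 */
        rem = (rem << 1) | in;
        if (carry)
            rem ^= poly;              /* subtract P over GF(2) */
    }
    return rem;
}

/* Fold hi into lo: hi*x^64 + lo is congruent to hi*K + lo (mod P),
 * where K = x^64 mod P is a constant that depends only on P. */
static uint32_t fold_then_reduce(uint64_t hi, uint64_t lo, uint32_t poly)
{
    uint64_t k = mod128(1, 0, poly);      /* K = x^64 mod P */
    uint64_t p_hi = clmulh(hi, k);        /* high half of hi*K */
    uint64_t p_lo = clmul(hi, k) ^ lo;    /* low half, with lo folded in */
    return mod128(p_hi, p_lo, poly);
}
```

For any inputs, `fold_then_reduce(hi, lo, poly)` returns the same value as `mod128(hi, lo, poly)`. A real implementation precomputes the constants for larger fold distances and applies the fold repeatedly before a single final reduction.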

This patch also requires the target to have the Zbb bit-manipulation extension, which provides a hardware byte-swap instruction (rev8). Endianness swaps make up a fair part of the core folding loop for non-reflected CRCs.
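As a rough illustration of why the swap is needed, the sketch below models it in portable C. The helper names (`bswap64`, `load_le64`, `load_msb_first64`) are hypothetical and not taken from the patch. A non-reflected CRC treats the first byte of the buffer as the most significant coefficients, whereas a load on a little-endian machine places that byte in the least significant position, so each loaded word has to be byte-reversed before it is folded.

```c
#include <stdint.h>

/* Portable byte reversal of a 64-bit word. With Zbb available, a
 * compiler would be expected to lower a swap like this to rev8
 * (assumption: this depends on the toolchain and flags). */
static uint64_t bswap64(uint64_t v)
{
    v = ((v & 0x00ff00ff00ff00ffULL) << 8)  | ((v >> 8)  & 0x00ff00ff00ff00ffULL);
    v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
    return (v << 32) | (v >> 32);
}

/* Models what a plain 64-bit load yields on a little-endian machine:
 * the first byte of the buffer ends up least significant. */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* What a non-reflected CRC wants: the first byte most significant. */
static uint64_t load_msb_first64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}
```

Here `bswap64(load_le64(p))` equals `load_msb_first64(p)` for any 8-byte buffer `p`. Reflected CRCs can use the little-endian load directly, which is consistent with them needing fewer instructions per fold.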

On a MuseBook (1.6 GHz Spacemit X60), I gathered the following performance numbers. I observed around a 20x increase in throughput for reflected algorithms and 17x for normal (non-reflected) algorithms; the gap is likely due to the extra endianness swap instructions the latter need.

Algorithm	Throughput (MB/s)
Table (Base)	206
CRC16_t10dif_copy	463
CRC16_t10dif	3855
CRC32_gzip_refl	4530
CRC32_IEEE	3855
CRC32_iscsi	4530
CRC64_norm	3856
CRC64_refl	4538

This patch doesn't currently provide a way to pick which version to use at runtime, as the CRC implementations for aarch64 and x86_64 do. The approach they use (reading either cpuid or hwcap) doesn't immediately translate to RISC-V. I have some ideas for alternative routes: either using the Linux RISC-V hwprobe interface, which would require an up-to-date kernel (v6.4+), or detecting at build time with compiler flags (gcc/clang only, and it doesn't help with detection at runtime). It would be great to get your opinion on which approach would be preferred.
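For discussion, the build-time option could look roughly like the sketch below. It assumes the toolchain defines the `__riscv_zbb` and `__riscv_zbc` extension test macros when those extensions are enabled in `-march`; the enum and function names are hypothetical and do not exist in isa-l. The hwprobe route is not shown.

```c
/* Hypothetical names for illustration; not part of isa-l. */
enum crc_impl {
    CRC_IMPL_BASE, /* existing table-based implementation */
    CRC_IMPL_ZBC   /* Zbc/Zbb folding implementation */
};

/* Build-time selection: the choice is fixed when the library is
 * compiled, so a binary built with Zbc enabled will not run on a
 * core that lacks the extension. */
static enum crc_impl crc_select_impl_at_build_time(void)
{
#if defined(__riscv) && (__riscv_xlen == 64) && \
    defined(__riscv_zbb) && defined(__riscv_zbc)
    return CRC_IMPL_ZBC;
#else
    return CRC_IMPL_BASE;
#endif
}
```

On any target where the macros are not defined, including non-RISC-V builds, this falls back to the base implementation.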
