It would be nice to have byteswap instructions. They are simple to implement in hardware and ZFS would benefit from them.
When ZFS imports a pool from a system with opposite endianness, it must byteswap data when verifying checksums. The ZFS code has many versions of its checksumming functions: native endian vs byteswapped, scalar vs SIMD vs superscalar, and ARM vs Intel vs others. The superscalar one is a scalar implementation of the same algorithm that lets us use SIMD. Intel did a write-up explaining the math behind how it works:
https://www.intel.com/content/www/us/en/developer/articles/technical/fast-computation-of-fletcher-checksums.html
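To make the superscalar idea concrete, here is a minimal sketch, assuming a Fletcher-4 style checksum over 32-bit words with 64-bit accumulators and only two interleaved streams. The type and function names are hypothetical and this is not copied from the ZFS source. The point is that the two streams have no data dependency on each other inside the loop, and the ordinary sums can be recovered afterwards with a fixed linear combination.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four running sums. */
typedef struct { uint64_t a, b, c, d; } fletcher4_sums;

/* Plain scalar reference: every word extends one long dependency chain. */
static fletcher4_sums fletcher4_scalar(const uint32_t *w, size_t n)
{
    fletcher4_sums s = {0, 0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        s.a += w[i];
        s.b += s.a;
        s.c += s.b;
        s.d += s.c;
    }
    return s;
}

/*
 * Two-stream sketch: even-indexed words go to stream e, odd-indexed words
 * to stream o. Assumes n is even. The two chains are independent, so a
 * superscalar core can overlap them.
 */
static fletcher4_sums fletcher4_two_streams(const uint32_t *w, size_t n)
{
    fletcher4_sums e = {0, 0, 0, 0}, o = {0, 0, 0, 0};
    for (size_t i = 0; i + 1 < n; i += 2) {
        e.a += w[i];     e.b += e.a; e.c += e.b; e.d += e.c;
        o.a += w[i + 1]; o.b += o.a; o.c += o.b; o.d += o.c;
    }

    /* Recombine the per-stream sums into the sequential result (mod 2^64). */
    fletcher4_sums r;
    r.a = e.a + o.a;
    r.b = 2 * e.b + 2 * o.b - o.a;
    r.c = 4 * e.c - e.b + 4 * o.c - 3 * o.b;
    r.d = 8 * e.d - 4 * e.c + 8 * o.d - 8 * o.c + o.b;
    return r;
}
```

Calling both functions on the same even-length buffer should give identical sums. A SIMD version does the same thing with each stream living in one vector lane.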
When SIMD is not available, the superscalar version is always the fastest, but on architectures without byteswap instructions, performance of the byteswap version of the checksum function can be expected to suffer severely.
We generate so many versions of checksumming functions that I have been experimenting with using high-level GNU C vector code to generate them all. The following has LLVM/Clang compile the high-level GNU C vector code for RV64GC:
https://gcc.godbolt.org/z/xdedG1Tre
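For readers who do not want to follow the link, this is roughly what such GNU C vector code can look like. It is a minimal sketch written for illustration, not the code behind the link: the four-lane layout, the type names and the function names are all assumptions. It relies only on the `vector_size` attribute and `__builtin_bswap32`, which both GCC and Clang provide.

```c
#include <stddef.h>
#include <stdint.h>

/* Four 64-bit lanes in one GNU C vector (GCC and Clang extension). */
typedef uint64_t u64x4 __attribute__((vector_size(32)));

/* Hypothetical state: one vector per Fletcher-4 running sum. */
typedef struct { u64x4 a, b, c, d; } fletcher4_lanes;

static void lanes_init(fletcher4_lanes *s)
{
    u64x4 zero = {0, 0, 0, 0};
    s->a = s->b = s->c = s->d = zero;
}

/* Native endian: lane j accumulates words j, j+4, j+8, ...
 * Assumes n is a multiple of 4. */
static void partial_native(fletcher4_lanes *s, const uint32_t *w, size_t n)
{
    u64x4 a = s->a, b = s->b, c = s->c, d = s->d;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        u64x4 t = {w[i], w[i + 1], w[i + 2], w[i + 3]};
        a += t; b += a; c += b; d += c;
    }
    s->a = a; s->b = b; s->c = c; s->d = d;
}

/* Byteswapped: identical, except every word is swapped before use. */
static void partial_byteswap(fletcher4_lanes *s, const uint32_t *w, size_t n)
{
    u64x4 a = s->a, b = s->b, c = s->c, d = s->d;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        u64x4 t = {__builtin_bswap32(w[i]),     __builtin_bswap32(w[i + 1]),
                   __builtin_bswap32(w[i + 2]), __builtin_bswap32(w[i + 3])};
        a += t; b += a; c += b; d += c;
    }
    s->a = a; s->b = b; s->c = c; s->d = d;
}
```

The compiler lowers the vector operations to whatever the target offers, which on a target without vector support means plain scalar code for each lane. The only difference between the two functions is the swap on load, so any gap in the generated code comes from how the target implements that swap.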
The first function calculates partial sums for native endian checksums, while the second function is the byteswap version. The native endian checksum's loop uses only 22 instructions. The byteswap version's loop uses 70 instructions, roughly 3.2 times as many, so it will likely run at around 1/3 of the speed at best. Byteswap instructions would restore most of the speed of the native endian version.
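To show where the extra instructions come from, here is a minimal sketch of a 32-bit byteswap written out as shifts and masks, which is roughly the work a compiler has to emit for each word when the target has no byteswap instruction. The function names are made up for illustration. The second function uses `__builtin_bswap32`, which GCC and Clang can turn into a single instruction on targets that have one.

```c
#include <stdint.h>

/* Byteswap spelled out: four shifts, two masks and three ORs per word. */
static inline uint32_t bswap32_shifts(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u)
         | (x << 24);
}

/* Same result via the compiler builtin; one instruction where available. */
static inline uint32_t bswap32_builtin(uint32_t x)
{
    return __builtin_bswap32(x);
}
```

Both functions return 0x44332211 for an input of 0x11223344. The checksum loop pays the cost of the first form for every word it loads, which is why the byteswap loop is so much longer than the native one.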