Closed edelsohn closed 2 years ago
I consider myself an expert in x86-64 optimization, and mcl is currently the fastest implementation of the BLS12-381 pairing. It took a long time to reach this score.
The asm generated by LLVM seems to perform well, as the following table shows.
backend (ratio to GMP in parentheses) | mcl on i7-8700 @ 3.2GHz | mcl on s390x |
---|---|---|
xbyak | 2.323clk(0.407) | ? |
asm generated by LLVM + GMP | 3.847clk(0.675) | 1.966msec(0.61) |
GMP | 5.696clk(1.00) | 3.222msec(1.00) |
C | 8.659clk(1.52) | 4.738msec(1.47) |
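For reference, the values in parentheses can be reproduced by dividing each timing by the GMP timing in the same column. This is only a small sketch of that arithmetic; the dictionary keys and the `ratios` helper are names made up for this illustration, and the numbers are copied from the table above.

```python
# Timings copied from the table above (x64 in clk, s390x in msec).
# The dict layout and helper name are illustrative, not part of mcl.
x64 = {"xbyak": 2.323, "LLVM asm + GMP": 3.847, "GMP": 5.696, "C": 8.659}
s390x = {"LLVM asm + GMP": 1.966, "GMP": 3.222, "C": 4.738}


def ratios(timings, baseline="GMP"):
    """Return each timing divided by the baseline timing (lower is faster)."""
    base = timings[baseline]
    return {name: value / base for name, value in timings.items()}


if __name__ == "__main__":
    for label, column in (("x64", x64), ("s390x", s390x)):
        for name, r in ratios(column).items():
            print(f"{label:6s} {name:16s} {r:.3f}")
```

The last digit may differ slightly from the table, presumably because the table was computed from unrounded measurements.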
On x64, with turbo boost disabled:
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
x86-64 supports special instructions such as mulx, adcx, and adox for multi-precision arithmetic. Does s390x support equivalent instructions? If not, I expect hand-optimized code to be at most about 1.2 to 1.4 times faster. Optimizing for s390x is an interesting subject, but I don't have time to do it.
cf. The score of https://github.com/supranational/blst/commit/dc79d429fa4c63a53f4b1f8cb01d90cb9c2eccf0 is 2.426clk in the same x64 environment.
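To make the point about multi-precision arithmetic concrete, here is a minimal sketch of schoolbook multiplication over 64-bit limbs, written in Python purely for illustration. It is not mcl's code, and the helper names (`to_limbs`, `from_limbs`, `mul_limbs`) are hypothetical. The inner step is a 64x64 to 128-bit multiply followed by additions that propagate a carry; on x86-64 that is the kind of work mulx (wide multiply that leaves the flags untouched) and adcx/adox (add-with-carry on two separate flags) are designed for.

```python
MASK64 = (1 << 64) - 1


def to_limbs(x, n):
    """Split a non-negative integer into n little-endian 64-bit limbs."""
    return [(x >> (64 * i)) & MASK64 for i in range(n)]


def from_limbs(limbs):
    """Reassemble an integer from little-endian 64-bit limbs."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def mul_limbs(a, b):
    """Schoolbook product of two little-endian 64-bit limb arrays."""
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            # 64x64 -> 128-bit product, plus the accumulated limb and the
            # incoming carry. The sum always fits in 128 bits.
            t = ai * bj + out[i + j] + carry
            out[i + j] = t & MASK64   # low 64 bits stay in this limb
            carry = t >> 64           # high 64 bits move to the next limb
        out[i + len(b)] = carry       # this limb has not been written yet
    return out


if __name__ == "__main__":
    # Two arbitrary 381-bit operands (6 limbs each), checked against
    # Python's built-in big integers.
    x = (1 << 381) - 12345
    y = (1 << 380) + 6789
    assert from_limbs(mul_limbs(to_limbs(x, 6), to_limbs(y, 6))) == x * y
```

A 381-bit field element needs 6 limbs, so one full product runs the inner step 36 times. An architecture without a wide multiply and cheap carry chains pays for it in exactly this loop.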
Thank you! I glanced at the numbers you posted here in light of the tests I did for the other BLS library, and they looked slower, so I wrongly assumed this would be slower too.
@e-desouza The result on 6 vCPUs is almost the same as my result above. Does the 16 vCPU machine have the same CPU as the 6 vCPU one? What pairing result does the following command give on 16 vCPUs?
make bin/bls12_test.exe MCL_USE_GMP=1 && bin/bls12_test.exe
Generate optimized assembly language implementations of functions required by mcl and demonstrate equivalent relative speedup to x86 and AArch64 implementations. A financial bounty is available.