herumi / mcl

a portable and fast pairing-based cryptography library
BSD 3-Clause "New" or "Revised" License

[$] Optimize BN and BLS12_381 for s390x #123

Closed. edelsohn closed this issue 2 years ago

edelsohn commented 3 years ago

Generate optimized assembly language implementations of functions required by mcl and demonstrate equivalent relative speedup to x86 and AArch64 implementations. A financial bounty is available.

e-desouza commented 3 years ago

Hi @herumi, would you be interested in the bounty?

herumi commented 3 years ago

I think that I'm an expert in x86-64 optimization, and mcl is currently the fastest implementation of the BLS12-381 pairing. It took a long time to reach that score.

The asm generated by LLVM seems to perform well, as shown in the following results.

| implementation | mcl on i7-8700 @ 3.2GHz | mcl on s390x |
|---|---|---|
| xbyak | 2.323clk (0.407) | ? |
| asm generated by LLVM + GMP | 3.847clk (0.675) | 1.966msec (0.61) |
| GMP | 5.696clk (1.00) | 3.222msec (1.00) |
| C | 8.659clk (1.52) | 4.738msec (1.47) |
On x64, turbo boost was disabled for the measurement:
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

x86-64 supports special instructions such as mulx, adcx, and adox for multi-precision arithmetic. Does s390x support equivalents? If not, I expect hand-optimized code to be at most about 1.2 to 1.4 times faster. Optimizing for s390x is an interesting subject, but I don't have time to do it.
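For readers unfamiliar with the operations in question, here is a minimal portable sketch of the two multi-limb primitives that instructions like mulx, adcx, and adox speed up: multiplying a multi-limb integer by one 64-bit word, and adding two multi-limb integers with carry propagation. The function names `mulUnitSketch` and `addNSketch` are hypothetical and are not mcl's API. The code assumes a compiler with the `unsigned __int128` extension (GCC or Clang on a 64-bit target), and how well a given compiler maps these loops to carry-chain instructions on s390x is not something this sketch demonstrates.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: GCC/Clang extension, available on 64-bit targets.
using u128 = unsigned __int128;

// Hypothetical helper (not mcl's API).
// z[0..n-1] = low n limbs of x[0..n-1] * y; returns the top limb.
// Limbs are stored least-significant first.
uint64_t mulUnitSketch(uint64_t* z, const uint64_t* x, uint64_t y, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        // (2^64-1)^2 + (2^64-1) < 2^128, so this never overflows.
        u128 t = static_cast<u128>(x[i]) * y + carry;
        z[i] = static_cast<uint64_t>(t);
        carry = static_cast<uint64_t>(t >> 64);
    }
    return carry;
}

// Hypothetical helper (not mcl's API).
// z[0..n-1] = x[0..n-1] + y[0..n-1]; returns the final carry (0 or 1).
uint64_t addNSketch(uint64_t* z, const uint64_t* x, const uint64_t* y, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        u128 t = static_cast<u128>(x[i]) + y[i] + carry;
        z[i] = static_cast<uint64_t>(t);
        carry = static_cast<uint64_t>(t >> 64);
    }
    return carry;
}
```

A Montgomery multiplication over a 381-bit prime interleaves many such multiply and add chains on 6 limbs, which is why instructions that keep two independent carry chains in flight matter for the hand-written x86-64 code.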

cf. The score of blst at https://github.com/supranational/blst/commit/dc79d429fa4c63a53f4b1f8cb01d90cb9c2eccf0 is 2.426clk in the same x64 environment.

e-desouza commented 3 years ago

Thank you! I glanced over the numbers you posted here in light of the tests I did for the other BLS library, and those looked slower, so I wrongly assumed this would be slower too.

herumi commented 3 years ago

@e-desouza The result on 6 vCPUs is almost the same as my result above. Does the 16-vCPU machine have the same CPU as the 6-vCPU one? What pairing result does the following command give on 16 vCPUs?

make bin/bls12_test.exe MCL_USE_GMP=1 && bin/bls12_test.exe