[$] Optimize BN and BLS12_381 for s390x

edelsohn commented 3 years ago

Generate optimized assembly language implementations of functions required by mcl and demonstrate equivalent relative speedup to x86 and AArch64 implementations. A financial bounty is available.

e-desouza commented 3 years ago

Hi @herumi, would you be interested in the bounty?

herumi commented 3 years ago

I consider myself an expert in x86-64 optimization, and mcl is currently the fastest implementation of the BLS12-381 pairing. It took a long time to reach that score.

The asm generated by LLVM seems to perform well, as the following pairing timings show (the number in parentheses is the ratio to the GMP build).

implementation	mcl on i7-8700 @ 3.2GHz	mcl on s390x
xbyak	2.323clk(0.407)	?
asm generated by LLVM + GMP	3.847clk(0.675)	1.966msec(0.61)
GMP	5.696clk(1.00)	3.222msec(1.00)
C	8.659clk(1.52)	4.738msec(1.47)

On x64, turbo boost was disabled for the measurement:
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"

x86-64 has special instructions such as mulx, adcx, and adox for multi-precision arithmetic. Does s390x have equivalents? If not, I expect hand-optimized code to be at most about 1.2 to 1.4 times faster. Optimizing for s390x is an interesting subject, but I don't have time to do it.
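
For readers less familiar with multi-precision arithmetic, here is a minimal C sketch of the two inner loops those instructions speed up. It is not mcl's actual code: the names mp_add and mp_mul_unit are hypothetical, limbs are assumed to be 64-bit and stored least significant first, and unsigned __int128 is a GCC/Clang extension. On x86-64, mulx yields the 128-bit product without modifying the flags, and adcx/adox carry through CF and OF respectively, so two carry chains can run interleaved. Plain C has to recover each carry with a comparison instead.

```c
#include <stddef.h>
#include <stdint.h>

/* z = x + y over n limbs; returns the carry out of the top limb (0 or 1). */
uint64_t mp_add(uint64_t *z, const uint64_t *x, const uint64_t *y, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = x[i] + carry;
        uint64_t c1 = s < carry;   /* wrapped while adding the incoming carry */
        s += y[i];
        uint64_t c2 = s < y[i];    /* wrapped while adding y[i] */
        z[i] = s;
        carry = c1 | c2;           /* at most one of the two can be set */
    }
    return carry;
}

/* z[0..n] = x[0..n-1] * y: one row of schoolbook multiplication.
 * z must have room for n + 1 limbs. */
void mp_mul_unit(uint64_t *z, const uint64_t *x, uint64_t y, size_t n)
{
    uint64_t hi = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64x64 -> 128-bit product plus the high limb of the previous step;
         * this sum always fits in 128 bits. */
        unsigned __int128 t = (unsigned __int128)x[i] * y + hi;
        z[i] = (uint64_t)t;
        hi = (uint64_t)(t >> 64);
    }
    z[n] = hi;
}
```

A 381-bit field element takes six such limbs, so a field multiplication runs several of these rows and then has to merge their carries, which is where the separate carry chains help.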

For comparison, the score of https://github.com/supranational/blst/commit/dc79d429fa4c63a53f4b1f8cb01d90cb9c2eccf0 is 2.426clk in the same x64 environment.

e-desouza commented 3 years ago

Thank you! I glanced over the numbers you posted here. Based on the tests I did for the other BLS library, those looked slower, so I wrongly assumed this would be slower too.

herumi commented 3 years ago

@e-desouza The result on 6 vCPUs is almost the same as my result above. Does the 16-vCPU machine have the same CPU as the 6-vCPU one? What pairing result does the following command give on 16 vCPUs?

make bin/bls12_test.exe MCL_USE_GMP=1 && bin/bls12_test.exe

herumi / mcl

[$] Optimize BN and BLS12_381 for s390x #123