mratsim closed 4 years ago
Explanation on scalar multiplication slowness:
The order r is of bit width 255, but the benchmark is using 381 bits:
https://github.com/status-im/nim-blscurve/blob/1a18d0dbea6f1f00409c660be58eb5206060cb3c/benchmarks/bls12381_curve.nim#L27-L41
This means 50% more operations.
Actually, looking in the code, scalar mul uses nbits, which does check (and leaks) the number of exponent bits:
https://github.com/status-im/nim-blscurve/blob/e2ddcc468f0cbf86b912364eaf46b74b0bd1ff27/blscurve/csources/64/big_384_58.c#L1040-L1057
https://github.com/status-im/nim-blscurve/blob/1a18d0dbea6f1f00409c660be58eb5206060cb3c/blscurve/csources/64/ecp_BLS381.c#L1100-L1110
However, there is a special path in pair_BLS381.c that uses endomorphism acceleration, if activated:
https://github.com/status-im/nim-blscurve/blob/1a18d0dbea6f1f00409c660be58eb5206060cb3c/blscurve/csources/64/pair_BLS381.c#L763-L810
It should be benchmarked as well, and should probably be used instead of the generic ecp_mul.
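To make the bit-width argument concrete, here is a minimal sketch of a left-to-right double-and-add loop that counts its own operations. It is not Milagro's code: the toy group (integers modulo a prime under addition) and all names are assumptions chosen for illustration. The loop length is taken from the scalar's bit length, which mirrors the nbits behaviour described above, so the iteration count also reveals how wide the scalar is.

```python
# Toy stand-in for curve points: integers modulo a prime, under addition.
# (Assumption for illustration only; a real implementation works on G1/G2 points.)
MOD = 2**61 - 1


def add(a, b):
    return (a + b) % MOD


def dbl(a):
    return (2 * a) % MOD


def double_and_add(k, point):
    """Left-to-right double-and-add. Returns (k*point, doublings, additions)."""
    acc = 0  # identity element of the toy group
    doublings = additions = 0
    # The loop bound depends on k itself, so timing leaks the scalar's bit width.
    for i in reversed(range(k.bit_length())):
        acc = dbl(acc)
        doublings += 1
        if (k >> i) & 1:
            acc = add(acc, point)
            additions += 1
    return acc, doublings, additions


k255 = (1 << 255) - 1  # a 255-bit scalar, like an element modulo the order r
k381 = (1 << 381) - 1  # a 381-bit scalar, as used in the benchmark

r255, d255, a255 = double_and_add(k255, 5)
r381, d381, a381 = double_and_add(k381, 5)

print(d255, d381, round(d381 / d255, 2))  # 255 381 1.49
```

With a 381-bit scalar the loop runs 381 times instead of 255, which is where the roughly 50% extra work comes from.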
Updated MCL perf as of https://github.com/herumi/mcl/tree/b390d6d4ffded57727ff752fb0209e0d397ed946 (sept 21)
JIT 1
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT 224.863Kclk
G1::mul 207.081Kclk
G1::add 1.225Kclk
G1::dbl 900.69 clk
G2::mulCT 464.260Kclk
G2::mul 402.097Kclk
G2::add 3.414Kclk
G2::dbl 2.120Kclk
GT::pow 684.362Kclk
G1::setStr chk 289.395Kclk
G1::setStr 2.485Kclk
G2::setStr chk 702.704Kclk
G2::setStr 5.159Kclk
hashAndMapToG1 221.856Kclk
hashAndMapToG2 453.844Kclk
Fp::add 13.62 clk
Fp::sub 9.28 clk
Fp::neg 8.05 clk
Fp::mul 101.33 clk
Fp::sqr 98.98 clk
Fp::inv 4.536Kclk
Fp::pow 52.610Kclk
Fr::add 10.24 clk
Fr::sub 9.57 clk
Fr::neg 5.88 clk
Fr::mul 53.96 clk
Fr::sqr 54.48 clk
Fr::inv 1.895Kclk
Fr::pow 19.889Kclk
Fp2::add 21.71 clk
Fp2::sub 16.92 clk
Fp2::neg 14.85 clk
Fp2::mul 310.27 clk
Fp2::mul_xi 29.50 clk
Fp2::sqr 225.28 clk
Fp2::inv 5.093Kclk
FpDbl::addPre 9.66 clk
FpDbl::subPre 9.60 clk
FpDbl::add 15.96 clk
FpDbl::sub 10.69 clk
FpDbl::mulPre 46.81 clk
FpDbl::sqrPre 40.39 clk
FpDbl::mod 58.99 clk
Fp2Dbl::mulPre 187.35 clk
Fp2Dbl::sqrPre 111.38 clk
GT::add 114.84 clk
GT::mul 6.364Kclk
GT::sqr 4.603Kclk
GT::inv 15.891Kclk
FpDbl::mulPre 46.84 clk
pairing 1.980Mclk
millerLoop 841.630Kclk
finalExp 1.115Mclk
precomputeG2 202.196Kclk
precomputedML 613.377Kclk
millerLoopVec 4.277Mclk
ctest:module=finalExp
finalExp 1.103Mclk
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1 32.363Kclk
naiveG2 17.237Kclk
calcBN2 64.593Kclk
naiveG2 46.927Kclk
BLS12_381
calcBN1 76.468Kclk
naiveG1 55.494Kclk
calcBN2 156.801Kclk
naiveG2 128.367Kclk
ctest:module=deserialize
verifyOrder(1)
deserializeG1 346.770Kclk
deserializeG2 818.290Kclk
verifyOrder(0)
deserializeG1 59.815Kclk
deserializeG2 119.463Kclk
ctest:module=verifyG1
ctest:module=verifyG2
ctest:name=bls12_test, module=9, total=3727, ok=3727, ng=0, exception=0
Pairing has been accelerated by 14%
How to reproduce
MCL
nim-blscurve
Results
On an i9-9980XE. Note: the CPU is overclocked at 4.1 GHz while the nominal clock is 3.0 GHz, so the cycle count is off by a factor of 4.1/3.0.
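The correction can be sketched as a small helper. This assumes the cycle counter ticks at the nominal 3.0 GHz while the core actually runs at 4.1 GHz, so reported counts underestimate the real core cycles; the function names and the sample value are hypothetical.

```python
def core_cycles(reported_cycles, actual_ghz=4.1, nominal_ghz=3.0):
    """Scale a reported cycle count to actual core cycles.

    Assumption: the counter advances at the nominal frequency,
    while the core executes at the overclocked frequency.
    """
    return reported_cycles * actual_ghz / nominal_ghz


def wall_time_us(reported_cycles, nominal_ghz=3.0):
    """Wall-clock time in microseconds for a count taken at the nominal rate."""
    return reported_cycles / (nominal_ghz * 1e3)


reported = 3_000_000  # hypothetical benchmark reading
print(core_cycles(reported))   # about 4.1 million core cycles
print(wall_time_us(reported))  # 1000.0 microseconds
```

Ratios between two libraries measured on the same machine are unaffected by this factor, since it applies equally to both sides.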
Reading results:
sign is a scalar multiplication on G2
pairing is composed, among others, of the Miller Loop and the Final Exponentiation
pairing is verify
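As a sanity check on the pairing breakdown, the following sketch parses lines in the `name value[K|M]clk` shape of the MCL listing and compares millerLoop plus finalExp against pairing. The parser is an assumption written for this note, not part of MCL; the three sample lines are copied from the updated MCL output.

```python
import re

_UNITS = {"": 1.0, "K": 1e3, "M": 1e6}
# name, whitespace, number, optional space, optional K/M prefix, "clk"
_LINE = re.compile(r"^(?P<name>.+?)\s+(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[KM]?)clk$")


def parse_clk(line):
    """Return (name, cycles) for a benchmark line, or None for other lines."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    return m.group("name"), float(m.group("value")) * _UNITS[m.group("unit")]


sample = [
    "pairing 1.980Mclk",
    "millerLoop 841.630Kclk",
    "finalExp 1.115Mclk",
]
cycles = dict(parse_clk(line) for line in sample)

# Share of the pairing cost explained by its two main components.
share = (cycles["millerLoop"] + cycles["finalExp"]) / cycles["pairing"]
print(round(share, 3))  # 0.988
```

The two components account for nearly all of the pairing cost in that listing, which is why they are the natural targets when comparing verification speed.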
MCL using JIT (x86-only)
Highlighted the important parts
MCL using Assembly from LLVM i256 and i384 (x86 and ARM)
Nim-blscurve using Milagro
Conclusion
(Our cycles and MCL clocks/clk are the same unit.)
Scalar Multiplication G2 / Signing is about 8x slower
Pairing / Verification is about 3x slower