Benchmark vs MCL on x86

mratsim commented 4 years ago

How to reproduce

MCL

git clone https://github.com/herumi/mcl
cd mcl

make -j ${ncpu}
make bin/bls12_test.exe # even on Linux

bin/bls12_test.exe

nim-blscurve

git clone https://github.com/status-im/nim-blscurve
cd nim-blscurve
nimble bench

Results

On i9-9980XE. Note: Overclocked at 4.1GHz while nominal clock is 3.0 GHz so the cycle count is off by a factor 4.1/3.0

Reading results:

G2 multiplications is sign
pairing is composed among others of Miller Loop and Final Exponentiation pairing is verify

MCL using JIT (x86-only)

Highlighted the important parts

$  bin/bls12_test.exe
JIT 1
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT      198.538Kclk <-------------------------
G1::mul        207.139Kclk <-------------------------
G1::add          1.252Kclk <-------------------------
G1::dbl        880.09 clk
G2::mulCT      383.361Kclk <-------------------------
G2::mul        407.538Kclk <-------------------------
G2::add          3.424Kclk <-------------------------
G2::dbl          2.169Kclk
GT::pow        802.285Kclk
G1::setStr chk 289.740Kclk
G1::setStr       2.606Kclk
G2::setStr chk 723.913Kclk
G2::setStr       5.467Kclk
hashAndMapToG1 225.867Kclk
hashAndMapToG2 467.456Kclk
Fp::add         14.27 clk
Fp::sub          9.28 clk
Fp::neg          8.05 clk
Fp::mul        107.01 clk
Fp::sqr         98.88 clk
Fp::inv          4.542Kclk
Fp2::add        19.23 clk
Fp2::sub        14.30 clk
Fp2::neg        13.90 clk
Fp2::mul       305.68 clk
Fp2::mul_xi     22.51 clk
Fp2::sqr       228.53 clk
Fp2::inv         5.111Kclk
FpDbl::addPre   13.24 clk
FpDbl::subPre   13.21 clk
FpDbl::add      19.52 clk
FpDbl::sub      12.88 clk
FpDbl::mulPre   46.70 clk
FpDbl::sqrPre   40.28 clk
FpDbl::mod      58.60 clk
Fp2Dbl::mulPre  203.47 clk
Fp2Dbl::sqrPre  113.19 clk
GT::add        114.57 clk
GT::mul          7.663Kclk
GT::sqr          5.490Kclk
GT::inv         16.616Kclk
FpDbl::mulPre   46.79 clk
pairing          2.237Mclk <-------------------------
millerLoop     906.605Kclk <-------------------------
finalExp         1.329Mclk <-------------------------
precomputeG2   201.104Kclk
precomputedML  686.266Kclk
millerLoopVec    4.616Mclk
ctest:module=finalExp
finalExp   1.338Mclk
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1  32.201Kclk
naiveG2  17.132Kclk
calcBN2  64.564Kclk
naiveG2  46.680Kclk
BLS12_381
calcBN1  77.113Kclk
naiveG1  56.032Kclk
calcBN2 157.989Kclk
naiveG2 129.702Kclk
ctest:module=eth2
mapToG2  org-cofactor 904.585Kclk
mapToG2 fast-cofactor 476.747Kclk
ctest:module=deserialize
verifyOrder(1)
deserializeG1 345.128Kclk
deserializeG2 842.046Kclk
verifyOrder(0)
deserializeG1  57.978Kclk
deserializeG2 121.257Kclk
ctest:name=bls12_test, module=8, total=3600, ok=3600, ng=0, exception=0

MCL using Assembly from LLVM i256 and i384 (x86 and ARM)

$  bin/bls12_test.exe -m llvm_mont
JIT 1
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT      208.328Kclk <-------------------------
G1::mul        212.953Kclk <-------------------------
G1::add          1.279Kclk <-------------------------
G1::dbl        919.59 clk
G2::mulCT      522.604Kclk <-------------------------
G2::mul        537.523Kclk <-------------------------
G2::add          4.638Kclk <-------------------------
G2::dbl          2.704Kclk
GT::pow        916.182Kclk
G1::setStr chk 301.920Kclk
G1::setStr       2.684Kclk
G2::setStr chk 935.164Kclk
G2::setStr       5.723Kclk
hashAndMapToG1 239.817Kclk
hashAndMapToG2 582.365Kclk
Fp::add         13.37 clk
Fp::sub         13.86 clk
Fp::neg          8.78 clk
Fp::mul        110.30 clk
Fp::sqr        110.73 clk
Fp::inv          4.594Kclk
Fp2::add        25.68 clk
Fp2::sub        27.10 clk
Fp2::neg        19.76 clk
Fp2::mul       453.66 clk
Fp2::mul_xi     35.83 clk
Fp2::sqr       254.18 clk
Fp2::inv         5.168Kclk
FpDbl::addPre   14.72 clk
FpDbl::subPre   14.64 clk
FpDbl::add      19.02 clk
FpDbl::sub      17.55 clk
FpDbl::mulPre   54.56 clk
FpDbl::sqrPre   51.65 clk
FpDbl::mod      80.36 clk
Fp2Dbl::mulPre  243.72 clk
Fp2Dbl::sqrPre  145.11 clk
GT::add        168.85 clk
GT::mul          8.658Kclk
GT::sqr          6.252Kclk
GT::inv         20.416Kclk
FpDbl::mulPre   53.03 clk
pairing          2.717Mclk <-------------------------
millerLoop       1.105Mclk <-------------------------
finalExp         1.550Mclk <-------------------------
precomputeG2   269.778Kclk
precomputedML  852.854Kclk
millerLoopVec    5.598Mclk
ctest:module=finalExp
finalExp   1.547Mclk
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1  34.282Kclk
naiveG2  17.954Kclk
calcBN2  65.749Kclk
naiveG2  48.280Kclk
BLS12_381
calcBN1  82.024Kclk
naiveG1  60.722Kclk
calcBN2 167.170Kclk
naiveG2 139.621Kclk
ctest:module=eth2
mapToG2  org-cofactor   1.150Mclk
mapToG2 fast-cofactor 585.033Kclk
ctest:module=deserialize
verifyOrder(1)
deserializeG1 366.132Kclk
deserializeG2   1.152Mclk
verifyOrder(0)
deserializeG1  62.822Kclk
deserializeG2 130.251Kclk
ctest:name=bls12_test, module=8, total=3600, ok=3600, ng=0, exception=0

Nim-blscurve using Milagro

Warmup: 0.9038 s, result 224 (displayed to avoid compiler optimizing warmup away)

Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz

⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
i.e. a 20% overclock will be about 20% off (assuming no dynamic frequency scaling)

=================================================================================================================

Scalar multiplication G1                                   2399.307 ops/s       416787 ns/op      1250378 cycles
Scalar multiplication G2                                    890.692 ops/s      1122722 ns/op      3368208 cycles
EC add G1                                                911577.028 ops/s         1097 ns/op         3291 cycles
EC add G2                                                323310.702 ops/s         3093 ns/op         9281 cycles
Pairing (Milagro builtin double pairing)                    420.039 ops/s      2380729 ns/op      7142276 cycles
Pairing (Multi-Pairing with delayed Miller and Exp)         417.711 ops/s      2393999 ns/op      7182089 cycles

⚠️ Warning: using draft v5 of IETF Hash-To-Curve (HKDF-based).
           This is an outdated draft.

Hash to G2 (Draft #5)                                       937.034 ops/s      1067197 ns/op      3201632 cycles

Conclusion

(Our cycles and MCL clocks/clk are the same unit.)

Scalar Multiplication G2 / Signing is about 8x slower Pairing / Verification is about 3x slower

mratsim commented 4 years ago

Explanation on scalar multiplication slowness:

The order r is of bit width 255, but the benchmark is using 381 bits https://github.com/status-im/nim-blscurve/blob/1a18d0dbea6f1f00409c660be58eb5206060cb3c/benchmarks/bls12381_curve.nim#L27-L41 This means 50% more operations

mratsim commented 4 years ago

Actually, looking in the code, scalar mul uses nbits which does check (and leaks) the number of exponent bits:

https://github.com/status-im/nim-blscurve/blob/e2ddcc468f0cbf86b912364eaf46b74b0bd1ff27/blscurve/csources/64/big_384_58.c#L1040-L1057 https://github.com/status-im/nim-blscurve/blob/1a18d0dbea6f1f00409c660be58eb5206060cb3c/blscurve/csources/64/ecp_BLS381.c#L1100-L1110

However there is a special path in pair_BLS381.c to use endomorphism acceleration if activated https://github.com/status-im/nim-blscurve/blob/1a18d0dbea6f1f00409c660be58eb5206060cb3c/blscurve/csources/64/pair_BLS381.c#L763-L810 which should be benchmarked as well and probably be used instead of the generic ecp_mul

mratsim commented 4 years ago

Updated MCL perf as of https://github.com/herumi/mcl/tree/b390d6d4ffded57727ff752fb0209e0d397ed946 (sept 21)

JIT 1
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT      224.863Kclk
G1::mul        207.081Kclk
G1::add          1.225Kclk
G1::dbl        900.69 clk
G2::mulCT      464.260Kclk
G2::mul        402.097Kclk
G2::add          3.414Kclk
G2::dbl          2.120Kclk
GT::pow        684.362Kclk
G1::setStr chk 289.395Kclk
G1::setStr       2.485Kclk
G2::setStr chk 702.704Kclk
G2::setStr       5.159Kclk
hashAndMapToG1 221.856Kclk
hashAndMapToG2 453.844Kclk
Fp::add         13.62 clk
Fp::sub          9.28 clk
Fp::neg          8.05 clk
Fp::mul        101.33 clk
Fp::sqr         98.98 clk
Fp::inv          4.536Kclk
Fp::pow         52.610Kclk
Fr::add         10.24 clk
Fr::sub          9.57 clk
Fr::neg          5.88 clk
Fr::mul         53.96 clk
Fr::sqr         54.48 clk
Fr::inv          1.895Kclk
Fr::pow         19.889Kclk
Fp2::add        21.71 clk
Fp2::sub        16.92 clk
Fp2::neg        14.85 clk
Fp2::mul       310.27 clk
Fp2::mul_xi     29.50 clk
Fp2::sqr       225.28 clk
Fp2::inv         5.093Kclk
FpDbl::addPre    9.66 clk
FpDbl::subPre    9.60 clk
FpDbl::add      15.96 clk
FpDbl::sub      10.69 clk
FpDbl::mulPre   46.81 clk
FpDbl::sqrPre   40.39 clk
FpDbl::mod      58.99 clk
Fp2Dbl::mulPre  187.35 clk
Fp2Dbl::sqrPre  111.38 clk
GT::add        114.84 clk
GT::mul          6.364Kclk
GT::sqr          4.603Kclk
GT::inv         15.891Kclk
FpDbl::mulPre   46.84 clk
pairing          1.980Mclk
millerLoop     841.630Kclk
finalExp         1.115Mclk
precomputeG2   202.196Kclk
precomputedML  613.377Kclk
millerLoopVec    4.277Mclk
ctest:module=finalExp
finalExp   1.103Mclk
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1  32.363Kclk
naiveG2  17.237Kclk
calcBN2  64.593Kclk
naiveG2  46.927Kclk
BLS12_381
calcBN1  76.468Kclk
naiveG1  55.494Kclk
calcBN2 156.801Kclk
naiveG2 128.367Kclk
ctest:module=deserialize
verifyOrder(1)
deserializeG1 346.770Kclk
deserializeG2 818.290Kclk
verifyOrder(0)
deserializeG1  59.815Kclk
deserializeG2 119.463Kclk
ctest:module=verifyG1
ctest:module=verifyG2
ctest:name=bls12_test, module=9, total=3727, ok=3727, ng=0, exception=0

Pairing has been accelerated by 14%

status-im / nim-blscurve