status-im / nim-blscurve

Nim implementation of BLS signature scheme (Boneh-Lynn-Shacham) over Barreto-Lynn-Scott (BLS) curve BLS12-381
Apache License 2.0

Benchmarks vs MCL on ARM #28

Closed. mratsim closed this issue 3 years ago.

mratsim commented 4 years ago

For reference, this is MCL's speed on a Raspberry Pi 4 in 32-bit ARM mode:

https://github.com/mratsim/mcl/blob/2b318e84/bench-arm32-pi4.log

JIT 0
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT        1.932msec
G1::mul          1.976msec
G1::add         11.765usec
G1::dbl          8.713usec
G2::mulCT        4.389msec
G2::mul          4.527msec
G2::add         39.784usec
G2::dbl         21.466usec
GT::pow          7.413msec
G1::setStr chk   2.834msec
G1::setStr      12.919usec
G2::setStr chk   7.554msec
G2::setStr      26.452usec
hashAndMapToG1   2.722msec
hashAndMapToG2   5.766msec
Fp::add         47.45nsec
Fp::sub         56.04nsec
Fp::neg         29.36nsec
Fp::mul        894.09nsec
Fp::sqr          1.371usec
Fp::inv        121.102usec
Fp2::add        93.46nsec
Fp2::sub       109.45nsec
Fp2::neg        58.77nsec
Fp2::mul         4.102usec
Fp2::mul_xi    116.80nsec
Fp2::sqr         1.948usec
Fp2::inv       129.694usec
FpDbl::addPre   83.59nsec
FpDbl::subPre   84.77nsec
FpDbl::add      83.48nsec
FpDbl::sub      86.79nsec
FpDbl::mulPre  888.74nsec
FpDbl::sqrPre  846.63nsec
FpDbl::mod     525.81nsec
Fp2Dbl::mulPre    3.015usec
Fp2Dbl::sqrPre    1.910usec
GT::add        570.59nsec
GT::mul         71.228usec
GT::sqr         49.861usec
GT::inv        258.619usec
FpDbl::mulPre  888.75nsec
pairing         21.020msec
millerLoop       9.109msec
finalExp        11.902msec
precomputeG2     2.191msec
precomputedML    6.881msec
millerLoopVec   47.449msec
ctest:module=finalExp
finalExp  11.990msec
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1 499.901usec
naiveG2 214.120usec
calcBN2 981.032usec
naiveG2 725.049usec
BLS12_381
calcBN1   1.122msec
naiveG1 690.237usec
calcBN2   2.271msec
naiveG2   1.818msec
ctest:module=eth2
mapToG2  org-cofactor  10.949msec
mapToG2 fast-cofactor   6.192msec
ctest:name=bls12_test, module=7, total=832, ok=832, ng=0, exception=0

Reproduction commands:

git clone git@github.com:herumi/mcl
cd mcl
make bin/bls12_test.exe MCL_USE_GMP=0 MCL_USE_OPENSSL=0
bin/bls12_test.exe
mratsim commented 4 years ago

On the same Raspberry Pi 4B (Broadcom BCM2711, quad-core Cortex-A72 ARM v8 64-bit SoC @ 1.5 GHz):

Note: I am using 32-bit Raspbian, so Milagro was compiled with 32-bit limbs; you can expect about 2x more performance in 64-bit mode:

Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true                                          
⚠️ Warning: using Milagro with 32-bit limbs

=================================================================================================================    

Scalar multiplication G1                                    194.349 ops/s      5145377 ns/op                         
Scalar multiplication G2                                     71.983 ops/s     13892224 ns/op                         
EC add G1                                                 75746.099 ops/s        13202 ns/op                         
EC add G2                                                 25377.490 ops/s        39405 ns/op                         
Pairing (Milagro builtin double pairing)                     38.146 ops/s     26214868 ns/op                         
Pairing (Multi-Pairing with delayed Miller and Exp)          38.008 ops/s     26310474 ns/op                         

⚠️ Warning: using draft v5 of IETF Hash-To-Curve (HKDF-based).                                                        
           This is an outdated draft.

Hash to G2 (Draft #5)                                        75.663 ops/s     13216584 ns/op
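
The two columns in this table are just reciprocals of each other (up to the ns/s conversion). A minimal sketch of the arithmetic, plugging in the G1 scalar-multiplication figure from the table above:

# Sketch of how the two columns relate: ops/s = 1e9 / (ns/op).
proc opsPerSec(nsPerOp: float): float =
  1e9 / nsPerOp

when isMainModule:
  # G1 scalar multiplication on the Pi 4 with 32-bit limbs: 5_145_377 ns/op
  echo opsPerSec(5_145_377.0)   # ~194.35 ops/s, matching the 194.349 in the table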
mratsim commented 4 years ago

On a Huawei P20 Lite phone (a 2018 entry/mid-range phone) with a HiSilicon Kirin 659 processor @ 2360 MHz (ARM v8).

Milagro in 64-bit mode

Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs

=================================================================================================================

Scalar multiplication G1                                    284.394 ops/s      3516247 ns/op
Scalar multiplication G2                                    102.853 ops/s      9722601 ns/op
EC add G1                                                105864.916 ops/s         9446 ns/op
EC add G2                                                 36911.265 ops/s        27092 ns/op
Pairing (Milagro builtin double pairing)                     50.957 ops/s     19624477 ns/op
Pairing (Multi-Pairing with delayed Miller and Exp)          50.068 ops/s     19972640 ns/op

⚠️ Warning: using draft v5 of IETF Hash-To-Curve (HKDF-based).
           This is an outdated draft.

Hash to G2 (Draft #5)                                       117.156 ops/s      8535621 ns/op
mratsim commented 4 years ago

MCL on the same phone, 64-bit as well

EC add G2 is 2x faster than Milagro.
Scalar multiplication is 2x faster than Milagro.
Pairing is 2.5x faster than Milagro.
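
As a sanity check (my own arithmetic, not output of either benchmark tool), dividing the Milagro ns/op figures above by the corresponding MCL per-op times from the dump below:

# Ratios from the two dumps on the same phone (hand-computed, not benchmark output).
# Milagro "Pairing (builtin double pairing)": 19_624_477 ns/op
# MCL "pairing": 7.627 msec
echo 19_624_477.0 / 7_627_000.0   # ~2.57x, i.e. the ~2.5x pairing claim
# Milagro "EC add G2": 27_092 ns/op vs MCL "G2::add": 14.409 usec
echo 27_092.0 / 14_409.0          # ~1.88x, roughly the 2x claim for G2 add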

JIT 0
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT      741.964usec
G1::mul        760.436usec
G1::add          4.797usec
G1::dbl          3.202usec
G2::mulCT        1.627msec
G2::mul          1.675msec
G2::add         14.409usec
G2::dbl          8.494usec
GT::pow          2.598msec
G1::setStr chk   1.078msec
G1::setStr      11.047usec
G2::setStr chk   2.926msec
G2::setStr      22.819usec
hashAndMapToG1   1.373msec
hashAndMapToG2   2.767msec
Fp::add         23.30nsec
Fp::sub         22.19nsec
Fp::neg         18.86nsec
Fp::mul        410.98nsec
Fp::sqr        412.34nsec
Fp::inv        147.942usec
Fp2::add        47.42nsec
Fp2::sub        45.46nsec
Fp2::neg        34.94nsec
Fp2::mul         1.395usec
Fp2::mul_xi     70.49nsec
Fp2::sqr       917.64nsec
Fp2::inv       151.215usec
FpDbl::addPre   41.34nsec
FpDbl::subPre   43.81nsec
FpDbl::add      44.43nsec
FpDbl::sub      42.14nsec
FpDbl::mulPre  243.68nsec
FpDbl::sqrPre  159.24nsec
FpDbl::mod     228.10nsec
Fp2Dbl::mulPre  912.43nsec
Fp2Dbl::sqrPre  552.47nsec
GT::add        293.75nsec
GT::mul         24.773usec
GT::sqr         17.505usec
GT::inv        202.780usec
FpDbl::mulPre  243.44nsec
pairing          7.627msec
millerLoop       3.251msec
finalExp         4.361msec
precomputeG2   823.918usec
precomputedML    2.424msec
millerLoopVec   18.560msec
ctest:module=finalExp
finalExp   4.355msec
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1 386.328usec
naiveG2  75.901usec
calcBN2 730.271usec
naiveG2 456.292usec
BLS12_381
calcBN1 695.115usec
naiveG1 243.682usec
calcBN2   1.440msec
naiveG2 935.823usec
ctest:module=eth2
mapToG2  org-cofactor   4.951msec
mapToG2 fast-cofactor   3.194msec
ctest:module=deserialize
verifyOrder(1)
deserializeG1   1.500msec
deserializeG2   3.913msec
verifyOrder(0)
deserializeG1 432.423usec
deserializeG2   1.009msec
ctest:name=bls12_test, module=8, total=3600, ok=3600, ng=0, exception=0