Closed by mratsim 3 years ago
On the same Raspberry Pi 4B (Broadcom BCM2711, quad-core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5GHz)
Note: I am using 32-bit Raspbian, so Milagro was compiled with 32-bit limbs; you can expect 2x more perf in 64-bit mode:
Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
⚠️ Warning: using Milagro with 32-bit limbs
=================================================================================================================
Scalar multiplication G1 194.349 ops/s 5145377 ns/op
Scalar multiplication G2 71.983 ops/s 13892224 ns/op
EC add G1 75746.099 ops/s 13202 ns/op
EC add G2 25377.490 ops/s 39405 ns/op
Pairing (Milagro builtin double pairing) 38.146 ops/s 26214868 ns/op
Pairing (Multi-Pairing with delayed Miller and Exp) 38.008 ops/s 26310474 ns/op
⚠️ Warning: using draft v5 of IETF Hash-To-Curve (HKDF-based).
This is an outdated draft.
Hash to G2 (Draft #5) 75.663 ops/s 13216584 ns/op
On a Huawei P20 Lite phone (2018 entry/midrange phone), processor HiSilicon Kirin 659 @ 2360MHz (ARM v8)
Milagro in 64-bit mode
Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs
=================================================================================================================
Scalar multiplication G1 284.394 ops/s 3516247 ns/op
Scalar multiplication G2 102.853 ops/s 9722601 ns/op
EC add G1 105864.916 ops/s 9446 ns/op
EC add G2 36911.265 ops/s 27092 ns/op
Pairing (Milagro builtin double pairing) 50.957 ops/s 19624477 ns/op
Pairing (Multi-Pairing with delayed Miller and Exp) 50.068 ops/s 19972640 ns/op
⚠️ Warning: using draft v5 of IETF Hash-To-Curve (HKDF-based).
This is an outdated draft.
Hash to G2 (Draft #5) 117.156 ops/s 8535621 ns/op
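The two numeric columns in the Milagro logs above carry the same measurement in two units: ops/s is 10^9 divided by ns/op. A minimal sketch of that conversion, checked against two rows copied from the logs above (the function names are illustrative and not part of the benchmark harness):

```python
NS_PER_SECOND = 1_000_000_000

def ops_per_second(ns_per_op: float) -> float:
    """Convert a latency in nanoseconds per operation to a throughput."""
    return NS_PER_SECOND / ns_per_op

def ns_per_op(ops_per_sec: float) -> float:
    """Convert a throughput in operations per second to a latency in ns."""
    return NS_PER_SECOND / ops_per_sec

# (reported ops/s, reported ns/op), copied from the logs above
rows = {
    "Scalar multiplication G1 (Pi 4, 32-bit limbs)": (194.349, 5145377),
    "Pairing, Milagro builtin (P20 Lite, 64-bit limbs)": (50.957, 19624477),
}

for name, (reported_ops, reported_ns) in rows.items():
    derived = ops_per_second(reported_ns)
    print(f"{name}: reported {reported_ops} ops/s, derived {derived:.3f} ops/s")
```

Because ns/op is printed as a whole number, the derived ops/s can differ from the reported value in the last digits.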
MCL on the same phone, 64-bit as well
EC Add G2 is 2x faster than Milagro
Scalar multiplication is 2x faster than Milagro
Pairing is 2.5x faster than Milagro
JIT 0
ctest:module=size
ctest:module=naive
i=0 curve=BLS12_381
G1
G2
GT
G1::mulCT 741.964usec
G1::mul 760.436usec
G1::add 4.797usec
G1::dbl 3.202usec
G2::mulCT 1.627msec
G2::mul 1.675msec
G2::add 14.409usec
G2::dbl 8.494usec
GT::pow 2.598msec
G1::setStr chk 1.078msec
G1::setStr 11.047usec
G2::setStr chk 2.926msec
G2::setStr 22.819usec
hashAndMapToG1 1.373msec
hashAndMapToG2 2.767msec
Fp::add 23.30nsec
Fp::sub 22.19nsec
Fp::neg 18.86nsec
Fp::mul 410.98nsec
Fp::sqr 412.34nsec
Fp::inv 147.942usec
Fp2::add 47.42nsec
Fp2::sub 45.46nsec
Fp2::neg 34.94nsec
Fp2::mul 1.395usec
Fp2::mul_xi 70.49nsec
Fp2::sqr 917.64nsec
Fp2::inv 151.215usec
FpDbl::addPre 41.34nsec
FpDbl::subPre 43.81nsec
FpDbl::add 44.43nsec
FpDbl::sub 42.14nsec
FpDbl::mulPre 243.68nsec
FpDbl::sqrPre 159.24nsec
FpDbl::mod 228.10nsec
Fp2Dbl::mulPre 912.43nsec
Fp2Dbl::sqrPre 552.47nsec
GT::add 293.75nsec
GT::mul 24.773usec
GT::sqr 17.505usec
GT::inv 202.780usec
FpDbl::mulPre 243.44nsec
pairing 7.627msec
millerLoop 3.251msec
finalExp 4.361msec
precomputeG2 823.918usec
precomputedML 2.424msec
millerLoopVec 18.560msec
ctest:module=finalExp
finalExp 4.355msec
ctest:module=mul_012
ctest:module=pairing
ctest:module=multi
BN254
calcBN1 386.328usec
naiveG2 75.901usec
calcBN2 730.271usec
naiveG2 456.292usec
BLS12_381
calcBN1 695.115usec
naiveG1 243.682usec
calcBN2 1.440msec
naiveG2 935.823usec
ctest:module=eth2
mapToG2 org-cofactor 4.951msec
mapToG2 fast-cofactor 3.194msec
ctest:module=deserialize
verifyOrder(1)
deserializeG1 1.500msec
deserializeG2 3.913msec
verifyOrder(0)
deserializeG1 432.423usec
deserializeG2 1.009msec
ctest:name=bls12_test, module=8, total=3600, ok=3600, ng=0, exception=0
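The MCL log reports times with a unit suffix (nsec, usec, msec), while the Milagro logs report ns/op, so comparing the two needs a unit conversion first. A minimal sketch, assuming the Milagro rows "EC add G2" and "Pairing (Milagro builtin double pairing)" correspond to the MCL rows `G2::add` and `pairing`. That pairing of rows is an assumption, since the two harnesses may not measure exactly the same operation. Scalar multiplication is left out because it is less clear which MCL row (`mul` or `mulCT`) is the right counterpart.

```python
import re

# Multipliers from MCL's unit suffixes to nanoseconds
UNIT_TO_NS = {"nsec": 1.0, "usec": 1e3, "msec": 1e6}

def mcl_time_to_ns(text: str) -> float:
    """Parse an MCL timing such as '14.409usec' into nanoseconds."""
    match = re.fullmatch(r"([0-9.]+)(nsec|usec|msec)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized MCL timing: {text!r}")
    return float(match.group(1)) * UNIT_TO_NS[match.group(2)]

# Milagro 64-bit figures on the phone (ns/op), copied from the log above
milagro_ns = {"EC add G2": 27092, "Pairing": 19624477}

# MCL figures on the same phone, copied from the log above
# (row correspondence is an assumption, see the note before this block)
mcl_times = {"EC add G2": "14.409usec", "Pairing": "7.627msec"}

ratios = {
    name: milagro_ns[name] / mcl_time_to_ns(mcl_times[name])
    for name in milagro_ns
}

for name, ratio in ratios.items():
    print(f"{name}: MCL takes 1/{ratio:.2f} of Milagro's time")
```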
For reference, this is MCL speed on ARM-32 Rpi 4
https://github.com/mratsim/mcl/blob/2b318e84/bench-arm32-pi4.log
Reproduction command: