status-im / nim-blst

A Nim wrapper for Supranational BLST (BLS12-381 curve + BLS signature scheme)

Benchmarks #1

Open mratsim opened 4 years ago

mratsim commented 4 years ago

x86-64

nim c -r --passC:-g -d:danger --hints:off --warnings:off --verbosity:0 --outdir:build benchmarks/bench_all.nim
Warmup: 0.9026 s, result 224 (displayed to avoid compiler optimizing warmup away)

Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz

⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
i.e. a 20% overclock will be about 20% off (assuming no dynamic frequency scaling)

=====================================================================================================================

Scalar multiplication G1 (255-bit)                             7649.939 ops/s       130720 ns/op       392165 cycles
Scalar multiplication G2 (255-bit)                             2973.783 ops/s       336272 ns/op      1008830 cycles
EC add G1                                                   1295336.788 ops/s          772 ns/op         2317 cycles
EC add G2                                                    452488.688 ops/s         2210 ns/op         6631 cycles
Pairing (Miller loop + Final Exponentiation)                   1315.289 ops/s       760289 ns/op      2280892 cycles
Hash to G2 (Draft #8)                                          3240.304 ops/s       308613 ns/op       925851 cycles

Broadwell CPUs (Intel, 2015), Ryzen CPUs (AMD, 2017), and later support the "ADX" instructions dedicated to big-integer arithmetic.
You might want to benchmark with --passC:-madx or --passC:"-march=native" to use them.

x86-64 + ADX instructions

nim c -r --passC:"-g -madx" -d:danger --hints:off --warnings:off --verbosity:0 --outdir:build benchmarks/bench_all.nim
Warmup: 0.9030 s, result 224 (displayed to avoid compiler optimizing warmup away)

Compiled with GCC
Optimization level => no optimization: false | release: true | danger: true
Using Milagro with 64-bit limbs
Running on Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz

⚠️ Cycles measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
i.e. a 20% overclock will be about 20% off (assuming no dynamic frequency scaling)

=====================================================================================================================

Scalar multiplication G1 (255-bit)                             9631.777 ops/s       103823 ns/op       311473 cycles
Scalar multiplication G2 (255-bit)                             3768.863 ops/s       265332 ns/op       796006 cycles
EC add G1                                                   1706484.642 ops/s          586 ns/op         1758 cycles
EC add G2                                                    598444.045 ops/s         1671 ns/op         5015 cycles
Pairing (Miller loop + Final Exponentiation)                   1639.054 ops/s       610108 ns/op      1830347 cycles
Hash to G2 (Draft #8)                                          4270.876 ops/s       234144 ns/op       702442 cycles

Broadwell CPUs (Intel, 2015), Ryzen CPUs (AMD, 2017), and later support the "ADX" instructions dedicated to big-integer arithmetic.
You might want to benchmark with --passC:-madx or --passC:"-march=native" to use them.

Comparison

Compare with Milagro and MCL at https://github.com/status-im/nim-blscurve/issues/47

(Benchmark chart: MCL JIT vs BLST)

Analysis:

Side-note on EC Add

MCL's addition is not constant-time: it branches to detect the point at infinity and the cases of adding a point to itself or to its opposite, while BLST handles all cases (add, double, infinity) uniformly.
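Constant-time code replaces such data-dependent branches with masked selects, so the execution trace does not depend on secret inputs. A minimal C sketch of the general technique (an illustration of the idiom, not BLST's actual code):

```c
#include <stdint.h>

/* Branchless select: returns a when choice == 1, b when choice == 0.
   Instead of "if (is_infinity) ..." a constant-time EC add computes
   all candidate results and picks one with a mask, leaking nothing
   through the branch predictor or timing. */
static uint64_t ct_select(uint64_t choice, uint64_t a, uint64_t b) {
    uint64_t mask = (uint64_t)0 - choice;  /* 1 -> all ones, 0 -> zero */
    return (a & mask) | (b & ~mask);
}
```

The cost is doing the work for every case on every call, which is part of why a complete, constant-time add is slower than a branchy one.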

dot-asm commented 4 years ago

Just in case for reference. Among other things performance is also about "perspectives" and priorities. Most notably it's also about multi-processor scalability. This is why some components are not 100% yet. And the keyword is "yet." However, this is not to say that feedback is not appreciated. It certainly is! As well as new pointers and reminders :-) Thanks and cheers!

mratsim commented 4 years ago

> Just in case for reference. Among other things performance is also about "perspectives" and priorities. Most notably it's also about multi-processor scalability. This is why some components are not 100% yet. And the keyword is "yet." However, this is not to say that feedback is not appreciated. It certainly is! As well as new pointers and reminders :-) Thanks and cheers!

Thanks. From discussions with the ConsenSys ZK team during EthCC, they were indeed investigating an issue where they couldn't scale SNARK proving beyond 16 cores and were looking for solutions. It seems to be an important issue for all zero-knowledge actors, as Loopring (which uses a completely different stack) was also only scalable up to 16 cores: https://medium.com/loopring-protocol/zksnark-prover-optimizations-3e9a3e5578c0

I'm not sure what the status is at the moment.