mratsim opened 4 years ago
I am aware of this and don't know any good solutions -- I have tried various relatively simple source-level changes with little success. I made a table of different levels of slowness across gcc/clang/icc/asm and intrinsics/nointrinsics for the Oakland paper linked from the description of the repo, showing similarly dramatic differences back then. Improving low-level compilation, especially when it comes to carry flags, would be a most valuable improvement upon the current state of fiat-crypto.
I've been tracking GCC performance issues and found the following three bugs/mailing-list discussions on carries/borrows:
I don't see any "Oakland" in the description. Is it this paper? http://adam.chlipala.net/papers/FiatCryptoSP19/FiatCryptoSP19.pdf
and this table?
In terms of speed, those are the figures I get in my own library, Constantine. Note that I'm focusing on pairing curves rather than curves with a generalized Mersenne prime modulus, and that inversion uses Fermat's little theorem with a constant-time implementation, so it is quite slow.
⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================
Addition Fp[Secp256k1] 4 ns 14 cycles
Substraction Fp[Secp256k1] 3 ns 10 cycles
Negation Fp[Secp256k1] 2 ns 6 cycles
Multiplication Fp[Secp256k1] 34 ns 104 cycles
Squaring Fp[Secp256k1] 33 ns 99 cycles
Inversion Fp[Secp256k1] 11884 ns 35654 cycles
==========================================================================================================
Addition Fp[BN254] 4 ns 13 cycles
Substraction Fp[BN254] 3 ns 10 cycles
Negation Fp[BN254] 2 ns 6 cycles
Multiplication Fp[BN254] 32 ns 98 cycles
Squaring Fp[BN254] 29 ns 88 cycles
Inversion Fp[BN254] 10708 ns 32124 cycles
==========================================================================================================
Addition Fp[BLS12_381] 9 ns 27 cycles
Substraction Fp[BLS12_381] 5 ns 15 cycles
Negation Fp[BLS12_381] 3 ns 10 cycles
Multiplication Fp[BLS12_381] 62 ns 188 cycles
Squaring Fp[BLS12_381] 57 ns 173 cycles
Inversion Fp[BLS12_381] 29462 ns 88386 cycles
==========================================================================================================
Addition Fp[Secp256k1] 3 ns 9 cycles
Substraction Fp[Secp256k1] 2 ns 6 cycles
Negation Fp[Secp256k1] 0 ns 0 cycles
Multiplication Fp[Secp256k1] 22 ns 68 cycles
Squaring Fp[Secp256k1] 20 ns 62 cycles
Inversion Fp[Secp256k1] 9171 ns 27513 cycles
==========================================================================================================
Addition Fp[BN254] 2 ns 8 cycles
Substraction Fp[BN254] 2 ns 6 cycles
Negation Fp[BN254] 0 ns 0 cycles
Multiplication Fp[BN254] 21 ns 64 cycles
Squaring Fp[BN254] 18 ns 55 cycles
Inversion Fp[BN254] 8509 ns 25528 cycles
==========================================================================================================
Addition Fp[BLS12_381] 3 ns 11 cycles
Substraction Fp[BLS12_381] 2 ns 8 cycles
Negation Fp[BLS12_381] 0 ns 0 cycles
Multiplication Fp[BLS12_381] 45 ns 137 cycles
Squaring Fp[BLS12_381] 39 ns 118 cycles
Inversion Fp[BLS12_381] 22785 ns 68355 cycles
With Clang, my implementation is about 1.72x faster than Fiat Crypto (using the Goff optimizations at https://hackmd.io/@zkteam/modular_multiplication). I'm not using inline assembly except for a cmov in the final conditional subtraction, so the code generated by Fiat-Crypto should be able to get within 10% of that.
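For reference, the branchless select that cmov implements can be sketched in portable C. This is an illustration of the technique, not Constantine's actual code, and a C compiler is not guaranteed to keep it branch-free (which is exactly why inline assembly is used there):

```c
#include <stdint.h>

// Conditionally subtract the 4-limb modulus p from a without a
// data-dependent branch. After a modular addition, a < 2*p, so a
// single conditional subtraction is enough.
static void cond_sub_p(uint64_t a[4], const uint64_t p[4]) {
    uint64_t t[4];
    uint64_t borrow = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t d = a[i] - p[i] - borrow;
        // borrow-out of this limb: a[i] < p[i] + borrow_in
        borrow = (a[i] < p[i]) | ((a[i] == p[i]) & borrow);
        t[i] = d;
    }
    // mask is all-ones if the subtraction underflowed (a < p), else zero
    uint64_t mask = (uint64_t)0 - borrow;
    for (int i = 0; i < 4; i++)
        a[i] = (t[i] & ~mask) | (a[i] & mask);  // select a-p or a, branch-free
}
```

The mask-based select is what a compiler would ideally lower to a cmov; in practice compilers sometimes reintroduce branches, hence the resort to assembly.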
Here are the performance figures of other pairing libraries used for blockchain protocols and/or zero-knowledge proofs. I've used each library's native/included benchmarking facilities and primes; some are generic and some specialized, and some report clock cycles while others report nanoseconds.
Goff, in Go, which also uses code generation, also reaches 45 ns multiplication for a 6-limb prime (BLS12-377 from their example).
(Inversion and exponentiation are not constant-time, and the final conditional subtraction after add/mul/square is not constant-time either.)
$ benchstat bls377.txt
name time/op
InverseELEMENT 4.50µs ± 3%
ExpELEMENT 4.58µs ± 0%
DoubleELEMENT 8.11ns ± 0%
AddELEMENT 4.41ns ±15%
SubELEMENT 5.13ns ±25%
NegELEMENT 3.83ns ± 1%
DivELEMENT 4.54µs ± 1%
FromMontELEMENT 31.1ns ± 0%
ToMontELEMENT 44.9ns ± 0%
SquareELEMENT 41.5ns ± 0%
SqrtELEMENT 38.4µs ±43%
MulAssignELEMENT 44.4ns ± 0%
And for a 4-limb prime (BN254):
$ benchstat bn256.txt
name time/op
InverseELEMENT 1.97µs ± 3%
ExpELEMENT 2.22µs ± 1%
DoubleELEMENT 6.97ns ± 0%
AddELEMENT 3.69ns ±16%
SubELEMENT 3.61ns ±14%
NegELEMENT 2.76ns ± 0%
DivELEMENT 2.02µs ± 1%
FromMontELEMENT 16.3ns ± 0%
ToMontELEMENT 22.5ns ± 0%
SquareELEMENT 19.3ns ± 0%
SqrtELEMENT 7.50µs ± 0%
MulAssignELEMENT 21.8ns ± 0%
Barretenberg is a C++ library specialized for BN254; it uses inline assembly on x86-64. I used the default implementation (MULX/ADC) and not the full optimization (MULX+ADCX+ADOX).
4-limb prime
sqr_assign clocks per operation = 38.9129
sqr_assign clocks per operation = 38.2849
sqr clocks per operation = 51.8644
sqr clocks per operation = 51.718
unary minus clocks per operation = 22.4781
unary minus clocks per operation = 22.4975
static mul assign clocks per operation = 7.50531
static mul assign clocks per operation = 7.43059
static mul assign clocks per operation = 7.45982
mul assign clocks per operation = 40.5682
mul assign clocks per operation = 40.793
mul clocks per operation = 54.6559
mul clocks per operation = 54.6534
self add clocks per operation = 7.35761
self add clocks per operation = 7.35797
self add clocks per operation = 7.36054
add clocks per operation = 20.6533
add clocks per operation = 20.6609
sub clocks per operation = 22.5185
sub clocks per operation = 22.5621
inversion clocks per call = 15030.8
(mul assign, sqr assign, and self add should be the in-place versions)
As far as I know, MCL is the fastest open-source pairing library.
It has two backends: one with assembly generated from LLVM's native wide integers (i256, i384, i448, i512, ...), and one using a JIT assembler.
6-limb prime (BLS12-381), LLVM backend. The LLVM IR source code is something like this https://github.com/herumi/mie/blob/eab53850/src/t.ll, generated from this template https://github.com/herumi/mie/blob/eab53850/src/mul.txt#L48-L80 with wider types.
Fp::add 21.75 clk
Fp::sub 18.96 clk
Fp::neg 8.85 clk
Fp::mul 126.98 clk
Fp::sqr 126.88 clk
Fp::inv 64.019Kclk
6-limb prime (BLS12-381), JIT backend. The JIT uses MULX+ADCX+ADOX on my machine.
Fp::add 20.83 clk
Fp::sub 8.66 clk
Fp::neg 8.05 clk
Fp::mul 105.89 clk
Fp::sqr 98.45 clk
Fp::inv 63.970Kclk
It might be interesting to add an LLVM IR target besides the C, Go, and Rust ones. This target could use the wide integer types (AFAIK 2^22 bits is the maximum width, so even i3072 or i4096 are supported) and would avoid C's limitations.
I mean that paper, but the next table (for p256). I totally should have given you the link, but I had the on-a-mobile-device excuse. Anyway, there are no surprises there -- just yet another observation that C compilers have inconsistent performance with carries.
Relying on an external implementation of large integers (e.g. from clang) is an interesting idea. It does seem reasonable to expect that llvm would do a good job compiling the big-integer operations it specifically supports, so we could get a non-trivial speedup in saturated arithmetic. We also have been trying to keep fiat-crypto from relying on external arithmetic code, but the speedup seems worth the complexity for users who already rely on llvm. It's also great that you linked us to the template -- I think it is a very useful starting point for replicating the same strategy in fiat-crypto. Now, there is only the question of who will implement it in their copious free time -- and right now it will not be me, as I am busy with bedrock2. Thank you very much for the analysis and suggestion, though -- now we have an actionable direction for improving carry performance.
Is https://github.com/mratsim/constantine/blob/master/constantine/arithmetic/montgomery.nim#L121 the code whose performance fiat-crypto should be able to match while outputting C?
This one is the generic Montgomery multiplication; it is used for secp256k1 and all primes (NIST primes, ...) where the last word has its most-significant bit set (i.e. the representation is 0b1111...).
If that bit is not set (i.e. the representation is at most 0b0111...), I use a no-carry version just below, https://github.com/mratsim/constantine/blob/1958356a/constantine/arithmetic/montgomery.nim#L89-L119, as explained here: https://hackmd.io/@zkteam/modular_multiplication
On 6 limbs, the carry version is ~157 cycles and the no-carry version is ~137 cycles. There is a proof in the same link if you want to add this to Fiat.
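To make the reduction step concrete, here is a single-limb Montgomery multiplication sketched in C with 128-bit arithmetic. This is an illustration of the underlying algorithm, not Constantine's multi-limb code; requiring p < 2^63 here mirrors the spirit of the no-carry precondition, since it keeps the intermediate t + m*p from overflowing 128 bits:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

// Compute m' = -p^-1 mod 2^64 by Newton iteration (p must be odd).
static uint64_t neg_inv(uint64_t p) {
    uint64_t inv = p;            // correct to 3 bits since p^2 = 1 mod 8
    for (int i = 0; i < 5; i++)
        inv *= 2 - p * inv;      // each step doubles the correct bits
    return (uint64_t)0 - inv;
}

// Montgomery multiplication: returns a*b*R^-1 mod p, with R = 2^64.
// Assumes p odd and p < 2^63 so that t + m*p fits in 128 bits.
static uint64_t mont_mul(uint64_t a, uint64_t b, uint64_t p, uint64_t mprime) {
    u128 t = (u128)a * b;
    uint64_t m = (uint64_t)t * mprime;   // m = t * (-p^-1) mod 2^64
    u128 u = (t + (u128)m * p) >> 64;    // low 64 bits cancel exactly
    uint64_t r = (uint64_t)u;            // r < 2p
    return r >= p ? r - p : r;           // final conditional subtraction
}
```

In the multi-limb versions discussed above, the same cancellation happens limb by limb, and whether the partial sums can carry past the top word is exactly what separates the carry from the no-carry variant.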
muladd1 and muladd2 are just uint128 code: https://github.com/mratsim/constantine/blob/1958356a/constantine/primitives/extended_precision_64bit_uint128.nim#L57-L99
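The linked primitives amount to the following full-width multiply-accumulate idiom, sketched here in C with unsigned __int128 (the names muladd1/muladd2 are taken from the link above; this is not a copy of the Nim code):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

// (hi, lo) = a * b + c. Cannot overflow: (2^64-1)^2 + (2^64-1) < 2^128.
static void muladd1(uint64_t *hi, uint64_t *lo,
                    uint64_t a, uint64_t b, uint64_t c) {
    u128 t = (u128)a * b + c;
    *hi = (uint64_t)(t >> 64);
    *lo = (uint64_t)t;
}

// (hi, lo) = a * b + c1 + c2. Still cannot overflow:
// (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1 exactly.
static void muladd2(uint64_t *hi, uint64_t *lo,
                    uint64_t a, uint64_t b, uint64_t c1, uint64_t c2) {
    u128 t = (u128)a * b + c1 + c2;
    *hi = (uint64_t)(t >> 64);
    *lo = (uint64_t)t;
}
```

The overflow bounds are what make these safe building blocks for the Montgomery inner loops: two carries can always be absorbed into one 64x64 product without losing bits.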
Hello team,
Congratulations on your project, it's really appreciated.
I've noticed that GCC produces very slow code compared to Clang from the fiat-crypto generated sources.
Bench GCC:
Bench Clang:
Reproduction
The code is for BLS12-381 as generated by Relic in https://github.com/mit-plv/fiat-crypto/pull/670 and https://github.com/relic-toolkit/relic/blob/5cdabd57/src/low/x64-fiat-381/bls12_381_q_64.c
I've wrapped it in Nim with simple benchmark code at https://github.com/mratsim/constantine/tree/1958356a/formal_verification that measures the number of clock cycles and monotonic time over 1M iterations and derives the ns/op and cycles/op.
Flags:
-d:danger: removes all bounds checks and compiles with -O3
Caveats
My CPU has a nominal clock of 3 GHz but is overclocked (all-core turbo) at 4.1 GHz, so the clock cycles and nanoseconds measured are approximate and only valid on my machine.
Suggestion
I'm not very familiar with GCC's standards for performance "bugs" (i.e. would such a report be dismissed?), but for users a notice about the compiler slowness might be worthwhile, given that Clang is over 70% faster than GCC.
Explanation
GCC's performance for multi-precision arithmetic is unfortunately abysmal. In my own cryptographic code the gap with Clang is about 30%, so Fiat is hitting an even worse case.
In particular, it does not handle carries properly (see https://gcc.godbolt.org/z/2h768y), even when using the real _addcarry_u64 intrinsic.
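To show the shape of code involved (a minimal sketch in the spirit of the Godbolt link, not fiat-crypto's generated output), here is a 4-limb addition using _addcarry_u64 on x86-64, with the portable carry pattern that compilers are supposed to recognize and lower to an adc chain as the fallback:

```c
#include <stdint.h>
#if defined(__x86_64__)
#include <x86intrin.h>
#endif

// r = a + b over 4 limbs, returning the carry-out.
static uint64_t add4(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) {
#if defined(__x86_64__)
    unsigned char c = 0;
    for (int i = 0; i < 4; i++) {
        unsigned long long out;
        c = _addcarry_u64(c, a[i], b[i], &out);  // hardware adc chain
        r[i] = out;
    }
    return c;
#else
    uint64_t c = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t = a[i] + c;       // may wrap when c == 1 and a[i] == ~0
        uint64_t c1 = t < a[i];
        r[i] = t + b[i];
        c = c1 | (r[i] < t);         // carry-out of this limb
    }
    return c;
#endif
}
```

Clang turns either version into a sequence of add/adc; GCC tends to materialize the carry in a register and re-add it, which is where much of the slowdown comes from.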
GCC
Clang