mit-plv / fiat-crypto

Cryptographic Primitive Code Generation by Fiat
http://adam.chlipala.net/papers/FiatCryptoSP19/FiatCryptoSP19.pdf

Performance - GCC is very slow compared to Clang (70% slower) #691

mratsim commented 4 years ago

Hello team,

Congratulations on your project, it's really appreciated.

I've noticed that GCC produces much slower code than Clang from the Fiat-generated sources.

Bench GCC:

⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        FiatCrypto[BLS12_381]        12 ns        36 cycles
Substraction    FiatCrypto[BLS12_381]        12 ns        36 cycles
Negation        FiatCrypto[BLS12_381]         9 ns        29 cycles
Multiplication  FiatCrypto[BLS12_381]       135 ns       406 cycles
Squaring        FiatCrypto[BLS12_381]       133 ns       401 cycles
$  gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-pkgversion='Arch Linux 9.3.0-1' --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++,d --enable-shared --enable-threads=posix --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto gdc_include_dir=/usr/include/dlang/gdc
Thread model: posix
gcc version 9.3.0 (Arch Linux 9.3.0-1)

Bench Clang

⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        FiatCrypto[BLS12_381]         8 ns        25 cycles
Substraction    FiatCrypto[BLS12_381]         8 ns        25 cycles
Negation        FiatCrypto[BLS12_381]         6 ns        18 cycles
Multiplication  FiatCrypto[BLS12_381]        79 ns       237 cycles
Squaring        FiatCrypto[BLS12_381]        78 ns       235 cycles
$  clang -v
clang version 9.0.1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/bin/../lib/gcc/x86_64-pc-linux-gnu/9.3.0
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/9.3.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/8.4.0
Found candidate GCC installation: /usr/lib64/gcc/x86_64-pc-linux-gnu/9.3.0
Selected GCC installation: /usr/bin/../lib64/gcc/x86_64-pc-linux-gnu/9.3.0
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
Found CUDA installation: /opt/cuda, version unknown

Reproduction

The code is for BLS12-381 as generated for Relic in https://github.com/mit-plv/fiat-crypto/pull/670 and https://github.com/relic-toolkit/relic/blob/5cdabd57/src/low/x64-fiat-381/bls12_381_q_64.c

I've wrapped it in Nim with simple benchmark code at https://github.com/mratsim/constantine/tree/1958356a/formal_verification that measures the clock cycles and monotonic time of 1M iterations and derives ns/op and cycles/op.

# Build
# Install Nim, available in all Linux distros and MacOS

git clone https://github.com/mratsim/constantine
cd constantine
mkdir build
nim c -d:danger --cc:gcc -o:build/fiat_bls12_381_gcc formal_verification/bls12_381_q_64.nim
nim c -d:danger --cc:clang -o:build/fiat_bls12_381_clang formal_verification/bls12_381_q_64.nim

# Run
build/fiat_bls12_381_gcc
build/fiat_bls12_381_clang

Flags: -d:danger removes all bounds checks and compiles with -O3.

Caveats

My CPU has a nominal clock of 3 GHz but is overclocked (all-core turbo) to 4.1 GHz, so the measured clock cycles and nanoseconds are approximate and only valid on my machine.

Suggestion

I'm not very familiar with GCC's policy on performance "bugs" (i.e. would such a report be dismissed?), but for users a notice about compiler slowness might be worthwhile, given that Clang is over 70% faster than GCC here.

Explanation

GCC's performance on multi-precision arithmetic is unfortunately abysmal. In my own cryptographic code the gap with Clang is about 30%, so Fiat is hitting an even worse case.

In particular, it does not handle carries properly (see https://gcc.godbolt.org/z/2h768y), even when using the dedicated _addcarry_u64 intrinsic:

#include <stdint.h>
#include <x86intrin.h>

/* a += b over 4x64-bit limbs, propagating the carry */
void add256(uint64_t a[4], uint64_t b[4]){
  uint8_t carry = 0;
  for (int i = 0; i < 4; ++i)
    carry = _addcarry_u64(carry, a[i], b[i], &a[i]);
}

GCC

add256:
        movq    (%rsi), %rax
        addq    (%rdi), %rax
        setc    %dl
        movq    %rax, (%rdi)
        movq    8(%rdi), %rax
        addb    $-1, %dl
        adcq    8(%rsi), %rax
        setc    %dl
        movq    %rax, 8(%rdi)
        movq    16(%rdi), %rax
        addb    $-1, %dl
        adcq    16(%rsi), %rax
        setc    %dl
        movq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        addb    $-1, %dl
        adcq    %rax, 24(%rdi)
        ret

Clang

add256:
        movq    (%rsi), %rax
        addq    %rax, (%rdi)
        movq    8(%rsi), %rax
        adcq    %rax, 8(%rdi)
        movq    16(%rsi), %rax
        adcq    %rax, 16(%rdi)
        movq    24(%rsi), %rax
        adcq    %rax, 24(%rdi)
        retq

andres-erbsen commented 4 years ago

I am aware of this and don't know any good solutions -- I have tried various relatively simple source-level changes with little success. I made a table of different levels of slowness across gcc/clang/icc/asm and intrinsics/nointrinsics for the Oakland paper linked from the description of the repo, showing similarly dramatic differences back then. Improving low-level compilation, especially when it comes to carry flags, would be a most valuable improvement upon the current state of fiat-crypto.

mratsim commented 4 years ago

I've been tracking GCC performance issues and found the three following bugs/mailing-list discussions on carries/borrows:

I don't see any "Oakland" in the description. Is it this paper? http://adam.chlipala.net/papers/FiatCryptoSP19/FiatCryptoSP19.pdf

and this table? [screenshot of the benchmark table from the paper]

Speed

In terms of speed, these are the figures I get in my own library, Constantine. Note that I'm focusing on pairing curves, not curves with a generalized Mersenne prime modulus, and inversion uses Fermat's little theorem with a constant-time implementation, so it is quite slow.

GCC

⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[Secp256k1]           4 ns        14 cycles
Substraction    Fp[Secp256k1]           3 ns        10 cycles
Negation        Fp[Secp256k1]           2 ns         6 cycles
Multiplication  Fp[Secp256k1]          34 ns       104 cycles
Squaring        Fp[Secp256k1]          33 ns        99 cycles
Inversion       Fp[Secp256k1]       11884 ns     35654 cycles

⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BN254]               4 ns        13 cycles
Substraction    Fp[BN254]               3 ns        10 cycles
Negation        Fp[BN254]               2 ns         6 cycles
Multiplication  Fp[BN254]              32 ns        98 cycles
Squaring        Fp[BN254]              29 ns        88 cycles
Inversion       Fp[BN254]           10708 ns     32124 cycles

⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BLS12_381]           9 ns        27 cycles
Substraction    Fp[BLS12_381]           5 ns        15 cycles
Negation        Fp[BLS12_381]           3 ns        10 cycles
Multiplication  Fp[BLS12_381]          62 ns       188 cycles
Squaring        Fp[BLS12_381]          57 ns       173 cycles
Inversion       Fp[BLS12_381]       29462 ns     88386 cycles

Clang

⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[Secp256k1]           3 ns         9 cycles
Substraction    Fp[Secp256k1]           2 ns         6 cycles
Negation        Fp[Secp256k1]           0 ns         0 cycles
Multiplication  Fp[Secp256k1]          22 ns        68 cycles
Squaring        Fp[Secp256k1]          20 ns        62 cycles
Inversion       Fp[Secp256k1]        9171 ns     27513 cycles

⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BN254]               2 ns         8 cycles
Substraction    Fp[BN254]               2 ns         6 cycles
Negation        Fp[BN254]               0 ns         0 cycles
Multiplication  Fp[BN254]              21 ns        64 cycles
Squaring        Fp[BN254]              18 ns        55 cycles
Inversion       Fp[BN254]            8509 ns     25528 cycles

⚠️ Measurements are approximate and use the CPU nominal clock: Turbo-Boost and overclocking will skew them.
==========================================================================================================

Addition        Fp[BLS12_381]           3 ns        11 cycles
Substraction    Fp[BLS12_381]           2 ns         8 cycles
Negation        Fp[BLS12_381]           0 ns         0 cycles
Multiplication  Fp[BLS12_381]          45 ns       137 cycles
Squaring        Fp[BLS12_381]          39 ns       118 cycles
Inversion       Fp[BLS12_381]       22785 ns     68355 cycles

Analysis

With Clang, my implementation is about 1.72x faster than Fiat-Crypto (using the Goff optimizations described at https://hackmd.io/@zkteam/modular_multiplication). I'm not using inline assembly except for a cmov in the final conditional subtraction, so the code generated by Fiat-Crypto should be able to reach within 10% of that.

Others

Here are the performance figures of other pairing libraries used in blockchain protocols and/or zero-knowledge proofs. I've used each library's native benchmarking facilities and primes; some implementations are generic and some specialized, and some report clock cycles while others report nanoseconds.

Goff

Goff, written in Go and also using code generation, reaches 45 ns multiplication for a 6-limb prime (BLS12-377 from their example).

(Inversion and exponentiation are not constant-time; the final conditional subtraction after add/mul/square is not constant-time either.)

$  benchstat bls377.txt
name              time/op
InverseELEMENT    4.50µs ± 3%
ExpELEMENT        4.58µs ± 0%
DoubleELEMENT     8.11ns ± 0%
AddELEMENT        4.41ns ±15%
SubELEMENT        5.13ns ±25%
NegELEMENT        3.83ns ± 1%
DivELEMENT        4.54µs ± 1%
FromMontELEMENT   31.1ns ± 0%
ToMontELEMENT     44.9ns ± 0%
SquareELEMENT     41.5ns ± 0%
SqrtELEMENT       38.4µs ±43%
MulAssignELEMENT  44.4ns ± 0%

And for a 4-limb prime (BN254):

$  benchstat bn256.txt 
name              time/op
InverseELEMENT    1.97µs ± 3%
ExpELEMENT        2.22µs ± 1%
DoubleELEMENT     6.97ns ± 0%
AddELEMENT        3.69ns ±16%
SubELEMENT        3.61ns ±14%
NegELEMENT        2.76ns ± 0%
DivELEMENT        2.02µs ± 1%
FromMontELEMENT   16.3ns ± 0%
ToMontELEMENT     22.5ns ± 0%
SquareELEMENT     19.3ns ± 0%
SqrtELEMENT       7.50µs ± 0%
MulAssignELEMENT  21.8ns ± 0%

Barretenberg

Barretenberg is a C++ library specialized for BN254 that uses inline assembly on x86-64. I used the default implementation (MULX/ADC), not the fully optimized one (MULX+ADCX+ADOX).

4-limb prime

sqr_assign clocks per operation = 38.9129
sqr_assign clocks per operation = 38.2849

sqr clocks per operation = 51.8644
sqr clocks per operation = 51.718

unary minus clocks per operation = 22.4781
unary minus clocks per operation = 22.4975

static mul assign clocks per operation = 7.50531
static mul assign clocks per operation = 7.43059
static mul assign clocks per operation = 7.45982

mul assign clocks per operation = 40.5682
mul assign clocks per operation = 40.793

mul clocks per operation = 54.6559
mul clocks per operation = 54.6534

self add clocks per operation = 7.35761
self add clocks per operation = 7.35797
self add clocks per operation = 7.36054

add clocks per operation = 20.6533
add clocks per operation = 20.6609

sub clocks per operation = 22.5185
sub clocks per operation = 22.5621

inversion clocks per call = 15030.8

(mul assign, sqr assign, and self add should be the in-place versions)

MCL

As far as I know, MCL is the fastest open-source pairing library.

It has two backends: one with assembly generated from LLVM's native wide integers (i256, i384, i448, i512, ...) and one using a JIT assembler.

LLVM wide-integer

6-limb prime (BLS12-381). The LLVM IR source code is something like https://github.com/herumi/mie/blob/eab53850/src/t.ll, generated from this template https://github.com/herumi/mie/blob/eab53850/src/mul.txt#L48-L80 with wider types.

Fp::add         21.75 clk
Fp::sub         18.96 clk
Fp::neg          8.85 clk
Fp::mul        126.98 clk
Fp::sqr        126.88 clk
Fp::inv         64.019Kclk

JIT

6-limb prime (BLS12-381). The JIT uses MULX+ADCX+ADOX on my machine.

Fp::add         20.83 clk
Fp::sub          8.66 clk
Fp::neg          8.05 clk
Fp::mul        105.89 clk
Fp::sqr         98.45 clk
Fp::inv         63.970Kclk

Suggestion

It might be interesting to add an LLVM IR target besides the C, Go, and Rust ones. This target could use LLVM's wide integer types (AFAIK 2^22 bits is the maximum, so even i3072 or i4096 are supported) and avoid C's limitations.
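For comparison, the closest thing plain C offers to those wide integer types is GCC/Clang's unsigned __int128. A minimal, hypothetical sketch of a 256-bit add built on top of it, which Clang typically lowers to an add/adc chain much like the i256 add an LLVM IR target would get natively:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* 256-bit value as two 128-bit halves, little-endian */
typedef struct { u128 lo, hi; } u256;

static u256 add_u256(u256 a, u256 b) {
  u256 r;
  r.lo = a.lo + b.lo;
  r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low half */
  return r;
}
```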

andres-erbsen commented 4 years ago

I mean that paper, but the next table (for p256). I totally should have given you the link, but I had the on-a-mobile-device excuse. Anyway, there are no surprises there -- just yet another observation that C compilers have inconsistent performance with carries.

Relying on an external implementation of large integers (e.g. from clang) is an interesting idea. It does seem reasonable to expect that llvm would do a good job compiling the big-integer operations it specifically supports, so we could get a non-trivial speedup in saturated arithmetic. We also have been trying to keep fiat-crypto from relying on external arithmetic code, but the speedup seems worth the complexity for users who already rely on llvm. It's also great that you linked us to the template -- I think it is a very useful starting point for replicating the same strategy in fiat-crypto. Now, there is only the question of who will implement it in their copious free time -- and right now it will not be me, as I am busy with bedrock2. Thank you very much for the analysis and suggestion, though -- now we have an actionable direction for improving carry performance.

Is https://github.com/mratsim/constantine/blob/master/constantine/arithmetic/montgomery.nim#L121 the code whose performance fiat-crypto should be able to match while outputting C?

mratsim commented 4 years ago

Is https://github.com/mratsim/constantine/blob/master/constantine/arithmetic/montgomery.nim#L121 the code whose performance fiat-crypto should be able to match while outputting C?

This one is the generic Montgomery multiplication; it is used for secp256k1 and all primes (NIST primes, ...) where the last word has its most significant bit set (i.e. the representation is 0b1111...).

If that most significant bit is not set (i.e. the representation is at most 0b0111...), I use a no-carry version just below: https://github.com/mratsim/constantine/blob/1958356a/constantine/arithmetic/montgomery.nim#L89-L119, as explained here: https://hackmd.io/@zkteam/modular_multiplication

On 6 limbs, the carry version is ~157 cycles and the no-carry version ~137 cycles. There is a proof in the same link if you want to add this to Fiat.
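For concreteness, the no-carry multiplication described in that write-up can be sketched in C roughly like this (a hedged illustration, not Constantine's Nim code; 4x64-bit little-endian limbs and GCC/Clang unsigned __int128 assumed):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* 1 if a >= b, limbs little-endian */
static int geq4(const uint64_t a[4], const uint64_t b[4]) {
  for (int i = 3; i >= 0; --i)
    if (a[i] != b[i]) return a[i] > b[i];
  return 1;
}

/* r = a - b (callers ensure a >= b) */
static void sub4(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) {
  uint64_t borrow = 0;
  for (int i = 0; i < 4; ++i) {
    u128 d = (u128)a[i] - b[i] - borrow;
    r[i] = (uint64_t)d;
    borrow = (uint64_t)(d >> 64) & 1;
  }
}

/* r = a*b*R^-1 mod p with R = 2^256, m0inv = -p^-1 mod 2^64.
   Requires p's top word to have its most significant bit clear. */
static void mont_mul_nocarry(uint64_t r[4], const uint64_t a[4],
                             const uint64_t b[4], const uint64_t p[4],
                             uint64_t m0inv) {
  uint64_t t[4] = {0, 0, 0, 0};
  for (int i = 0; i < 4; ++i) {
    /* t += a[i]*b interleaved with the reduction by m*p. Because the
       top bit of p is clear, the multiply carry C and reduction carry
       C2 never overflow when summed: no extra carry word needed. */
    u128 acc = (u128)a[i] * b[0] + t[0];
    uint64_t C = (uint64_t)(acc >> 64);
    t[0] = (uint64_t)acc;
    uint64_t m = t[0] * m0inv;
    acc = (u128)m * p[0] + t[0];
    uint64_t C2 = (uint64_t)(acc >> 64);
    for (int j = 1; j < 4; ++j) {
      acc = (u128)a[i] * b[j] + t[j] + C;
      C = (uint64_t)(acc >> 64);
      t[j] = (uint64_t)acc;
      acc = (u128)m * p[j] + t[j] + C2;
      C2 = (uint64_t)(acc >> 64);
      t[j - 1] = (uint64_t)acc;
    }
    t[3] = C + C2;
  }
  if (geq4(t, p)) sub4(t, t, p);  /* final conditional subtraction */
  for (int i = 0; i < 4; ++i) r[i] = t[i];
}
```

A round trip through Montgomery form (multiply by R^2 mod p, then by 1) returns the original value, which is a convenient self-check.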

muladd1 and muladd2 are just uint128 code: https://github.com/mratsim/constantine/blob/1958356a/constantine/primitives/extended_precision_64bit_uint128.nim#L57-L99
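Assuming muladd1 computes (hi, lo) = a*b + c and muladd2 computes (hi, lo) = a*b + c1 + c2 (both of which fit in 128 bits even at the 64-bit maxima), the uint128 version looks roughly like:

```c
#include <stdint.h>

/* Hedged sketch of the muladd primitives linked above, using
   GCC/Clang unsigned __int128; not Constantine's actual Nim code. */

/* (hi, lo) = a*b + c; max value (2^64-1)^2 + 2^64-1 < 2^128 */
static void muladd1(uint64_t *hi, uint64_t *lo,
                    uint64_t a, uint64_t b, uint64_t c) {
  unsigned __int128 t = (unsigned __int128)a * b + c;
  *lo = (uint64_t)t;
  *hi = (uint64_t)(t >> 64);
}

/* (hi, lo) = a*b + c1 + c2; max value is exactly 2^128 - 1 */
static void muladd2(uint64_t *hi, uint64_t *lo,
                    uint64_t a, uint64_t b, uint64_t c1, uint64_t c2) {
  unsigned __int128 t = (unsigned __int128)a * b + c1 + c2;
  *lo = (uint64_t)t;
  *hi = (uint64_t)(t >> 64);
}
```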