Achieve competitive performance on AVX2 compared to official Kyber implementation

hanno-becker commented 3 days ago

This repository reuses various AVX2 intrinsics and assembly routines from the official Kyber implementation. In other places, however -- e.g. key generation or SHAKE -- we deliberately don't follow the Kyber implementation, and instead keep the code simpler and closer to the reference implementation.

It is likely (and in my mind acceptable) that this simplicity comes at a slight loss of performance. Still, we should strive to stay within 5-10% of the Kyber implementation's performance on common x86_64 systems.
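
To make the target concrete, here is a minimal sketch of how the check could be expressed. The helper names `overhead_pct` and `within_target`, and the choice of the upper bound (10%) as the default threshold, are assumptions for illustration only; nothing like this exists in the repository.

```python
def overhead_pct(ours: int, reference: int) -> float:
    """Extra cycles spent relative to the reference, in percent."""
    if reference <= 0:
        raise ValueError("reference cycle count must be positive")
    return 100.0 * (ours - reference) / reference


def within_target(ours: int, reference: int, max_overhead_pct: float = 10.0) -> bool:
    """True if `ours` is at most `max_overhead_pct` percent slower than `reference`."""
    return overhead_pct(ours, reference) <= max_overhead_pct


if __name__ == "__main__":
    # Hypothetical cycle counts, chosen only to show the arithmetic.
    print(within_target(105_000, 100_000))  # 5% slower
    print(within_target(115_000, 100_000))  # 15% slower
```

A negative overhead simply means this repository is faster than the reference, which also passes the check.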

mkannwischer commented 2 days ago

Just to give a snapshot: Currently (https://github.com/pq-code-package/mlkem-c-aarch64/commit/e8d3246ff1f01a5eaacd23fa5873dbd0a7c58c64), I get the following on my Raptor Lake:

INFO  > Benchmark
INFO  > make CROSS_PREFIX= bench CYCLES=PERF OPT=1 AUTO=1
INFO  > ./test/build/mlkem512/bin/bench_kyber512
INFO  > ML-KEM-512
INFO  > 
   keypair cycles = 24303
    encaps cycles = 31453
    decaps cycles = 34247

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  23778  23885  24023  24127  24215  24303  24401  24465  24611  24887  26488
    encaps percentiles:  30917  31028  31176  31283  31373  31453  31538  31635  31831  32168  33534
    decaps percentiles:  33699  33824  33942  34036  34161  34247  34320  34416  34572  34982  36435

INFO  > ./test/build/mlkem768/bin/bench_kyber768
INFO  > ML-KEM-768
INFO  > 
   keypair cycles = 42976
    encaps cycles = 47622
    decaps cycles = 51491

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  42315  42683  42798  42862  42913  42976  43053  43154  43286  43686  45460
    encaps percentiles:  47034  47323  47420  47485  47542  47622  47681  47833  48000  48495  50133
    decaps percentiles:  50916  51164  51278  51353  51412  51491  51568  51671  51824  52348  53910

INFO  > ./test/build/mlkem1024/bin/bench_kyber1024
INFO  > ML-KEM-1024
INFO  > 
   keypair cycles = 62550
    encaps cycles = 73023
    decaps cycles = 78353

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  60867  61466  61767  62056  62329  62550  62771  63073  63350  63960  65614
    encaps percentiles:  71320  71820  72151  72464  72699  73023  73223  73496  73825  74315  76631
    decaps percentiles:  76730  77213  77591  77857  78058  78353  78567  78853  79173  79836  82058

While I get this for the Kyber repo (using the same compiler -- gcc 13.2.0):

# ML-KEM-512
kyber_keypair_derand: 
median: 25028 cycles/ticks
average: 25072 cycles/ticks

kyber_keypair: 
median: 27038 cycles/ticks
average: 27188 cycles/ticks

kyber_encaps_derand: 
median: 27322 cycles/ticks
average: 27358 cycles/ticks

kyber_encaps: 
median: 28602 cycles/ticks
average: 28521 cycles/ticks

kyber_decaps: 
median: 30138 cycles/ticks
average: 30166 cycles/ticks

# ML-KEM-768
kyber_keypair_derand: 
median: 41366 cycles/ticks
average: 41511 cycles/ticks

kyber_keypair: 
median: 44376 cycles/ticks
average: 44653 cycles/ticks

kyber_encaps_derand: 
median: 42050 cycles/ticks
average: 42136 cycles/ticks

kyber_encaps: 
median: 43562 cycles/ticks
average: 43721 cycles/ticks

kyber_decaps: 
median: 46922 cycles/ticks
average: 47014 cycles/ticks

# ML-KEM-1024

kyber_keypair_derand: 
median: 58046 cycles/ticks
average: 58158 cycles/ticks

kyber_keypair: 
median: 62010 cycles/ticks
average: 62433 cycles/ticks

kyber_encaps_derand: 
median: 60766 cycles/ticks
average: 60901 cycles/ticks

kyber_encaps: 
median: 62212 cycles/ticks
average: 62254 cycles/ticks

kyber_decaps: 
median: 66882 cycles/ticks
average: 66990 cycles/ticks

mkannwischer commented 2 days ago

Note that with gcc 14.2.1 I get much better performance for the Kyber repo.

# ML-KEM-768 
kyber_keypair_derand: 
median: 37936 cycles/ticks
average: 38033 cycles/ticks

kyber_keypair: 
median: 40764 cycles/ticks
average: 41413 cycles/ticks

kyber_encaps_derand: 
median: 38540 cycles/ticks
average: 38624 cycles/ticks

kyber_encaps: 
median: 39800 cycles/ticks
average: 39949 cycles/ticks

kyber_decaps: 
median: 43072 cycles/ticks
average: 43188 cycles/ticks

hanno-becker commented 2 days ago

Thanks @mkannwischer, that doesn't look so bad for gcc13. 41.5k vs 43k for keygen seems fine, but it would be good to understand the larger gap for encaps and decaps.
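
For reference, the gaps can be computed directly from the ML-KEM-768 medians posted above for gcc 13.2.0. This is only a sketch: it assumes the `*_derand` figures are the right comparison points for keypair and encaps (as the 41.5k keygen figure suggests), and that the two benchmark harnesses measure comparable things.

```python
# ML-KEM-768 median cycle counts quoted earlier in this thread (gcc 13.2.0).
THIS_REPO = {"keypair": 42976, "encaps": 47622, "decaps": 51491}

# Assumption: compare against kyber_keypair_derand / kyber_encaps_derand /
# kyber_decaps from the Kyber repo output.
KYBER_REPO = {"keypair": 41366, "encaps": 42050, "decaps": 46922}


def gap_pct(ours: int, reference: int) -> float:
    """How much slower `ours` is than `reference`, in percent."""
    return 100.0 * (ours - reference) / reference


GAPS = {op: gap_pct(THIS_REPO[op], KYBER_REPO[op]) for op in THIS_REPO}

if __name__ == "__main__":
    for op, gap in GAPS.items():
        print(f"{op:>8}: {gap:+.1f}% vs Kyber repo")
```

Under that assumption the gap is about 4% for keypair, about 13% for encaps and about 10% for decaps.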

hanno-becker commented 2 days ago

@mkannwischer What's the performance of this repo when compiled with gcc 14.2.1?

Have you compared compile flags with the Kyber repo?

hanno-becker commented 1 day ago

Performance is indeed highly compiler-dependent. Some more tests on c7i:

(venv) :~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=clang-18 CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
   keypair cycles = 30418
    encaps cycles = 34533
    decaps cycles = 37735

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  30198  30272  30312  30353  30382  30418  30454  30497  30560  30741  31777
    encaps percentiles:  34228  34326  34387  34430  34482  34533  34588  34637  34715  35115  35949
    decaps percentiles:  37156  37289  37376  37486  37596  37735  37853  37983  38098  38323  39353
(venv) :~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=gcc-14 CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
   keypair cycles = 31730
    encaps cycles = 37722
    decaps cycles = 41285

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  31013  31214  31342  31456  31602  31730  31879  32033  32268  32563  33592
    encaps percentiles:  37033  37186  37276  37424  37557  37722  37996  38312  38697  39052  40037
    decaps percentiles:  40484  40631  40776  40963  41138  41285  41484  41724  42017  42422  43388
(venv) ~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=gcc CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
   keypair cycles = 32157
    encaps cycles = 38525
    decaps cycles = 42091

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  31380  31499  31652  31790  31942  32157  32576  32868  33351  33774  34998
    encaps percentiles:  37577  37739  37898  38056  38255  38525  38815  39154  39547  39979  41279
    decaps percentiles:  41170  41336  41527  41670  41850  42091  42398  42776  43114  43579  44869
(venv) ~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument" CC=gcc CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
   keypair cycles = 43887
    encaps cycles = 49830
    decaps cycles = 53621

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  42850  43234  43468  43629  43758  43887  44033  44198  44488  44954  46840
    encaps percentiles:  48801  49132  49352  49562  49696  49830  49988  50224  50543  51088  53141
    decaps percentiles:  52527  52922  53164  53363  53500  53621  53763  53957  54331  54899  57025
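
To compare runs like these without reading the numbers off by eye, the median lines can be parsed with a few lines of Python. This is a sketch that assumes the `<op> cycles = <n>` format shown above stays stable; `parse_bench` and `reduction_pct` are hypothetical helper names, not part of the repository's scripts.

```python
import re
from typing import Dict

# Matches the median lines, e.g. "   keypair cycles = 43887".
# The percentile lines do not contain "cycles =" and are skipped.
LINE = re.compile(r"^\s*(keypair|encaps|decaps) cycles = (\d+)\s*$")


def parse_bench(output: str) -> Dict[str, int]:
    """Extract the median cycle count per operation from bench output."""
    result = {}
    for line in output.splitlines():
        match = LINE.match(line)
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


def reduction_pct(before: int, after: int) -> float:
    """Percentage of cycles saved going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Copied from the two CC=gcc runs above: without and with
# -march=native -mtune=native.
GCC_DEFAULT = """
   keypair cycles = 43887
    encaps cycles = 49830
    decaps cycles = 53621
"""
GCC_NATIVE = """
   keypair cycles = 32157
    encaps cycles = 38525
    decaps cycles = 42091
"""

BASE = parse_bench(GCC_DEFAULT)
NATIVE = parse_bench(GCC_NATIVE)
SAVED = {op: reduction_pct(BASE[op], NATIVE[op]) for op in BASE}

if __name__ == "__main__":
    for op, pct in SAVED.items():
        print(f"{op:>8}: {pct:.1f}% fewer cycles with -march=native -mtune=native")
```

On the numbers above, adding `-march=native -mtune=native` to the default gcc build saves roughly 21-27% of the cycles, depending on the operation.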

hanno-becker commented 1 day ago

The performance of clang-18 is amazing.

hanno-becker commented 1 day ago

Note: Since #245 it's easier to test different compilers, e.g. by running CC=COMPILER tests bench -r -c PERF.

pq-code-package / mlkem-c-aarch64, issue #224