Open hanno-becker opened 3 days ago
Just to give a snapshot: Currently (https://github.com/pq-code-package/mlkem-c-aarch64/commit/e8d3246ff1f01a5eaacd23fa5873dbd0a7c58c64), I get the following on my Raptor Lake:
INFO > Benchmark
INFO > make CROSS_PREFIX= bench CYCLES=PERF OPT=1 AUTO=1
INFO > ./test/build/mlkem512/bin/bench_kyber512
INFO > ML-KEM-512
INFO >
keypair cycles = 24303
encaps cycles = 31453
decaps cycles = 34247
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 23778 23885 24023 24127 24215 24303 24401 24465 24611 24887 26488
encaps percentiles: 30917 31028 31176 31283 31373 31453 31538 31635 31831 32168 33534
decaps percentiles: 33699 33824 33942 34036 34161 34247 34320 34416 34572 34982 36435
INFO > ./test/build/mlkem768/bin/bench_kyber768
INFO > ML-KEM-768
INFO >
keypair cycles = 42976
encaps cycles = 47622
decaps cycles = 51491
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 42315 42683 42798 42862 42913 42976 43053 43154 43286 43686 45460
encaps percentiles: 47034 47323 47420 47485 47542 47622 47681 47833 48000 48495 50133
decaps percentiles: 50916 51164 51278 51353 51412 51491 51568 51671 51824 52348 53910
INFO > ./test/build/mlkem1024/bin/bench_kyber1024
INFO > ML-KEM-1024
INFO >
keypair cycles = 62550
encaps cycles = 73023
decaps cycles = 78353
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 60867 61466 61767 62056 62329 62550 62771 63073 63350 63960 65614
encaps percentiles: 71320 71820 72151 72464 72699 73023 73223 73496 73825 74315 76631
decaps percentiles: 76730 77213 77591 77857 78058 78353 78567 78853 79173 79836 82058
While I get this for the Kyber repo (using the same compiler -- gcc 13.2.0):
# ML-KEM-512
kyber_keypair_derand:
median: 25028 cycles/ticks
average: 25072 cycles/ticks
kyber_keypair:
median: 27038 cycles/ticks
average: 27188 cycles/ticks
kyber_encaps_derand:
median: 27322 cycles/ticks
average: 27358 cycles/ticks
kyber_encaps:
median: 28602 cycles/ticks
average: 28521 cycles/ticks
kyber_decaps:
median: 30138 cycles/ticks
average: 30166 cycles/ticks
# ML-KEM-768
kyber_keypair_derand:
median: 41366 cycles/ticks
average: 41511 cycles/ticks
kyber_keypair:
median: 44376 cycles/ticks
average: 44653 cycles/ticks
kyber_encaps_derand:
median: 42050 cycles/ticks
average: 42136 cycles/ticks
kyber_encaps:
median: 43562 cycles/ticks
average: 43721 cycles/ticks
kyber_decaps:
median: 46922 cycles/ticks
average: 47014 cycles/ticks
# ML-KEM-1024
kyber_keypair_derand:
median: 58046 cycles/ticks
average: 58158 cycles/ticks
kyber_keypair:
median: 62010 cycles/ticks
average: 62433 cycles/ticks
kyber_encaps_derand:
median: 60766 cycles/ticks
average: 60901 cycles/ticks
kyber_encaps:
median: 62212 cycles/ticks
average: 62254 cycles/ticks
kyber_decaps:
median: 66882 cycles/ticks
average: 66990 cycles/ticks
Note that with gcc 14.2.1 I get much better performance for the Kyber repo.
# ML-KEM-768
kyber_keypair_derand:
median: 37936 cycles/ticks
average: 38033 cycles/ticks
kyber_keypair:
median: 40764 cycles/ticks
average: 41413 cycles/ticks
kyber_encaps_derand:
median: 38540 cycles/ticks
average: 38624 cycles/ticks
kyber_encaps:
median: 39800 cycles/ticks
average: 39949 cycles/ticks
kyber_decaps:
median: 43072 cycles/ticks
average: 43188 cycles/ticks
Thanks @mkannwischer, that doesn't look so bad for gcc13. 41.5k vs 43k for keygen seems fine, but it would be good to understand the larger gap for encaps and decaps.
@mkannwischer What's the performance of this repo when compiled with gcc 14.2.1?
Have you compared compile flags with the Kyber repo?
Performance is indeed highly compiler-dependent. Some more tests on c7i:
(venv) :~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=clang-18 CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
keypair cycles = 30418
encaps cycles = 34533
decaps cycles = 37735
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 30198 30272 30312 30353 30382 30418 30454 30497 30560 30741 31777
encaps percentiles: 34228 34326 34387 34430 34482 34533 34588 34637 34715 35115 35949
decaps percentiles: 37156 37289 37376 37486 37596 37735 37853 37983 38098 38323 39353
(venv) :~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=gcc-14 CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
keypair cycles = 31730
encaps cycles = 37722
decaps cycles = 41285
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 31013 31214 31342 31456 31602 31730 31879 32033 32268 32563 33592
encaps percentiles: 37033 37186 37276 37424 37557 37722 37996 38312 38697 39052 40037
decaps percentiles: 40484 40631 40776 40963 41138 41285 41484 41724 42017 42422 43388
(venv) ~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=gcc CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
keypair cycles = 32157
encaps cycles = 38525
decaps cycles = 42091
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 31380 31499 31652 31790 31942 32157 32576 32868 33351 33774 34998
encaps percentiles: 37577 37739 37898 38056 38255 38525 38815 39154 39547 39979 41279
decaps percentiles: 41170 41336 41527 41670 41850 42091 42398 42776 43114 43579 44869
(venv) ~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument" CC=gcc CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
keypair cycles = 43887
encaps cycles = 49830
decaps cycles = 53621
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 42850 43234 43468 43629 43758 43887 44033 44198 44488 44954 46840
encaps percentiles: 48801 49132 49352 49562 49696 49830 49988 50224 50543 51088 53141
decaps percentiles: 52527 52922 53164 53363 53500 53621 53763 53957 54331 54899 57025
The performance of clang-18 is amazing.
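To make the compiler and flag comparison above easier to repeat, here is a minimal sketch that parses the `... cycles = N` lines of the bench output and reports how much slower one run is than another. The function names `parse_bench` and `slowdown` are hypothetical helpers, not part of this repo, and the sketch assumes the output format shown in the logs above (where the reported value coincides with the 50th percentile).

```python
import re

# Matches lines such as "keypair cycles = 30418" from the bench output.
CYCLES_RE = re.compile(r"^(keypair|encaps|decaps) cycles = (\d+)$")


def parse_bench(output: str) -> dict[str, int]:
    """Return {operation: reported cycle count} for one bench run."""
    result = {}
    for line in output.splitlines():
        match = CYCLES_RE.match(line.strip())
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


def slowdown(fast: dict[str, int], slow: dict[str, int]) -> dict[str, float]:
    """Ratio slow/fast per operation (1.0 means no difference)."""
    return {op: slow[op] / fast[op] for op in fast}


# ML-KEM-768 numbers copied from the c7i runs above (CC=gcc).
GCC_NATIVE = """\
keypair cycles = 32157
encaps cycles = 38525
decaps cycles = 42091
"""
GCC_DEFAULT = """\
keypair cycles = 43887
encaps cycles = 49830
decaps cycles = 53621
"""

if __name__ == "__main__":
    ratios = slowdown(parse_bench(GCC_NATIVE), parse_bench(GCC_DEFAULT))
    for op, ratio in ratios.items():
        print(f"{op}: {ratio:.2f}x slower without -march=native -mtune=native")
```

The same two helpers can be fed the clang-18 and gcc-14 outputs to compare compilers instead of flags.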
Note: Since #245, it's easier to test different compilers, e.g. by running CC=COMPILER tests bench -r -c PERF etc.
This repository reuses various AVX2 intrinsics and assembly routines from the official Kyber implementation. In other places however -- e.g. key generation or SHAKE -- we deliberately don't follow the Kyber implementation, but keep the code simpler and close to the reference implementation.
It is likely (and in my mind acceptable) that this simplicity comes at a slight loss of performance. Yet, we should still strive to be within 5-10% of the performance of the Kyber implementation on common x86_64 systems.
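As a rough check against that target, the following sketch computes the relative overhead of this repo over the Kyber repo from the gcc 13.2.0 Raptor Lake medians reported in this thread. It assumes that the Kyber repo's `_derand` medians are the right baseline (it is not confirmed here which variant this repo's bench measures), and the names `overhead_percent` and `check_target` are hypothetical.

```python
TARGET_PERCENT = 10.0  # upper end of the 5-10% goal

# Median cycles, gcc 13.2.0, Raptor Lake, as posted in this thread.
THIS_REPO = {
    "ML-KEM-512": {"keypair": 24303, "encaps": 31453, "decaps": 34247},
    "ML-KEM-768": {"keypair": 42976, "encaps": 47622, "decaps": 51491},
    "ML-KEM-1024": {"keypair": 62550, "encaps": 73023, "decaps": 78353},
}
# Assumption: kyber_keypair_derand / kyber_encaps_derand / kyber_decaps medians.
KYBER_REPO = {
    "ML-KEM-512": {"keypair": 25028, "encaps": 27322, "decaps": 30138},
    "ML-KEM-768": {"keypair": 41366, "encaps": 42050, "decaps": 46922},
    "ML-KEM-1024": {"keypair": 58046, "encaps": 60766, "decaps": 66882},
}


def overhead_percent(ours: int, theirs: int) -> float:
    """Overhead of `ours` relative to `theirs` in percent (negative = faster)."""
    return 100.0 * (ours - theirs) / theirs


def check_target(target: float = TARGET_PERCENT):
    """Return a list of (parameter set, operation, overhead %, within target)."""
    rows = []
    for param_set, ops in THIS_REPO.items():
        for op, ours in ops.items():
            pct = overhead_percent(ours, KYBER_REPO[param_set][op])
            rows.append((param_set, op, pct, pct <= target))
    return rows


if __name__ == "__main__":
    for param_set, op, pct, ok in check_target():
        status = "ok" if ok else "over target"
        print(f"{param_set:12s} {op:8s} {pct:+6.1f}%  {status}")
```

With these inputs, key generation stays inside the target for all three parameter sets, while encapsulation exceeds 10% everywhere, which matches the observation above that the encaps and decaps gap deserves a closer look.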