Open hanno-becker opened 1 month ago
Just to give a snapshot: Currently (https://github.com/pq-code-package/mlkem-c-aarch64/commit/e8d3246ff1f01a5eaacd23fa5873dbd0a7c58c64), I get the following on my Raptor Lake:
INFO > Benchmark
INFO > make CROSS_PREFIX= bench CYCLES=PERF OPT=1 AUTO=1
INFO > ./test/build/mlkem512/bin/bench_kyber512
INFO > ML-KEM-512
INFO >
keypair cycles = 24303
encaps cycles = 31453
decaps cycles = 34247
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 23778 23885 24023 24127 24215 24303 24401 24465 24611 24887 26488
encaps percentiles: 30917 31028 31176 31283 31373 31453 31538 31635 31831 32168 33534
decaps percentiles: 33699 33824 33942 34036 34161 34247 34320 34416 34572 34982 36435
INFO > ./test/build/mlkem768/bin/bench_kyber768
INFO > ML-KEM-768
INFO >
keypair cycles = 42976
encaps cycles = 47622
decaps cycles = 51491
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 42315 42683 42798 42862 42913 42976 43053 43154 43286 43686 45460
encaps percentiles: 47034 47323 47420 47485 47542 47622 47681 47833 48000 48495 50133
decaps percentiles: 50916 51164 51278 51353 51412 51491 51568 51671 51824 52348 53910
INFO > ./test/build/mlkem1024/bin/bench_kyber1024
INFO > ML-KEM-1024
INFO >
keypair cycles = 62550
encaps cycles = 73023
decaps cycles = 78353
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 60867 61466 61767 62056 62329 62550 62771 63073 63350 63960 65614
encaps percentiles: 71320 71820 72151 72464 72699 73023 73223 73496 73825 74315 76631
decaps percentiles: 76730 77213 77591 77857 78058 78353 78567 78853 79173 79836 82058
While I get this for the Kyber repo (using the same compiler -- gcc 13.2.0):
# ML-KEM-512
kyber_keypair_derand:
median: 25028 cycles/ticks
average: 25072 cycles/ticks
kyber_keypair:
median: 27038 cycles/ticks
average: 27188 cycles/ticks
kyber_encaps_derand:
median: 27322 cycles/ticks
average: 27358 cycles/ticks
kyber_encaps:
median: 28602 cycles/ticks
average: 28521 cycles/ticks
kyber_decaps:
median: 30138 cycles/ticks
average: 30166 cycles/ticks
# ML-KEM-768
kyber_keypair_derand:
median: 41366 cycles/ticks
average: 41511 cycles/ticks
kyber_keypair:
median: 44376 cycles/ticks
average: 44653 cycles/ticks
kyber_encaps_derand:
median: 42050 cycles/ticks
average: 42136 cycles/ticks
kyber_encaps:
median: 43562 cycles/ticks
average: 43721 cycles/ticks
kyber_decaps:
median: 46922 cycles/ticks
average: 47014 cycles/ticks
# ML-KEM-1024
kyber_keypair_derand:
median: 58046 cycles/ticks
average: 58158 cycles/ticks
kyber_keypair:
median: 62010 cycles/ticks
average: 62433 cycles/ticks
kyber_encaps_derand:
median: 60766 cycles/ticks
average: 60901 cycles/ticks
kyber_encaps:
median: 62212 cycles/ticks
average: 62254 cycles/ticks
kyber_decaps:
median: 66882 cycles/ticks
average: 66990 cycles/ticks
Note that with gcc 14.2.1 I get much better performance for the Kyber repo.
# ML-KEM-768
kyber_keypair_derand:
median: 37936 cycles/ticks
average: 38033 cycles/ticks
kyber_keypair:
median: 40764 cycles/ticks
average: 41413 cycles/ticks
kyber_encaps_derand:
median: 38540 cycles/ticks
average: 38624 cycles/ticks
kyber_encaps:
median: 39800 cycles/ticks
average: 39949 cycles/ticks
kyber_decaps:
median: 43072 cycles/ticks
average: 43188 cycles/ticks
Thanks @mkannwischer, that doesn't look so bad for gcc13. 41.5k vs 43k for keygen seems fine, but it would be good to understand the larger gap for encaps and decaps.
@mkannwischer What's the performance of this repo when compiled with gcc 14.2.1
?
Have you compared compile flags with the Kyber repo?
Performance is indeed highly compiler-dependent. Some more tests on c7i
:
(venv) :~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=clang-18 CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
keypair cycles = 30418
encaps cycles = 34533
decaps cycles = 37735
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 30198 30272 30312 30353 30382 30418 30454 30497 30560 30741 31777
encaps percentiles: 34228 34326 34387 34430 34482 34533 34588 34637 34715 35115 35949
decaps percentiles: 37156 37289 37376 37486 37596 37735 37853 37983 38098 38323 39353
(venv) :~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=gcc-14 CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
keypair cycles = 31730
encaps cycles = 37722
decaps cycles = 41285
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 31013 31214 31342 31456 31602 31730 31879 32033 32268 32563 33592
encaps percentiles: 37033 37186 37276 37424 37557 37722 37996 38312 38697 39052 40037
decaps percentiles: 40484 40631 40776 40963 41138 41285 41484 41724 42017 42422 43388
(venv) ~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=gcc CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
keypair cycles = 32157
encaps cycles = 38525
decaps cycles = 42091
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 31380 31499 31652 31790 31942 32157 32576 32868 33351 33774 34998
encaps percentiles: 37577 37739 37898 38056 38255 38525 38815 39154 39547 39979 41279
decaps percentiles: 41170 41336 41527 41670 41850 42091 42398 42776 43114 43579 44869
(venv) ~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument" CC=gcc CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
keypair cycles = 43887
encaps cycles = 49830
decaps cycles = 53621
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 42850 43234 43468 43629 43758 43887 44033 44198 44488 44954 46840
encaps percentiles: 48801 49132 49352 49562 49696 49830 49988 50224 50543 51088 53141
decaps percentiles: 52527 52922 53164 53363 53500 53621 53763 53957 54331 54899 57025
The performance of clang-18
is amazing.
Note: Since #245 it's easier to test different compilers, by passing CC=COMPILER tests bench -r -c PERF
etc.
@mkannwischer Can you take care of obtaining an apples to apples comparison between this repo and the official Kyber implementation (in terms of platform, compiler, options, but also version of the standard implemented)?
Yes, I can take care of that later this week.
The following functions are written in intrinics in the kyber repo, but we are using the reference code:
I've done some more benchmarks on my 13th Gen Intel i7-1360P (Raptor Lake) with TurboBoost and Hyperthreading disabled.
I'm using gcc 14.2.1 from the Arch Linux repo.
For the official AVX2 I added the input validation check_pk
and check_sk
that are not yet part of the official code -- that latter is quite costly as it involves hashing the public key.
We are currently still up to 22% slower than the state of the art (for 1024 encaps)
I tracked down a large portion of the difference to polyvec_compress + polyvec_decompress which is implemented in AVX2 intrinsics in the official code, but the reference C code in our code. I hacked that code into our code here and the resulting performance is within 12% of the state of the art.
Detailed benchmarks are below, but here is the summary:
part | Our code | Kyber repo | Our code (+polyvec_{,de}compress) |
---|---|---|---|
512 kg | 22083 | 22348 | 23413 |
512 enc | 28612 | 24868 | 27760 |
512 dec | 36365 | 34984 | 34259 |
768 kg | 39361 | 38070 | 39594 |
768 enc | 44688 | 39056 | 41950 |
768 dec | 55931 | 53726 | 51364 |
1024 kg | 61090 | 53532 | 59475 |
1024 enc | 68907 | 56698 | 63636 |
1024 dec | 84417 | 75874 | 76427 |
$ ./scripts/tests bench --cycles PERF --cflags="-march=native -mtune=native"
INFO > Benchmark
INFO > CFLAGS=-march=native -mtune=native make CROSS_PREFIX= bench CYCLES=PERF OPT=1 AUTO=1
INFO > ./test/build/mlkem512/bin/bench_mlkem512
INFO > ML-KEM-512
INFO >
keypair cycles = 22083
encaps cycles = 28612
decaps cycles = 36365
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 21709 21808 21871 21919 21991 22083 22207 22337 22537 22776 24339
encaps percentiles: 28043 28146 28240 28317 28456 28612 28755 28888 29073 29338 30824
decaps percentiles: 35672 35879 36009 36123 36259 36365 36504 36683 36864 37105 38688
INFO > ./test/build/mlkem768/bin/bench_mlkem768
INFO > ML-KEM-768
INFO >
keypair cycles = 39361
encaps cycles = 44688
decaps cycles = 55931
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 38523 38734 38854 39008 39154 39361 39628 39821 40188 40841 42343
encaps percentiles: 43788 44015 44154 44330 44495 44688 44921 45163 45524 46234 47786
decaps percentiles: 55056 55289 55430 55585 55740 55931 56170 56397 56709 57490 59079
INFO > ./test/build/mlkem1024/bin/bench_mlkem1024
INFO > ML-KEM-1024
INFO >
keypair cycles = 61090
encaps cycles = 68907
decaps cycles = 84417
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 58949 59640 60072 60374 60716 61090 61462 61895 62366 63060 64967
encaps percentiles: 66868 67596 67960 68330 68629 68907 69181 69691 70216 70860 72781
decaps percentiles: 82240 83102 83526 83825 84127 84417 84766 85202 85805 86488 88258`
MLKEM512
kyber_keypair_derand:
median: 22348 cycles/ticks
average: 22429 cycles/ticks
kyber_keypair:
median: 24632 cycles/ticks
average: 24753 cycles/ticks
kyber_encaps_derand:
median: 24868 cycles/ticks
average: 24944 cycles/ticks
kyber_encaps:
median: 26348 cycles/ticks
average: 26378 cycles/ticks
kyber_decaps:
median: 34984 cycles/ticks
average: 35027 cycles/ticks
MLKEM768
kyber_keypair_derand:
median: 38070 cycles/ticks
average: 38143 cycles/ticks
kyber_keypair:
median: 41074 cycles/ticks
average: 41328 cycles/ticks
kyber_encaps_derand:
median: 39056 cycles/ticks
average: 39170 cycles/ticks
kyber_encaps:
median: 40390 cycles/ticks
average: 40463 cycles/ticks
kyber_decaps:
median: 53726 cycles/ticks
average: 53797 cycles/ticks
MLKEM1024
kyber_keypair_derand:
median: 53532 cycles/ticks
average: 53671 cycles/ticks
kyber_keypair:
median: 57230 cycles/ticks
average: 57599 cycles/ticks
kyber_encaps_derand:
median: 56698 cycles/ticks
average: 56835 cycles/ticks
kyber_encaps:
median: 58246 cycles/ticks
average: 58360 cycles/ticks
kyber_decaps:
median: 75874 cycles/ticks
average: 76011 cycles/ticks
$ ./scripts/tests bench --cycles PERF --cflags="-march=native -mtune=native"
INFO > Benchmark
INFO > CFLAGS=-march=native -mtune=native make CROSS_PREFIX= bench CYCLES=PERF AUTO=1 OPT=1
INFO > ./test/build/mlkem512/bin/bench_mlkem512
INFO > ML-KEM-512
INFO >
keypair cycles = 23413
encaps cycles = 27760
decaps cycles = 34259
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 22257 22695 22915 23069 23244 23413 23551 23702 23875 24248 25594
encaps percentiles: 26728 27085 27242 27424 27581 27760 27879 28095 28414 28899 30232
decaps percentiles: 33240 33584 33769 33928 34075 34259 34383 34599 34868 35259 36533
INFO > ./test/build/mlkem768/bin/bench_mlkem768
INFO > ML-KEM-768
INFO >
keypair cycles = 39594
encaps cycles = 41950
decaps cycles = 51364
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 38697 38987 39138 39277 39454 39594 39839 40132 40642 41337 43056
encaps percentiles: 41130 41400 41541 41644 41803 41950 42218 42654 43140 43835 46026
decaps percentiles: 50457 50745 50869 51009 51139 51364 51652 52007 52493 53194 55418
INFO > ./test/build/mlkem1024/bin/bench_mlkem1024
INFO > ML-KEM-1024
INFO >
keypair cycles = 59475
encaps cycles = 63636
decaps cycles = 76427
percentile 1 10 20 30 40 50 60 70 80 90 99
keypair percentiles: 57882 58329 58697 58955 59221 59475 59782 60125 60642 61473 63339
encaps percentiles: 62080 62572 62809 63091 63360 63636 63925 64336 64732 65432 67883
decaps percentiles: 74899 75347 75650 75931 76155 76427 76763 77141 77671 78477 80486
This repository reuses various AVX2 intrinsics and assembly routines from the official Kyber implementation. In other places however -- e.g. key generation or SHAKE -- we deliberately don't follow the Kyber implementation, but keep the code simpler and close to the reference implementation.
It is likely (and in my mind acceptable) that this simplicity comes at a slight loss of performance. Yet, we should still strive to be within 5-10% performance of the Kyber implementation on common x86_64 systems.