hanno-becker commented 1 month ago

This repository reuses various AVX2 intrinsics and assembly routines from the official Kyber implementation. In other places however -- e.g. key generation or SHAKE -- we deliberately don't follow the Kyber implementation, but keep the code simpler and close to the reference implementation.

It is likely (and in my mind acceptable) that this simplicity comes at a slight loss of performance. Yet, we should still strive to be within 5-10% performance of the Kyber implementation on common x86_64 systems.

mkannwischer commented 1 month ago

Just to give a snapshot: Currently (https://github.com/pq-code-package/mlkem-c-aarch64/commit/e8d3246ff1f01a5eaacd23fa5873dbd0a7c58c64), I get the following on my Raptor Lake:

INFO  > Benchmark
INFO  > make CROSS_PREFIX= bench CYCLES=PERF OPT=1 AUTO=1
INFO  > ./test/build/mlkem512/bin/bench_kyber512
INFO  > ML-KEM-512
INFO  > 
   keypair cycles = 24303
    encaps cycles = 31453
    decaps cycles = 34247

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  23778  23885  24023  24127  24215  24303  24401  24465  24611  24887  26488
    encaps percentiles:  30917  31028  31176  31283  31373  31453  31538  31635  31831  32168  33534
    decaps percentiles:  33699  33824  33942  34036  34161  34247  34320  34416  34572  34982  36435

INFO  > ./test/build/mlkem768/bin/bench_kyber768
INFO  > ML-KEM-768
INFO  > 
   keypair cycles = 42976
    encaps cycles = 47622
    decaps cycles = 51491

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  42315  42683  42798  42862  42913  42976  43053  43154  43286  43686  45460
    encaps percentiles:  47034  47323  47420  47485  47542  47622  47681  47833  48000  48495  50133
    decaps percentiles:  50916  51164  51278  51353  51412  51491  51568  51671  51824  52348  53910

INFO  > ./test/build/mlkem1024/bin/bench_kyber1024
INFO  > ML-KEM-1024
INFO  > 
   keypair cycles = 62550
    encaps cycles = 73023
    decaps cycles = 78353

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  60867  61466  61767  62056  62329  62550  62771  63073  63350  63960  65614
    encaps percentiles:  71320  71820  72151  72464  72699  73023  73223  73496  73825  74315  76631
    decaps percentiles:  76730  77213  77591  77857  78058  78353  78567  78853  79173  79836  82058

While I get this for the Kyber repo (using the same compiler -- gcc 13.2.0):

# ML-KEM-512
kyber_keypair_derand: 
median: 25028 cycles/ticks
average: 25072 cycles/ticks

kyber_keypair: 
median: 27038 cycles/ticks
average: 27188 cycles/ticks

kyber_encaps_derand: 
median: 27322 cycles/ticks
average: 27358 cycles/ticks

kyber_encaps: 
median: 28602 cycles/ticks
average: 28521 cycles/ticks

kyber_decaps: 
median: 30138 cycles/ticks
average: 30166 cycles/ticks

# ML-KEM-768
kyber_keypair_derand: 
median: 41366 cycles/ticks
average: 41511 cycles/ticks

kyber_keypair: 
median: 44376 cycles/ticks
average: 44653 cycles/ticks

kyber_encaps_derand: 
median: 42050 cycles/ticks
average: 42136 cycles/ticks

kyber_encaps: 
median: 43562 cycles/ticks
average: 43721 cycles/ticks

kyber_decaps: 
median: 46922 cycles/ticks
average: 47014 cycles/ticks

# ML-KEM-1024

kyber_keypair_derand: 
median: 58046 cycles/ticks
average: 58158 cycles/ticks

kyber_keypair: 
median: 62010 cycles/ticks
average: 62433 cycles/ticks

kyber_encaps_derand: 
median: 60766 cycles/ticks
average: 60901 cycles/ticks

kyber_encaps: 
median: 62212 cycles/ticks
average: 62254 cycles/ticks

kyber_decaps: 
median: 66882 cycles/ticks
average: 66990 cycles/ticks

mkannwischer commented 1 month ago

Note that with gcc 14.2.1 I get much better performance for the Kyber repo.

# ML-KEM-768 
kyber_keypair_derand: 
median: 37936 cycles/ticks
average: 38033 cycles/ticks

kyber_keypair: 
median: 40764 cycles/ticks
average: 41413 cycles/ticks

kyber_encaps_derand: 
median: 38540 cycles/ticks
average: 38624 cycles/ticks

kyber_encaps: 
median: 39800 cycles/ticks
average: 39949 cycles/ticks

kyber_decaps: 
median: 43072 cycles/ticks
average: 43188 cycles/ticks

hanno-becker commented 1 month ago

Thanks @mkannwischer, that doesn't look so bad for gcc13. 41.5k vs 43k for keygen seems fine, but it would be good to understand the larger gap for encaps and decaps.

hanno-becker commented 1 month ago

@mkannwischer What's the performance of this repo when compiled with gcc 14.2.1?

Have you compared compile flags with the Kyber repo?

hanno-becker commented 1 month ago

Performance is indeed highly compiler-dependent. Some more tests on c7i:

(venv) :~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=clang-18 CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
   keypair cycles = 30418
    encaps cycles = 34533
    decaps cycles = 37735

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  30198  30272  30312  30353  30382  30418  30454  30497  30560  30741  31777
    encaps percentiles:  34228  34326  34387  34430  34482  34533  34588  34637  34715  35115  35949
    decaps percentiles:  37156  37289  37376  37486  37596  37735  37853  37983  38098  38323  39353
(venv) :~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=gcc-14 CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
   keypair cycles = 31730
    encaps cycles = 37722
    decaps cycles = 41285

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  31013  31214  31342  31456  31602  31730  31879  32033  32268  32563  33592
    encaps percentiles:  37033  37186  37276  37424  37557  37722  37996  38312  38697  39052  40037
    decaps percentiles:  40484  40631  40776  40963  41138  41285  41484  41724  42017  42422  43388
(venv) ~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument -march=native -mtune=native" CC=gcc CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
   keypair cycles = 32157
    encaps cycles = 38525
    decaps cycles = 42091

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  31380  31499  31652  31790  31942  32157  32576  32868  33351  33774  34998
    encaps percentiles:  37577  37739  37898  38056  38255  38525  38815  39154  39547  39979  41279
    decaps percentiles:  41170  41336  41527  41670  41850  42091  42398  42776  43114  43579  44869
(venv) ~/mlkem-c-aarch64$ make clean && CFLAGS="-Wno-unused-command-line-argument" CC=gcc CYCLES=PERF make bench -j8 >/dev/null && sudo ./test/build/mlkem768/bin/bench_kyber768
rm -f -rf *.gcno *.gcda *.lcov *.o *.so
rm -f -rf test/build
   keypair cycles = 43887
    encaps cycles = 49830
    decaps cycles = 53621

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  42850  43234  43468  43629  43758  43887  44033  44198  44488  44954  46840
    encaps percentiles:  48801  49132  49352  49562  49696  49830  49988  50224  50543  51088  53141
    decaps percentiles:  52527  52922  53164  53363  53500  53621  53763  53957  54331  54899  57025

hanno-becker commented 1 month ago

The performance of clang-18 is amazing.

hanno-becker commented 1 month ago

Note: Since #245 it's easier to test different compilers, by passing CC=COMPILER tests bench -r -c PERF etc.

hanno-becker commented 3 weeks ago

@mkannwischer Can you take care of obtaining an apples to apples comparison between this repo and the official Kyber implementation (in terms of platform, compiler, options, but also version of the standard implemented)?

mkannwischer commented 3 weeks ago

Yes, I can take care of that later this week.

mkannwischer commented 2 weeks ago

The following functions are written in intrinics in the kyber repo, but we are using the reference code:

poly{,vec}_compress + poly{,vec}_decompress
poly_tomsg + poly_frommsg
cbd
verify
cmov
gen_matrix
Keccak absorb + squeeze

mkannwischer commented 2 weeks ago

I've done some more benchmarks on my 13th Gen Intel i7-1360P (Raptor Lake) with TurboBoost and Hyperthreading disabled. I'm using gcc 14.2.1 from the Arch Linux repo. For the official AVX2 I added the input validation check_pk and check_sk that are not yet part of the official code -- that latter is quite costly as it involves hashing the public key.

We are currently still up to 22% slower than the state of the art (for 1024 encaps)

I tracked down a large portion of the difference to polyvec_compress + polyvec_decompress which is implemented in AVX2 intrinsics in the official code, but the reference C code in our code. I hacked that code into our code here and the resulting performance is within 12% of the state of the art.

Detailed benchmarks are below, but here is the summary:

part	Our code	Kyber repo	Our code (+polyvec_{,de}compress)
512 kg	22083	22348	23413
512 enc	28612	24868	27760
512 dec	36365	34984	34259
768 kg	39361	38070	39594
768 enc	44688	39056	41950
768 dec	55931	53726	51364
1024 kg	61090	53532	59475
1024 enc	68907	56698	63636
1024 dec	84417	75874	76427

Our code cd3b9de368c6367b0c91abcaa5abeefd54bbba9f

$ ./scripts/tests bench --cycles PERF --cflags="-march=native -mtune=native"
INFO  > Benchmark
INFO  > CFLAGS=-march=native -mtune=native make CROSS_PREFIX= bench CYCLES=PERF OPT=1 AUTO=1
INFO  > ./test/build/mlkem512/bin/bench_mlkem512
INFO  > ML-KEM-512
INFO  > 
   keypair cycles = 22083
    encaps cycles = 28612
    decaps cycles = 36365

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  21709  21808  21871  21919  21991  22083  22207  22337  22537  22776  24339
    encaps percentiles:  28043  28146  28240  28317  28456  28612  28755  28888  29073  29338  30824
    decaps percentiles:  35672  35879  36009  36123  36259  36365  36504  36683  36864  37105  38688

INFO  > ./test/build/mlkem768/bin/bench_mlkem768
INFO  > ML-KEM-768
INFO  > 
   keypair cycles = 39361
    encaps cycles = 44688
    decaps cycles = 55931

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  38523  38734  38854  39008  39154  39361  39628  39821  40188  40841  42343
    encaps percentiles:  43788  44015  44154  44330  44495  44688  44921  45163  45524  46234  47786
    decaps percentiles:  55056  55289  55430  55585  55740  55931  56170  56397  56709  57490  59079

INFO  > ./test/build/mlkem1024/bin/bench_mlkem1024
INFO  > ML-KEM-1024
INFO  > 
   keypair cycles = 61090
    encaps cycles = 68907
    decaps cycles = 84417

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  58949  59640  60072  60374  60716  61090  61462  61895  62366  63060  64967
    encaps percentiles:  66868  67596  67960  68330  68629  68907  69181  69691  70216  70860  72781
    decaps percentiles:  82240  83102  83526  83825  84127  84417  84766  85202  85805  86488  88258`

Kyber repo (with added check_pk, check_sk) 10b478fc3cc4ff6215eb0b6a11bd758bf0929cbd

MLKEM512
kyber_keypair_derand: 
median: 22348 cycles/ticks
average: 22429 cycles/ticks

kyber_keypair: 
median: 24632 cycles/ticks
average: 24753 cycles/ticks

kyber_encaps_derand: 
median: 24868 cycles/ticks
average: 24944 cycles/ticks

kyber_encaps: 
median: 26348 cycles/ticks
average: 26378 cycles/ticks

kyber_decaps: 
median: 34984 cycles/ticks
average: 35027 cycles/ticks

MLKEM768
kyber_keypair_derand: 
median: 38070 cycles/ticks
average: 38143 cycles/ticks

kyber_keypair: 
median: 41074 cycles/ticks
average: 41328 cycles/ticks

kyber_encaps_derand: 
median: 39056 cycles/ticks
average: 39170 cycles/ticks

kyber_encaps: 
median: 40390 cycles/ticks
average: 40463 cycles/ticks

kyber_decaps: 
median: 53726 cycles/ticks
average: 53797 cycles/ticks

MLKEM1024 
kyber_keypair_derand: 
median: 53532 cycles/ticks
average: 53671 cycles/ticks

kyber_keypair: 
median: 57230 cycles/ticks
average: 57599 cycles/ticks

kyber_encaps_derand: 
median: 56698 cycles/ticks
average: 56835 cycles/ticks

kyber_encaps: 
median: 58246 cycles/ticks
average: 58360 cycles/ticks

kyber_decaps: 
median: 75874 cycles/ticks
average: 76011 cycles/ticks

Our code (with added AVX2 polyvec_compress + polyvec_decompress)

$ ./scripts/tests bench --cycles PERF --cflags="-march=native -mtune=native"
INFO  > Benchmark
INFO  > CFLAGS=-march=native -mtune=native make CROSS_PREFIX= bench CYCLES=PERF AUTO=1 OPT=1
INFO  > ./test/build/mlkem512/bin/bench_mlkem512
INFO  > ML-KEM-512
INFO  > 
   keypair cycles = 23413
    encaps cycles = 27760
    decaps cycles = 34259

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  22257  22695  22915  23069  23244  23413  23551  23702  23875  24248  25594
    encaps percentiles:  26728  27085  27242  27424  27581  27760  27879  28095  28414  28899  30232
    decaps percentiles:  33240  33584  33769  33928  34075  34259  34383  34599  34868  35259  36533

INFO  > ./test/build/mlkem768/bin/bench_mlkem768
INFO  > ML-KEM-768
INFO  > 
   keypair cycles = 39594
    encaps cycles = 41950
    decaps cycles = 51364

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  38697  38987  39138  39277  39454  39594  39839  40132  40642  41337  43056
    encaps percentiles:  41130  41400  41541  41644  41803  41950  42218  42654  43140  43835  46026
    decaps percentiles:  50457  50745  50869  51009  51139  51364  51652  52007  52493  53194  55418

INFO  > ./test/build/mlkem1024/bin/bench_mlkem1024
INFO  > ML-KEM-1024
INFO  > 
   keypair cycles = 59475
    encaps cycles = 63636
    decaps cycles = 76427

           percentile      1     10     20     30     40     50     60     70     80     90     99
   keypair percentiles:  57882  58329  58697  58955  59221  59475  59782  60125  60642  61473  63339
    encaps percentiles:  62080  62572  62809  63091  63360  63636  63925  64336  64732  65432  67883
    decaps percentiles:  74899  75347  75650  75931  76155  76427  76763  77141  77671  78477  80486

pq-code-package / mlkem-native

Achieve competitive performance on AVX2 compared to official Kyber implementation #224

Our code cd3b9de368c6367b0c91abcaa5abeefd54bbba9f

Kyber repo (with added check_pk, check_sk) 10b478fc3cc4ff6215eb0b6a11bd758bf0929cbd

Our code (with added AVX2 polyvec_compress + polyvec_decompress)