pq-code-package / mlkem-c-aarch64

ML-KEM implementation optimized for aarch64
Apache License 2.0

Gather performance data from different instance sizes #73

Open hanno-becker opened 2 weeks ago

hanno-becker commented 2 weeks ago

Relates to: https://github.com/pq-code-package/tsc/issues/75

@ryjones asks whether small Graviton instances are sufficient for benchmarking or whether larger ones are needed.

Let's collect some data and discuss here.

hanno-becker commented 2 weeks ago

t4g.small

MLKEM top-level benchmarks from MLKEM-C-AArch64 (https://github.com/pq-code-package/mlkem-c-aarch64/commit/b84f0a307f869de21a3e5299653e8f5289579bea)

Building and running in nix environment. Using perf for cycle counting.

keypair cycles=104327
encaps cycles=132266
decaps cycles=172621
CRYPTO_SECRETKEYBYTES:  1632
CRYPTO_PUBLICKEYBYTES:  800
CRYPTO_CIPHERTEXTBYTES: 768
keypair cycles=179978
encaps cycles=215560
decaps cycles=269163
CRYPTO_SECRETKEYBYTES:  2400
CRYPTO_PUBLICKEYBYTES:  1184
CRYPTO_CIPHERTEXTBYTES: 1088
keypair cycles=276589
encaps cycles=317553
decaps cycles=382802
CRYPTO_SECRETKEYBYTES:  3168
CRYPTO_PUBLICKEYBYTES:  1568
CRYPTO_CIPHERTEXTBYTES: 1568

PQAX micro-benchmarks for various NTT assembly versions (https://github.com/slothy-optimizer/pqax/commit/9f7ceedcb16197e6d92448c3dd6089326b60c738)

Building and running using nix environment from MLKEM-C-AArch64 (https://github.com/pq-code-package/mlkem-c-aarch64/commit/b84f0a307f869de21a3e5299653e8f5289579bea) as above, using perf for cycle counting.

bench ntt_kyber ntt_kyber_123_4567                                1104 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load                    1268 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store              1294 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_store                   1119 cycles 100 repeats
bench ntt_kyber ntt_kyber_1234_567                                 960 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567                               1058 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_manual_ld4                    1068 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_a55              922 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_opt_a55                         906 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_a55             959 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store_opt_a55       970 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_a55            928 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_opt_a55                       1076 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_a55             989 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_a72              860 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_opt_a72                         854 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_a72             974 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store_opt_a72       919 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_a72            837 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_opt_a72                        956 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_a72             936 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_opt_m1_firestorm                892 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_m1_firestorm    970 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store_opt_m1_firestorm 969 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_m1_firestorm     887 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_m1_firestorm   893 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_opt_m1_firestorm               977 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_m1_firestorm   1012 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_m1_icestorm      887 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_opt_m1_icestorm                 896 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_m1_icestorm    1062 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store_opt_m1_icestorm 1122 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_m1_icestorm    927 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_opt_m1_icestorm                964 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_m1_icestorm     967 cycles 100 repeats
bench ntt_kyber ntt                                                968 cycles 100 repeats
bench ntt_kyber pqclean_ntt                                        872 cycles 100 repeats
bench ntt_kyber invntt                                            1062 cycles 100 repeats
bench ntt_kyber pqclean_invntt                                     983 cycles 100 repeats

c6g.4xlarge

MLKEM top-level benchmarks from MLKEM-C-AArch64 (https://github.com/pq-code-package/mlkem-c-aarch64/commit/b84f0a307f869de21a3e5299653e8f5289579bea)

Building and running in nix environment. Using perf for cycle counting.

keypair cycles=104370
encaps cycles=132276
decaps cycles=172496
CRYPTO_SECRETKEYBYTES:  1632
CRYPTO_PUBLICKEYBYTES:  800
CRYPTO_CIPHERTEXTBYTES: 768
keypair cycles=179728
encaps cycles=215611
decaps cycles=269038
CRYPTO_SECRETKEYBYTES:  2400
CRYPTO_PUBLICKEYBYTES:  1184
CRYPTO_CIPHERTEXTBYTES: 1088
keypair cycles=276640
encaps cycles=317802
decaps cycles=383131
CRYPTO_SECRETKEYBYTES:  3168
CRYPTO_PUBLICKEYBYTES:  1568
CRYPTO_CIPHERTEXTBYTES: 1568

PQAX micro-benchmarks for various NTT assembly versions (https://github.com/slothy-optimizer/pqax/commit/9f7ceedcb16197e6d92448c3dd6089326b60c738)

Building and running using nix environment from MLKEM-C-AArch64 (https://github.com/pq-code-package/mlkem-c-aarch64/commit/b84f0a307f869de21a3e5299653e8f5289579bea) as above, using perf for cycle counting.

bench ntt_kyber ntt_kyber_123_4567                                1091 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load                    1266 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store              1292 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_store                   1118 cycles 100 repeats
bench ntt_kyber ntt_kyber_1234_567                                 960 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567                               1058 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_manual_ld4                    1068 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_a55              921 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_opt_a55                         906 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_a55             959 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store_opt_a55       969 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_a55            927 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_opt_a55                       1076 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_a55             988 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_a72              859 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_opt_a72                         853 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_a72             972 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store_opt_a72       917 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_a72            836 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_opt_a72                        956 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_a72             935 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_opt_m1_firestorm                891 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_m1_firestorm    969 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store_opt_m1_firestorm 968 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_m1_firestorm     887 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_m1_firestorm   891 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_opt_m1_firestorm               977 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_m1_firestorm   1011 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_m1_icestorm      886 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_opt_m1_icestorm                 896 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_m1_icestorm    1060 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store_opt_m1_icestorm 1123 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_m1_icestorm    925 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_opt_m1_icestorm                964 cycles 100 repeats
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_m1_icestorm     967 cycles 100 repeats
bench ntt_kyber ntt                                                967 cycles 100 repeats
bench ntt_kyber pqclean_ntt                                        872 cycles 100 repeats
bench ntt_kyber invntt                                            1062 cycles 100 repeats
bench ntt_kyber pqclean_invntt                                     982 cycles 100 repeats
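
To compare the two listings mechanically rather than by eye, the `bench` lines can be parsed with a short script. The following is a minimal sketch, assuming the line format `bench <group> <name> <cycles> cycles <repeats> repeats` as it appears in the output above; the function name `parse_bench` and the regular expression are illustrative and not part of PQAX.

```python
import re

# Assumed line format, taken from the output above:
#   bench <group> <name> <cycles> cycles <repeats> repeats
# The name is matched lazily and the whitespace before the cycle count is
# optional, because very long names can run directly into the count.
BENCH_RE = re.compile(r"^bench\s+(\S+)\s+(\S+?)\s*(\d+) cycles (\d+) repeats$")


def parse_bench(text):
    """Return {benchmark name: cycles} for every bench line in text."""
    results = {}
    for line in text.splitlines():
        m = BENCH_RE.match(line.strip())
        if m:
            _group, name, cycles, _repeats = m.groups()
            results[name] = int(cycles)
    return results


# Three lines copied from the c6g.4xlarge output, including one where the
# name and the cycle count are not separated by whitespace.
sample = """\
bench ntt_kyber ntt_kyber_123_4567                                1091 cycles 100 repeats
bench ntt_kyber ntt_kyber_1234_567                                 960 cycles 100 repeats
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store_opt_m1_icestorm1123 cycles 100 repeats
"""

parsed = parse_bench(sample)
for name, cycles in parsed.items():
    print(f"{name}: {cycles}")
```

Names that end in digits are only unambiguous when whitespace separates them from the count, which holds for every such line in the output above.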

hanno-becker commented 2 weeks ago

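Comparison of t4g.small (left) and c6g.4xlarge (right), in cycles. As expected, since everything is single-threaded, there is no meaningful performance difference: the largest gap is about 1.2% (1104 vs. 1091 cycles for ntt_kyber_123_4567), and every other pair agrees to within 0.3%.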

bench ntt_kyber ntt_kyber_123_4567                                1104 1091
bench ntt_kyber ntt_kyber_123_4567_scalar_load                    1268 1266
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store              1294 1292
bench ntt_kyber ntt_kyber_123_4567_scalar_store                   1119 1118
bench ntt_kyber ntt_kyber_1234_567                                 960  960
bench ntt_kyber intt_kyber_123_4567                               1058 1058
bench ntt_kyber intt_kyber_123_4567_manual_ld4                    1068 1068
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_a55              922  921
bench ntt_kyber ntt_kyber_123_4567_opt_a55                         906  906
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_a55             959  959
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store_opt_a55       970  969
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_a55            928  927
bench ntt_kyber intt_kyber_123_4567_opt_a55                       1076 1076
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_a55             989  988
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_a72              860  859
bench ntt_kyber ntt_kyber_123_4567_opt_a72                         854  853
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_a72             974  972
bench ntt_kyber ntt_kyber_123_4567_scalar_load_store_opt_a72       919  917
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_a72            837  836
bench ntt_kyber intt_kyber_123_4567_opt_a72                        956  956
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_a72             936  935
bench ntt_kyber ntt_kyber_123_4567_opt_m1_firestorm                892  891
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_m1_firestorm    970  969
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_m1_firestorm     887  887
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_m1_firestorm   893  891
bench ntt_kyber intt_kyber_123_4567_opt_m1_firestorm               977  977
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_m1_firestorm   1012 1011
bench ntt_kyber ntt_kyber_123_4567_manual_st4_opt_m1_icestorm      887  886
bench ntt_kyber ntt_kyber_123_4567_opt_m1_icestorm                 896  896
bench ntt_kyber ntt_kyber_123_4567_scalar_load_opt_m1_icestorm    1062 1060
bench ntt_kyber ntt_kyber_123_4567_scalar_store_opt_m1_icestorm    927  925
bench ntt_kyber intt_kyber_123_4567_opt_m1_icestorm                964  964
bench ntt_kyber intt_kyber_123_4567_manual_ld4_opt_m1_icestorm     967  967
bench ntt_kyber ntt                                                968  967
bench ntt_kyber pqclean_ntt                                        872  872
bench ntt_kyber invntt                                            1062 1062
bench ntt_kyber pqclean_invntt                                     983  982

keypair cycles=104327 104370
encaps cycles= 132266 132276
decaps cycles= 172621 172496
keypair cycles=179978 179728
encaps cycles= 215560 215611
decaps cycles= 269163 269038
keypair cycles=276589 276640
encaps cycles= 317553 317802
decaps cycles= 382802 383131
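
The size of the gap between the two instances can be quantified directly from the top-level numbers. The sketch below copies the cycle counts from the comparison above and prints the relative difference per operation. The parameter-set labels are an assumption inferred from the key and ciphertext sizes printed with the original output (800/1184/1568-byte public keys), and `rel_diff` is a hypothetical helper.

```python
# Top-level cycle counts copied from the comparison above, as
# (operation, t4g.small, c6g.4xlarge). The parameter-set labels are inferred
# from the CRYPTO_*BYTES values in the benchmark output, not stated there.
RESULTS = {
    "ML-KEM-512": [
        ("keypair", 104327, 104370),
        ("encaps", 132266, 132276),
        ("decaps", 172621, 172496),
    ],
    "ML-KEM-768": [
        ("keypair", 179978, 179728),
        ("encaps", 215560, 215611),
        ("decaps", 269163, 269038),
    ],
    "ML-KEM-1024": [
        ("keypair", 276589, 276640),
        ("encaps", 317553, 317802),
        ("decaps", 382802, 383131),
    ],
}


def rel_diff(a, b):
    """Absolute difference of a and b, relative to a."""
    return abs(a - b) / a


worst = 0.0
for param_set, rows in RESULTS.items():
    for op, small, large in rows:
        d = rel_diff(small, large)
        worst = max(worst, d)
        print(f"{param_set:11s} {op:8s} {small:7d} {large:7d} {100 * d:.3f}%")

print(f"largest relative difference: {100 * worst:.3f}%")
```

For these nine measurements the largest relative difference stays below 0.2%, which is consistent with the observation that instance size does not matter for single-threaded cycle counts.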

ryjones commented 2 weeks ago

@hanno-becker could you try this on a fork and set up the configuration the way you like it using your credentials? Once I have those credentials, I'll have an easier lift.

hanno-becker commented 2 weeks ago

@ryjones I'll look into it, though unlikely before Monday or Tuesday.