(Rust binding) Repeated invocation of EltwiseFMAModAVX512 (with different data) in loop has unexpected performance regression

Janmajayamall commented 1 year ago

I am writing Rust bindings for HEXL here. I have added support for NTT operations and some elwise operations. However, I am running into issues with elwise operations when the prime (i.e. q) is set to 50 bits. To see what's wrong, you can clone the repository and run cargo bench modulus/elwise_fma_mod. This runs the benches in benches/modulus.rs with the prefix elwise_fma_mod, which use EltwiseFMAModAVX512 internally, and produces output like the following:

modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=1
                        time:   [40.942 µs 40.978 µs 41.017 µs]

modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=3
                        time:   [122.65 µs 122.72 µs 122.80 µs]

modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=5
                        time:   [205.28 µs 205.52 µs 205.76 µs]

modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=15
                        time:   [616.00 µs 616.57 µs 617.19 µs]

modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=1
                        time:   [9.6013 µs 9.6061 µs 9.6115 µs]

modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=3
                        time:   [27.549 µs 27.647 µs 27.770 µs]

modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=5
                        time:   [81.501 µs 81.550 µs 81.607 µs]

modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=15
                        time:   [284.54 µs 287.81 µs 291.38 µs]

I have reduced the output to only the necessary items: bench name and time.

The bench modulus/elwise_fma_mod_2d/* benches this function. The function simply takes two 2-dimensional (row-major) matrices, r0 and r1, and a scalar, and calls elwise_fma_mod row-wise. elwise_fma_mod internally calls EltwiseFMAModAVX512 here.
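For readers without the repository at hand, the row-wise pattern described above can be sketched roughly as follows. This is a scalar reference only, not the binding's code: `fma_mod_row` and `fma_mod_2d` are hypothetical names, the arithmetic uses `u128` instead of calling HEXL, and a single modulus `q` shared by all rows is an assumption made for brevity.

```rust
/// Scalar reference for one row: out[i] = (a[i] * scalar + b[i]) mod q.
/// Hypothetical stand-in for the binding's elwise_fma_mod; it does not call HEXL.
fn fma_mod_row(a: &[u64], scalar: u64, b: &[u64], q: u64) -> Vec<u64> {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| ((x as u128 * scalar as u128 + y as u128) % q as u128) as u64)
        .collect()
}

/// Applies fma_mod_row to each row of two row-major matrices with n columns.
/// Assumption: every row uses the same modulus q.
fn fma_mod_2d(r0: &[u64], r1: &[u64], scalar: u64, q: u64, n: usize) -> Vec<u64> {
    assert_eq!(r0.len(), r1.len());
    assert_eq!(r0.len() % n, 0);
    r0.chunks_exact(n)
        .zip(r1.chunks_exact(n))
        .flat_map(|(a, b)| fma_mod_row(a, scalar, b, q))
        .collect()
}

fn main() {
    // Two rows of four elements each, stored row-major.
    let r0: Vec<u64> = (0..8).collect();
    let r1: Vec<u64> = (10..18).collect();
    let out = fma_mod_2d(&r0, &r1, 5, 97, 4);
    println!("{:?}", out);
}
```

Each row is processed independently, so the work grows linearly with the number of rows; that is the basis for the scaling expectation discussed next.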

n is the row size, fixed at 32768. logq is the number of bits in the prime, and mod_size is the number of rows in the matrix. For example, modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=1 calls elwise_fma_mod once (since it has only 1 row) with a 60-bit prime and vector size 32768, and modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=3 calls elwise_fma_mod three times, once for each of 3 different rows (since mod_size is 3), with the rest of the parameters unchanged. Hence we should expect the time for modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=3 to be around 3x that of modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=1. Indeed it is. The same holds for the other benches with n=32768, logq=60, and mod_size=5 or 15.

But things behave differently when logq is set to 50 bits (i.e. when EltwiseFMAModAVX512 uses IFMA instead of DQ). modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=3 takes about 3x as long as modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=1, as expected, but the same pattern does not hold when mod_size is 5 or 15 (for mod_size=5 it should be around 50µs but is 81µs, and for mod_size=15 it should be around 145µs but is 287µs). I have tried other values of mod_size, and it gets worse as mod_size increases, that is, as the number of rows increases.
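The linear-scaling expectation can be checked against the medians reported above with a few lines of arithmetic. This is a minimal sketch: the helper names `expected_us` and `slowdown` are made up for illustration, and the numbers are copied from the bench output in this comment.

```rust
/// Expected time under linear scaling: rows * single-row time.
fn expected_us(single_row_us: f64, rows: u32) -> f64 {
    single_row_us * rows as f64
}

/// Ratio observed / expected; 1.0 means perfectly linear scaling.
fn slowdown(single_row_us: f64, rows: u32, observed_us: f64) -> f64 {
    observed_us / expected_us(single_row_us, rows)
}

fn main() {
    // (logq, single-row median in µs, [(rows, observed median in µs)]),
    // copied from the bench output above.
    let runs = [
        (60, 40.978, [(3, 122.72), (5, 205.52), (15, 616.57)]),
        (50, 9.6061, [(3, 27.647), (5, 81.550), (15, 287.81)]),
    ];
    for (logq, base, points) in runs {
        for (rows, observed) in points {
            println!(
                "logq={logq} rows={rows}: expected {:.1} µs, observed {:.1} µs, ratio {:.2}",
                expected_us(base, rows),
                observed,
                slowdown(base, rows, observed)
            );
        }
    }
}
```

For logq=60 the ratio stays close to 1.0 at every size, while for logq=50 it rises to roughly 1.7 at 5 rows and roughly 2.0 at 15 rows.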

I am unable to determine what causes this for 50-bit primes. Do you have any pointers? Or is this expected with IFMA?

Thanks!

joserochh commented 1 year ago

Hello @Janmajayamall. Unfortunately, I no longer have the machines to run HEXL at full capability (using AVX512). I can tell you that modular reduction works differently depending on the BitShift variable.

Look at the functions in fma_mod that depend on BitShift here: https://github.com/intel/hexl/blob/development/hexl/eltwise/eltwise-fma-mod-avx512.cpp

BitShift is defined here: https://github.com/intel/hexl/blob/development/hexl/eltwise/eltwise-fma-mod.cpp

Would you have the same behavior using logq = 48 or 46? Just curious.

Regards, José Rojas

faberga commented 1 year ago

Hi @Janmajayamall,

You mentioned that you are trying to use the Intel Advanced Vector Extensions 512 Integer Fused Multiply Add (AVX512-IFMA52) instructions. These were introduced in the 3rd Gen Intel® Xeon® Scalable Processors (and onwards), so checking which CPU manufacturer and type you are using will be important.

AVX512-IFMA52 should only be used for primes below 50–52 bits, assuming that suffices for your computation.
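To illustrate where the 52-bit limit comes from, here is a scalar model of what the two IFMA52 operations compute in each 64-bit lane (the intrinsics `_mm512_madd52lo_epu64` and `_mm512_madd52hi_epu64`). The Rust function names are hypothetical and the code is a sketch of the instruction semantics, not of HEXL's implementation.

```rust
const MASK52: u64 = (1 << 52) - 1;

/// One lane of _mm512_madd52lo_epu64:
/// acc + low 52 bits of (b[51:0] * c[51:0]).
fn madd52lo(acc: u64, b: u64, c: u64) -> u64 {
    let prod = (b & MASK52) as u128 * (c & MASK52) as u128;
    acc.wrapping_add((prod as u64) & MASK52)
}

/// One lane of _mm512_madd52hi_epu64:
/// acc + high 52 bits of the 104-bit product b[51:0] * c[51:0].
fn madd52hi(acc: u64, b: u64, c: u64) -> u64 {
    let prod = (b & MASK52) as u128 * (c & MASK52) as u128;
    acc.wrapping_add((prod >> 52) as u64)
}

fn main() {
    // Operands below 2^52: the two halves reconstruct the full product.
    let (a, b) = ((1u64 << 50) - 27, (1u64 << 49) + 12345);
    let (lo, hi) = (madd52lo(0, a, b), madd52hi(0, a, b));
    assert_eq!(((hi as u128) << 52) | lo as u128, a as u128 * b as u128);

    // A 60-bit operand: bits 52 and above are ignored by the multiply.
    let big = (1u64 << 59) + 1;
    assert_eq!(madd52lo(0, big, 1), 1);
    println!("lo = {lo:#x}, hi = {hi:#x}");
}
```

Because the multiplier only reads the low 52 bits of each operand, inputs wider than that would need a different code path, which is consistent with the 60-bit benches using DQ instead of IFMA.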

For more information on how HEXL uses the AVX512-IFMA52, please refer to: https://www.intel.com/content/www/us/en/developer/articles/technical/introducing-intel-hexl.html

and

https://arxiv.org/pdf/2103.16400.pdf

Regards, Flavio

faberga commented 1 year ago

@Janmajayamall A description of the AVX512-IFMA52 intrinsics can be found here: https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#avx512techs=AVX512IFMA52&cats=Arithmetic

Janmajayamall commented 1 year ago

> Would you have the same behavior using logq = 48 or 46? Just curious.

modulus/elwise_fma_mod_2d/n=32768/logq=48/mod_size=1
                        time:   [9.2788 µs 9.3188 µs 9.3523 µs]

modulus/elwise_fma_mod_2d/n=32768/logq=48/mod_size=3
                        time:   [28.762 µs 28.882 µs 28.987 µs]

modulus/elwise_fma_mod_2d/n=32768/logq=48/mod_size=5
                        time:   [76.355 µs 76.662 µs 76.946 µs]

modulus/elwise_fma_mod_2d/n=32768/logq=48/mod_size=15
                        time:   [273.11 µs 276.14 µs 279.43 µs]

Yes, it behaves the same for logq=48, and I can confirm the same for logq=46.

I don't suspect that this is due to calling the code from Rust (but I will still compare by implementing the same benchmark in C++).

If I understand correctly, the line here sets the BitShift value to 52 and uses IFMA, right?

> You mentioned that you are trying to use the Intel Advanced Vector Extensions 512 Integer Fused Multiply Add (AVX512-IFMA52) instructions. These were introduced in the 3rd Gen Intel® Xeon® Scalable Processors (and onwards), so checking which CPU manufacturer and type you are using will be important.

I am using a C3 machine on GCP (4th Gen Intel Xeon Scalable processor), which supports AVX512-IFMA. I don't think there are additional configs I need to enable for HEXL, or am I missing something?

I am curious whether you have any ideas about what could cause this.

Thanks!

faberga commented 1 year ago

Hi @Janmajayamall

The 4th Gen Intel Xeon Scalable processor does support AVX512-IFMA instructions. But, just in case, assuming you are using Linux, can you check with the command "lscpu"?
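The same check can also be done from Rust by looking for the `avx512ifma` flag in `/proc/cpuinfo`, which is where the flag list on Linux comes from. This is a minimal sketch assuming a Linux x86 machine; `has_cpu_flag` is a hypothetical helper, not part of HEXL or the bindings.

```rust
use std::fs;

/// Returns true if any "flags" line in /proc/cpuinfo-style text lists `flag`
/// as a whole word.
fn has_cpu_flag(cpuinfo: &str, flag: &str) -> bool {
    cpuinfo
        .lines()
        .filter(|line| line.starts_with("flags"))
        .filter_map(|line| line.split_once(':'))
        .any(|(_, flags)| flags.split_whitespace().any(|f| f == flag))
}

fn main() {
    match fs::read_to_string("/proc/cpuinfo") {
        Ok(text) => println!("avx512ifma: {}", has_cpu_flag(&text, "avx512ifma")),
        Err(err) => eprintln!("could not read /proc/cpuinfo: {err}"),
    }
}
```

Note that the flag only shows the CPU supports the instructions; whether HEXL was built with IFMA support enabled is a separate question.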

As for how to make use of HEXL in an FHE library, I would suggest studying its integration with MS SEAL and/or OpenFHE.
