marekandreas / elpa

A scalable eigensolver for dense, symmetric (hermitian) matrices (fork of https://gitlab.mpcdf.mpg.de/elpa/elpa.git)
Other
25 stars 11 forks source link

Undefined variable _XOR_EPI breaks builds with AVX512 on non Intel CPUs #39

Closed maxim-masterov closed 8 months ago

maxim-masterov commented 11 months ago

Hi, I found this issue when I was building ELPA on AMD zen4 architecture with AVX512. The build broke with multiple errors like:

src/elpa2/kernels/complex_128bit_256bit_512bit_BLOCK_template.c:2574:37: warning: implicit declaration of function '_XOR_EPI' [-Wimplicit-function-declaration]
 2574 |         h1_real = (__SIMD_DATATYPE) _XOR_EPI((__m512i) h1_real, (__m512i) sign);
      |                                     ^~~~~~~~
src/elpa2/kernels/complex_128bit_256bit_512bit_BLOCK_template.c:2574:9: error: cannot convert a value of type 'int' to vector type '__vector(8) double' which has different size
 2574 |         h1_real = (__SIMD_DATATYPE) _XOR_EPI((__m512i) h1_real, (__m512i) sign);

According to 'configure.ac', either 'HAVE_AVX512_XEON' or 'HAVE_AVX512_XEON_PHI' macro should automatically be defined if CPU has an AVX512 support. This works on Intel CPUs, however, on AMD CPUs the configure script ill-defines HAVE_AVX512_XEON_PHI and leaves HAVE_AVX512_XEON undefined. This leads to execution of the code path guarded by the #ifdef HAVE_AVX512_XEON_PHI directive, in which an undefined macro _XOR_EPI is used. I'm sure that the same error will also appear on Xeon-Phi as the logic similar.

If I understood correctly, the _XOR_EPI macro should actually be replaced with _SIMD_XOR_EPI that correctly defines _mm512_xor_epi64 for AVX512 registers.

I'm attaching the patch with the fix to this issue (created from ELPA-2023.05.001): ELPA-2023.05.001_fix_AVX512_support.patch

terminationshock commented 11 months ago

Hi @maxim-masterov ,

on which AMD CPU are you trying to compile ELPA and which compiler do you use? We do not see this issue on AMD EPYC 9654 with the Intel compiler.

maxim-masterov commented 11 months ago

Hi @terminationshock , The CPU is AMD EPYC 9654 96-Core Processor and the compiler I use is GCC-11.3.0.

The error occurs because GCC11 doesn't support zen4 architecture. As a result, the evaluation of existence of AVX512 instructions by the configure.ac script results in the following output:

checking whether we compile for Xeon... no
checking whether we compile for Xeon PHI... yes

This leads to execution of this branch in the configure.ac script: https://github.com/marekandreas/elpa/blob/c394aedd3adaa998a88dd4a09eabe53913fb5dea/configure.ac#L2430-L2435 which defines the HAVE_AVX512_XEON_PHI variable. As a result, the following part of the source code is getting executed: https://github.com/marekandreas/elpa/blob/c394aedd3adaa998a88dd4a09eabe53913fb5dea/src/elpa2/kernels/complex_128bit_256bit_512bit_BLOCK_template.c#L2574-L2575

The _XOR_EPI macro is undefined (or I haven't found a place with its definition). So, the above snippet of code is getting converted to a casting of an integer into a packed double instead of invocation of the XOR operation.

maxim-masterov commented 11 months ago

I think something happened between the two releases 2020.05.001 and 2020.11.001

In 2020.05.001 the XOR operation looks correct: https://github.com/marekandreas/elpa/blob/907b0333414272b4d99d3c136055a4e458542dda/src/elpa2/kernels/complex_128bit_256bit_512bit_BLOCK_template.c#L1729-L1732 whereas in 2020.11.001 the _SIMD prefix is removed: https://github.com/marekandreas/elpa/blob/e676706751a10e2748c37a81f9711bc0a2a06c8c/src/elpa2/kernels/complex_128bit_256bit_512bit_BLOCK_template.c#L2573-L2576

terminationshock commented 11 months ago

I tested compiling with different march settings with the GCC 13.1 compiler (the one I had easily available on that AMD system): I can indeed reproduce the problem with -march=knl and -march=knm. However, I do not see it with -march=skylake-avx512 or (of course) -march=znver4. @maxim-masterov could you please tell us what you set for -march in your case? Could you try with -march=skylake-avx512 with your GCC 11.3?

maxim-masterov commented 11 months ago

I used -march=znver2. I can try -march=skylake-avx512, but I don't have access to Genoa CPUs at the moment. Should get it back next week. I will report as soon as I get some results.

Although, -march=skylake-avx512 (I'm sure) will help building the code, it won't resolve the problem with an undefined macro, right?

terminationshock commented 11 months ago

Yes, we will look at the macro anyway.

However, I am a bit puzzled that the configure detects AVX-512 instructions anyhow with -march=znver2. In the gcc 11 manpage, it does not list AVX-512 for this flag:

znver2
               AMD Family 17h core based CPUs with x86-64 instruction set support. (This supersets BMI, BMI2, CLWB, F16C, FMA, FSGSBASE, AVX, AVX2, ADCX, RDSEED, MWAITX, SHA, CLZERO, AES, PCLMUL, CX16,
               MOVBE, MMX, SSE, SSE2, SSE3, SSE4A, SSSE3, SSE4.1, SSE4.2, ABM, XSAVEC, XSAVES, CLFLUSHOPT, POPCNT, RDPID, WBNOINVD, and 64-bit instruction set extensions.)
maxim-masterov commented 11 months ago

@terminationshock Just to confirm, the -march=skylake-avx512 indeed allowed the build to finish