Open dstndstn opened 4 years ago
Oh, but we do have gcc v9 on this machine, and that seems to work.
Thanks for the report! Glad you have a workaround, but we should get gcc 5 to work anyway. I don't have a gcc 5 environment immediately available, but let me see if I can drum one up... Feel free to leave the issue open in the meantime.
For the record,
$ gcc --version gcc (Ubuntu 9.2.1-17ubuntu1~16.04) 9.2.1 20191102 $ pip install --no-cache-dir Corrfunc
succeeds.
Or update the docs to note this requirement :) Thanks!
Probably related to #183
@dstndstn Could you share the output of gcc -march=native -dM -E - < /dev/null | grep AVX512
with your gcc 5 compiler?
My best guess is that the CPU supports AVX512VL but not the gcc 5 compiler, so we're falling back to the AVX2 version of AVX512_BLEND_INTS_WITH_MASK
. But when we do so, we're trying to use a non-immediate mask operand, which, according to that compiler error, is unsupported.
On the other hand, the gcc 5 release notes suggest it should support AVX512VL. So the preprocessor output will be interesting!
$ gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
$ gcc --version
gcc (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
Thanks. So indeed, despite the CPU (and supposedly the compiler??) supporting AVX512VL, the compiler is not trying to use it.
@manodeep Maybe there's a way to detect this scenario manually and pass -mavx512vl
to the compiler in addition to -march=native
. But regardless, I think the AVX2 int blend fallback will not work as written, because it uses non-immediate operands. So we may need to remove that fallback in favor of non-vectorized code, or refactor the part of the kernel that uses it.
Interesting! Looks like both gcc5.5.0
and gcc6.4.0
do not enable AVX512VL with -march=native
[~ @farnarkle2] ml --force purge && ml gcc/5.5.0
[~ @farnarkle2] gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
[~ @farnarkle2] ml --force purge && ml gcc/6.4.0
[~ @farnarkle2] gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
[~ @farnarkle2] ml --force purge && ml gcc/7.3.0
[~ @farnarkle2] gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512VL__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
@dstndstn If you add CFLAGS=-mavx512vl
at the very top of common.mk
, does the code then compile with gcc5.5
?
@lgarrison Perhaps we should print out an info message warning users on AVX512-systems with gcc < 7.3
.
@manodeep If we can detect the scenario, it seems we should just fix it ourselves my adding the -mavx512vl
flag. The flag is mentioned in the release notes: https://gcc.gnu.org/gcc-5/changes.html
But not in the corresponding doc page: https://gcc.gnu.org/onlinedocs/gcc-5.5.0/gcc/x86-Options.html#x86-Options
My guess it is it's supported, just not automatically added with the -march=native
flag. Does gcc 5 allow it for you on farnakle?
Regardless, I think the AVX2 fallback has a real bug.
If I do (with gcc 5.5.0)
export CFLAGS=-mavx512vl
make
it gets further but dies with
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" countpairs_rp_pi_impl.c.src >> countpairs_rp_pi_impl_float.c
gcc -DNDOUBLE_PREC -mavx512vl -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3 -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4 -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2 -Wpacked -Wnested-externs -Wpointer-arith -Wredundant-decls -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wno-unused-local-typedefs -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3 -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4 -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2 -Wpacked -Wnested-externs -Wpointer-arith -Wredundant-decls -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wstrict-prototypes -Wno-unused-local-typedefs -I../../io -I../../utils -c countpairs_rp_pi_impl_float.c -o countpairs_rp_pi_impl_float.o
In file included from ../../utils/weight_functions_float.h:12:0,
from countpairs_rp_pi_kernels_float.c:23,
from countpairs_rp_pi_impl_float.c:22:
countpairs_rp_pi_kernels_float.c: In function ‘countpairs_rp_pi_avx512_intrinsics_float’:
../../utils/avx512_calls.h:175:46: warning: implicit declaration of function ‘_mm512_abs_ps’ [-Wimplicit-function-declaration]
#define AVX512_ABS_FLOAT(X) _mm512_abs_ps(X)
^
countpairs_rp_pi_kernels_float.c:210:23: note: in expansion of macro ‘AVX512_ABS_FLOAT’
m_zdiff = AVX512_ABS_FLOAT(m_zdiff);//now take the absolute value
^
In file included from countpairs_rp_pi_impl_float.c:22:0:
countpairs_rp_pi_kernels_float.c:210:13: warning: nested extern declaration of ‘_mm512_abs_ps’ [-Wnested-externs]
m_zdiff = AVX512_ABS_FLOAT(m_zdiff);//now take the absolute value
^
countpairs_rp_pi_kernels_float.c:210:21: error: incompatible types when assigning to type ‘__m512 {aka __vector(16) float}’ from type ‘int’
m_zdiff = AVX512_ABS_FLOAT(m_zdiff);//now take the absolute value
^
../../rules.mk:64: recipe for target 'countpairs_rp_pi_impl_float.o' failed
Thanks. I think that one is probably this bug: https://stackoverflow.com/questions/51290930/why-does-gcc-not-provide-the-avx512f-intrinsic-for-absolute-value-of-floats?rq=1
Looks like we can work around this one as well. That's two workarounds for gcc 5... but gcc 5.5.0 is only 3 years old, not sure it makes sense to drop support!
gcc
is terrible when it comes to intrinsics. For instance, we already have a non-standard AVX_ABS_FLOAT
in avx_calls.h
here. It is trivial to do the same for AVX512_ABS_FLOAT
or better yet, directly take the implementation from the gcc patch listed in the link above.
I am wondering if we should report these two bugs (missing intrinsics, missing __AVX512VL__
under -march=native
on SKX cpus) to gcc
.
@lgarrison I agree with you - we should be able to support compilers that are only 3 years old.
After looking through gcc docs (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#x86-Options), apparently -march=skylake
does NOT add any AVX512 instructions. For that, we need -march=skylake-avx512
; and that target architecture is not available on gcc5 (only gcc6 or higher). Since we are seeing quite a few of the AVX512 isa's enabled by -march=native
, I am uncertain what -march-native
is translating to on a SKX cpu (something in between skylake
and skylake-avx512
).
I think there's 3 things to do here:
1) Detect if the CPU and compiler support -mavx512vl
but __AVX512VL__
is not present in the preprocessor output for -march=native
. If so, add -mavx512vl
to CFLAGS
.
2) Provide _mm512_abs_ps()
if it is not present (needs more thought, how do we detect if this is defined? Does the system library define a macro? Or should we always just use our own implementation?)
3) Remove the AVX2 fallback, If I'm reading the error messages correctly, I don't think it will work under any circumstances.
@manodeep, does this all sound right to you?
This will require a bit of thinking so that the complexity does not go up (and that we are not implementing configure
like capabilities into Corrfunc
). The original reason to only target AVX512F
was to make sure that Corrfunc runs on both SKX+ and KNL+ hardware. However, the code currently fails to compile on KNL for this same issue (#183)
For the abs_ps
, icc/clang/gcc >= 7.3
always provide the intrinsics (I think), and therefore we could simply use a custom one for lower gcc versions.
The AVX2 definitely is incorrect and needs to be replaced with _mm512_mask_blend_pd
+ some no-op casts from _m512i -> _m512d -> _m512i
to keep the compiler happy. (Related, I am not sure we need _m256i
- we could just use _m512i
and the remaining 4 bins are all identically set to 0. We might just need to grab the lower half of the zmm
register or however the array indexing works...). The upside could be that we fix the KNL build failure and manage to run Corrfunc on that
General information
Issue description
Compile error
Expected behavior
Actual behavior
What have you tried so far?
System information: