manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.
https://corrfunc.readthedocs.io
MIT License
163 stars 50 forks source link

Build failure - related to avx512? #217

Open dstndstn opened 4 years ago

dstndstn commented 4 years ago

General information

Issue description

Compile error

Expected behavior

Actual behavior

What have you tried so far?

$ git clone https://github.com/manodeep/Corrfunc/
Cloning into 'Corrfunc'...
[etc]
$ cd Corrfunc
$ make
mkdir -p lib
mkdir -p bin
mkdir -p include
make -C theory
make[1]: Entering directory '/tmp/Corrfunc/theory'

-------COMPILE SETTINGS------------
     MAKE        = ["make"]
     CC      = ["gcc"]
     OPT         = ["-DPERIODIC -DENABLE_MIN_SEP_OPT  -DCOPY_PARTICLES  -DUSE_OMP"]
     CFLAGS      = [" -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3  -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4  -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2  -Wpacked  -Wnested-externs -Wpointer-arith  -Wredundant-decls  -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes  -Wnested-externs -Wstrict-prototypes   -Wno-unused-local-typedefs "]
     CLINK       = [" -lrt  -fopenmp -lm"]
     PYTHON      = ["python"]
     GSL_CFLAGS  = ["-I/cm/shared/apps/gsl/gsl-2.5/include"]
     GSL_LINK    = ["-L/cm/shared/apps/gsl/gsl-2.5/lib -lgsl -lgslcblas -lm -Xlinker -rpath -Xlinker /cm/shared/apps/gsl/gsl-2.5/lib"]
     PYTHON_CFLAGS   = ["-isystem/cm/shared/apps/conda-environments/python36/include/python3.6m -isystem /cm/shared/apps/conda-environments/python36/lib/python3.6/site-packages/numpy/core/include/numpy/"]
-------END OF COMPILE SETTINGS------------

make -C DD
make[2]: Entering directory '/tmp/Corrfunc/theory/DD'
gcc -DPERIODIC -DENABLE_MIN_SEP_OPT  -DCOPY_PARTICLES  -DUSE_OMP  -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3  -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4  -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2  -Wpacked  -Wnested-externs -Wpointer-arith  -Wredundant-decls  -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes  -Wnested-externs -Wstrict-prototypes   -Wno-unused-local-typedefs  -I../../io -I../../utils  -c DD.c -o DD.o
gcc  -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3  -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4  -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2  -Wpacked  -Wnested-externs -Wpointer-arith  -Wredundant-decls  -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes  -Wnested-externs -Wstrict-prototypes   -Wno-unused-local-typedefs  -I../../io -I../../utils  -c ../../io/ftread.c -o ../../io/ftread.o
gcc  -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3  -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4  -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2  -Wpacked  -Wnested-externs -Wpointer-arith  -Wredundant-decls  -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes  -Wnested-externs -Wstrict-prototypes   -Wno-unused-local-typedefs  -I../../io -I../../utils  -c ../../io/io.c -o ../../io/io.o
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g"  ../../utils/weight_defs.h.src >> ../../utils/weight_defs_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g"  countpairs_impl.h.src >> countpairs_impl_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g"  ../../utils/weight_defs.h.src >> ../../utils/weight_defs_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g"  countpairs_impl.h.src >> countpairs_impl_float.h
gcc  -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3  -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4  -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2  -Wpacked  -Wnested-externs -Wpointer-arith  -Wredundant-decls  -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes  -Wnested-externs -Wstrict-prototypes   -Wno-unused-local-typedefs  -I../../io -I../../utils  -c countpairs.c -o countpairs.o
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" countpairs_kernels.c.src >> countpairs_kernels_float.c
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g"  countpairs_kernels.c.src >> countpairs_kernels_double.c
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g"  ../../utils/gridlink_impl.h.src >> ../../utils/gridlink_impl_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g"  ../../utils/gridlink_impl.h.src >> ../../utils/gridlink_impl_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g"  ../../utils/gridlink_utils.h.src >> ../../utils/gridlink_utils_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g"  ../../utils/gridlink_utils.h.src >> ../../utils/gridlink_utils_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g"  ../../utils/cellarray.h.src >> ../../utils/cellarray_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g"  ../../utils/cellarray.h.src >> ../../utils/cellarray_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g"  ../../utils/cell_pair.h.src >> ../../utils/cell_pair_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g"  ../../utils/cell_pair.h.src >> ../../utils/cell_pair_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g"  ../../utils/weight_functions.h.src >> ../../utils/weight_functions_double.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g"  ../../utils/weight_functions.h.src >> ../../utils/weight_functions_float.h
sed -e "/DOUBLE_PREC/!s/DOUBLE/double/g"  countpairs_impl.c.src >> countpairs_impl_double.c
gcc -DDOUBLE_PREC  -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3  -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4  -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2  -Wpacked  -Wnested-externs -Wpointer-arith  -Wredundant-decls  -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes  -Wnested-externs -Wstrict-prototypes   -Wno-unused-local-typedefs  -I../../io -I../../utils  -c countpairs_impl_double.c -o countpairs_impl_double.o
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:43:0,
                 from ../../utils/avx512_calls.h:16,
                 from ../../utils/weight_functions_double.h:12,
                 from countpairs_kernels_double.c:23,
                 from countpairs_impl_double.c:21:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx2intrin.h: In function ‘countpairs_avx512_intrinsics_double’:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx2intrin.h:973:20: error: the last argument must be an 8-bit immediate
   return (__m256i) __builtin_ia32_pblendd256 ((__v8si)__X,
                    ^
../../rules.mk:61: recipe for target 'countpairs_impl_double.o' failed
make[2]: *** [countpairs_impl_double.o] Error 1

System information:

$ head -n 2 /etc/os-release
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
$ make -v
GNU Make 4.1
...
$ gcc --version
gcc (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
...
$ pkg-config --modversion gsl
2.5
$ python --version
Python 3.6.0 :: Continuum Analytics, Inc.
$ which python
/cm/shared/apps/conda-environments/python36/bin/python
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
stepping        : 4
microcode       : 0x2000064
cpu MHz         : 2683.513
cache size      : 14080 KB
physical id     : 0
siblings        : 20
core id         : 0
cpu cores       : 10
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 4400.00
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

... [x40]
dstndstn commented 4 years ago

Oh, but we do have gcc v9 on this machine, and that seems to work.

lgarrison commented 4 years ago

Thanks for the report! Glad you have a workaround, but we should get gcc 5 to work anyway. I don't have a gcc 5 environment immediately available, but let me see if I can drum one up... Feel free to leave the issue open in the meantime.

dstndstn commented 4 years ago

For the record,

$ gcc --version gcc (Ubuntu 9.2.1-17ubuntu1~16.04) 9.2.1 20191102 $ pip install --no-cache-dir Corrfunc

succeeds.

dstndstn commented 4 years ago

Or update the docs to note this requirement :) Thanks!

lgarrison commented 4 years ago

Probably related to #183

lgarrison commented 4 years ago

@dstndstn Could you share the output of gcc -march=native -dM -E - < /dev/null | grep AVX512 with your gcc 5 compiler?

My best guess is that the CPU supports AVX512VL but not the gcc 5 compiler, so we're falling back to the AVX2 version of AVX512_BLEND_INTS_WITH_MASK. But when we do so, we're trying to use a non-immediate mask operand, which, according to that compiler error, is unsupported.

On the other hand, the gcc 5 release notes suggest it should support AVX512VL. So the preprocessor output will be interesting!

dstndstn commented 4 years ago
$ gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1

$ gcc --version
gcc (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
lgarrison commented 4 years ago

Thanks. So indeed, despite the CPU (and supposedly the compiler??) supporting AVX512VL, the compiler is not trying to use it.

@manodeep Maybe there's a way to detect this scenario manually and pass -mavx512vl to the compiler in addition to -march=native. But regardless, I think the AVX2 int blend fallback will not work as written, because it uses non-immediate operands. So we may need to remove that fallback in favor of non-vectorized code, or refactor the part of the kernel that uses it.

manodeep commented 4 years ago

Interesting! Looks like both gcc5.5.0 and gcc6.4.0 do not enable AVX512VL with -march=native

[~ @farnarkle2] ml --force purge && ml gcc/5.5.0
[~ @farnarkle2] gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1

[~ @farnarkle2] ml --force purge && ml gcc/6.4.0
[~ @farnarkle2] gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1

[~ @farnarkle2] ml --force purge && ml gcc/7.3.0
[~ @farnarkle2] gcc -march=native -dM -E - < /dev/null | grep AVX512
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512VL__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1

@dstndstn If you add CFLAGS=-mavx512vl at the very top of common.mk, does the code then compile with gcc5.5?

@lgarrison Perhaps we should print out an info message warning users on AVX512-systems with gcc < 7.3.

lgarrison commented 4 years ago

@manodeep If we can detect the scenario, it seems we should just fix it ourselves my adding the -mavx512vl flag. The flag is mentioned in the release notes: https://gcc.gnu.org/gcc-5/changes.html

But not in the corresponding doc page: https://gcc.gnu.org/onlinedocs/gcc-5.5.0/gcc/x86-Options.html#x86-Options

My guess it is it's supported, just not automatically added with the -march=native flag. Does gcc 5 allow it for you on farnakle?

Regardless, I think the AVX2 fallback has a real bug.

dstndstn commented 4 years ago

If I do (with gcc 5.5.0)

export CFLAGS=-mavx512vl
make

it gets further but dies with

sed -e "/DOUBLE_PREC/!s/DOUBLE/float/g" countpairs_rp_pi_impl.c.src >> countpairs_rp_pi_impl_float.c
gcc -DNDOUBLE_PREC -mavx512vl -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3  -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4  -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2  -Wpacked  -Wnested-externs -Wpointer-arith  -Wredundant-decls  -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes  -Wnested-externs -Wstrict-prototypes   -Wno-unused-local-typedefs  -DVERSION=\"2.3.3\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3  -ftree-vectorize -funroll-loops -fprefetch-loop-arrays --param simultaneous-prefetches=4  -fopenmp -funroll-loops -march=native -fno-strict-aliasing -Wformat=2  -Wpacked  -Wnested-externs -Wpointer-arith  -Wredundant-decls  -Wfloat-equal -Wcast-qual -Wcast-align -Wmissing-declarations -Wmissing-prototypes  -Wnested-externs -Wstrict-prototypes   -Wno-unused-local-typedefs  -I../../io -I../../utils  -c countpairs_rp_pi_impl_float.c -o countpairs_rp_pi_impl_float.o
In file included from ../../utils/weight_functions_float.h:12:0,
                 from countpairs_rp_pi_kernels_float.c:23,
                 from countpairs_rp_pi_impl_float.c:22:
countpairs_rp_pi_kernels_float.c: In function ‘countpairs_rp_pi_avx512_intrinsics_float’:
../../utils/avx512_calls.h:175:46: warning: implicit declaration of function ‘_mm512_abs_ps’ [-Wimplicit-function-declaration]
 #define AVX512_ABS_FLOAT(X)                  _mm512_abs_ps(X)
                                              ^
countpairs_rp_pi_kernels_float.c:210:23: note: in expansion of macro ‘AVX512_ABS_FLOAT’
             m_zdiff = AVX512_ABS_FLOAT(m_zdiff);//now take the absolute value
                       ^
In file included from countpairs_rp_pi_impl_float.c:22:0:
countpairs_rp_pi_kernels_float.c:210:13: warning: nested extern declaration of ‘_mm512_abs_ps’ [-Wnested-externs]
             m_zdiff = AVX512_ABS_FLOAT(m_zdiff);//now take the absolute value
             ^
countpairs_rp_pi_kernels_float.c:210:21: error: incompatible types when assigning to type ‘__m512 {aka __vector(16) float}’ from type ‘int’
             m_zdiff = AVX512_ABS_FLOAT(m_zdiff);//now take the absolute value
                     ^
../../rules.mk:64: recipe for target 'countpairs_rp_pi_impl_float.o' failed
lgarrison commented 4 years ago

Thanks. I think that one is probably this bug: https://stackoverflow.com/questions/51290930/why-does-gcc-not-provide-the-avx512f-intrinsic-for-absolute-value-of-floats?rq=1

Looks like we can work around this one as well. That's two workarounds for gcc 5... but gcc 5.5.0 is only 3 years old, not sure it makes sense to drop support!

manodeep commented 4 years ago

gcc is terrible when it comes to intrinsics. For instance, we already have a non-standard AVX_ABS_FLOAT in avx_calls.h here. It is trivial to do the same for AVX512_ABS_FLOAT or better yet, directly take the implementation from the gcc patch listed in the link above.

I am wondering if we should report these two bugs (missing intrinsics, missing __AVX512VL__ under -march=native on SKX cpus) to gcc.

@lgarrison I agree with you - we should be able to support compilers that are only 3 years old.

manodeep commented 4 years ago

After looking through gcc docs (https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#x86-Options), apparently -march=skylake does NOT add any AVX512 instructions. For that, we need -march=skylake-avx512; and that target architecture is not available on gcc5 (only gcc6 or higher). Since we are seeing quite a few of the AVX512 isa's enabled by -march=native, I am uncertain what -march-native is translating to on a SKX cpu (something in between skylake and skylake-avx512).

lgarrison commented 4 years ago

I think there's 3 things to do here: 1) Detect if the CPU and compiler support -mavx512vl but __AVX512VL__ is not present in the preprocessor output for -march=native. If so, add -mavx512vl to CFLAGS. 2) Provide _mm512_abs_ps() if it is not present (needs more thought, how do we detect if this is defined? Does the system library define a macro? Or should we always just use our own implementation?) 3) Remove the AVX2 fallback, If I'm reading the error messages correctly, I don't think it will work under any circumstances.

@manodeep, does this all sound right to you?

manodeep commented 4 years ago

This will require a bit of thinking so that the complexity does not go up (and that we are not implementing configure like capabilities into Corrfunc). The original reason to only target AVX512F was to make sure that Corrfunc runs on both SKX+ and KNL+ hardware. However, the code currently fails to compile on KNL for this same issue (#183)

For the abs_ps, icc/clang/gcc >= 7.3 always provide the intrinsics (I think), and therefore we could simply use a custom one for lower gcc versions.

The AVX2 definitely is incorrect and needs to be replaced with _mm512_mask_blend_pd + some no-op casts from _m512i -> _m512d -> _m512ito keep the compiler happy. (Related, I am not sure we need _m256i - we could just use _m512i and the remaining 4 bins are all identically set to 0. We might just need to grab the lower half of the zmm register or however the array indexing works...). The upside could be that we fix the KNL build failure and manage to run Corrfunc on that