Closed lgarrison closed 4 years ago
I tested on ozstar
, with binutils/2.30. While the fix worked with gcc (disabled AVX512F), it did not work for icc
.
I had added an #error AVX512F should not be enabled
line here to make sure that the compilation fails, if AVX512F was enabled. Here is the error generated:
icc -DDOUBLE_PREC -DVERSION=\"2.3.1\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3 -march=native -qopenmp -mno-avx512f -I../../io -I../../utils -c countpairs_impl_double.c -o countpairs_impl_double.o
icc: command line warning #10159: invalid argument for option '-m'
In file included from countpairs_impl_double.c(21):
countpairs_kernels_double.c(30): error: #error directive: AVX512F should not be enabled
#error AVX512F should not be enabled
^
The option to disable AVX512F with icc is different, but I have not yet been able to figure out what that is. My attempts with -mno-COMMON-AVX512
etc were fruitless. Here is how I was trying it:
[~/temp/test_corrfunc_gas_bug/Corrfunc @farnarkle2] icc -O3 -std=c99 -dM -E -xc -c /dev/null | grep -i avx
[~/temp/test_corrfunc_gas_bug/Corrfunc @farnarkle2] icc -march=native -O3 -std=c99 -dM -E -xc -c /dev/null | grep -i avx
#define __AVX__ 1
#define __AVX_I__ 1
#define __AVX2__ 1
#define __AVX512F__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512BW__ 1
#define __AVX512VL__ 1
So the -march=native
is definitely affecting the compiler flags. Without the -march=native
, the AVX*
flags are not enabled.
I do have another suggestion - will you please replace the magenta colour for the text for the message "Disabling AVX512" with red. This is an important enough message that users should pay attention.
Hmm, that's too bad about icc
. It works for me on icc
19, what version are you running? But we'll have to think of another solution that works for earlier icc
too.
@lgarrison I was using icc 18.0.1
. Intel/2016 (the other version available on ozstar), does not support AVX512
- so the fix is not triggered.
One solution could be to capture the output of $(CC) -march=native -dM -E -xc -c /dev/null
and checking what instruction sets are supported. We can then filter-out the AVX512 flags when GAS 2.30/2.31 is detected
Can you try this version? -xCORE-AVX2
seems to override -march=native
, even on older icc
.
Yup - this works for me with intel 2018. Okay to go ahead and "squash + merge"?
Yes!
Just adding a note to document that @SandyYuan and I were able to get AVX-512 working on a Linux system that exhibited this bug by using conda-forge compilers. The steps were:
$ conda create -n forge -c conda-forge python=3
$ conda activate forge
$ conda install -c conda-forge gcc_linux-64 binutils_linux-64 gsl
$ git clone https://github.com/manodeep/Corrfunc.git
$ cd Corrfunc/
$ # edit common.mk with CC := x86_64-conda_cos6-linux-gnu-gcc
$ pip install -e ./
In theory, mainline conda also has binutils>=2.32, but that wasn't working for us. And there's probably a syntax to pass CC
to pip without cloning the repo to edit common.mk
, but we didn't test that.
The (in)ability to pass the compiler at the command-line when installing with pip
has always bothered me. I have now figured it out -- the solution is to use python -m pip install --verbose -e . --install-option="CC=/path/to/custom/compiler"
Forgot to mention maybe the most pernicious bug: one should install GSL from conda when using conda's compilers, because if it picks up the system GSL, it will add -I/usr/include
to the CFLAGS
, which will intercept non-GSL #include
s, loading a set of headers incompatible with conda's compilers.
Right - of course! You have to be entirely within the conda
ecosystem
conda install -c conda-forge gcc_linux-64 binutils_linux-64 gsl
hi @lgarrison, this solution worked for me. In case it is interesting, i report the improvement in performance:
Node details:
AVX
AVX512
The odd thing is how much performance with AVX512 decreases when 48 threads are used.
Thanks for reporting back! I'm glad you did get a speedup after all. Definitely curious that you don't see a speedup until you use hyperthreading. Maybe the floating-point throughput is not the bottleneck initially, until you use enough threads to mask whatever memory/cache/backend latency is occurring.
You may also be interested in seeing if PR #258 give you extra performance, if your use-case allows for evenly-spaced bins.
This fixes #193 by querying the compiler to see if it is using a known-bad version of the GNU Assembler (GAS) under the hood. If so, AVX-512 is disabled. This works for
gcc
andicc
(both of which use GAS under the hood), andclang
if it is using GAS instead of its integrated assembler.If AVX-512 is disabled, a warning will be printed during compilation. I think the pip installation will still be silent though, which may not be what we want.
I also had to change
icc -xHost
toicc -march=native
. It doesn't seem possible to disable AVX-512F (with-mno-avx512f
, for example) with the former, only the latter. I don't think it will make a big difference.There was also a bug in the
icc
compiler options: we were using-xHost -xCORE-AVX512
indiscriminately, which will always generate AVX-512 instructions, even on non-AVX-512 machines!I also changed the compilation output coloring to use
tput
instead ofecho
and escape sequences. On Ubuntu, I was getting tons of-e
s in my output becausesh
is actuallydash
, andecho
indash
doesn't have a-e
flag (it's enabled by default).tput
seemed like a more robust solution than trying to detectdash
, but let me know what you think.