GNU Assembler bug workaround

lgarrison commented 4 years ago

This fixes #193 by querying the compiler to see if it is using a known-bad version of the GNU Assembler (GAS) under the hood. If so, AVX-512 is disabled. This works for gcc and icc (both of which use GAS under the hood), and clang if it is using GAS instead of its integrated assembler.

If AVX-512 is disabled, a warning will be printed during compilation. I think the pip installation will still be silent though, which may not be what we want.

I also had to change icc -xHost to icc -march=native. It doesn't seem possible to disable AVX-512F (with -mno-avx512f, for example) with the former, only the latter. I don't think it will make a big difference.

There was also a bug in the icc compiler options: we were using -xHost -xCORE-AVX512 indiscriminately, which will always generate AVX-512 instructions, even on non-AVX-512 machines!

I also changed the compilation output coloring to use tput instead of echo and escape sequences. On Ubuntu, I was getting tons of -es in my output because sh is actually dash, and echo in dash doesn't have a -e flag (it's enabled by default). tput seemed like a more robust solution than trying to detect dash, but let me know what you think.

manodeep commented 4 years ago

I tested on ozstar, with binutils/2.30. While the fix worked with gcc (disabled AVX512F), it did not work for icc.

I had added an #error AVX512F should not be enabled line here to make sure that the compilation fails, if AVX512F was enabled. Here is the error generated:

icc -DDOUBLE_PREC  -DVERSION=\"2.3.1\" -DUSE_UNICODE -std=c99 -m64 -g -Wsign-compare -Wall -Wextra -Wshadow -Wunused -fPIC -D_POSIX_SOURCE=200809L -D_GNU_SOURCE -D_DARWIN_C_SOURCE -O3  -march=native -qopenmp -mno-avx512f -I../../io -I../../utils  -c countpairs_impl_double.c -o countpairs_impl_double.o
icc: command line warning #10159: invalid argument for option '-m'
In file included from countpairs_impl_double.c(21):
countpairs_kernels_double.c(30): error: #error directive: AVX512F should not be enabled
  #error AVX512F should not be enabled
   ^

The option to disable AVX512F with icc is different, but I have not yet been able to figure out what that is. My attempts with -mno-COMMON-AVX512 etc were fruitless. Here is how I was trying it:

[~/temp/test_corrfunc_gas_bug/Corrfunc @farnarkle2] icc -O3 -std=c99  -dM -E -xc -c /dev/null | grep -i avx
[~/temp/test_corrfunc_gas_bug/Corrfunc @farnarkle2] icc -march=native -O3 -std=c99  -dM -E -xc -c /dev/null | grep -i avx
#define __AVX__ 1
#define __AVX_I__ 1
#define __AVX2__ 1
#define __AVX512F__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512BW__ 1
#define __AVX512VL__ 1

So the -march=native is definitely affecting the compiler flags. Without the -march=native, the AVX* flags are not enabled.

I do have another suggestion - will you please replace the magenta colour for the text for the message "Disabling AVX512" with red. This is an important enough message that users should pay attention.

lgarrison commented 4 years ago

Hmm, that's too bad about icc. It works for me on icc 19, what version are you running? But we'll have to think of another solution that works for earlier icc too.

manodeep commented 4 years ago

@lgarrison I was using icc 18.0.1. Intel/2016 (the other version available on ozstar), does not support AVX512 - so the fix is not triggered.

manodeep commented 4 years ago

One solution could be to capture the output of $(CC) -march=native -dM -E -xc -c /dev/null and checking what instruction sets are supported. We can then filter-out the AVX512 flags when GAS 2.30/2.31 is detected

lgarrison commented 4 years ago

Can you try this version? -xCORE-AVX2 seems to override -march=native, even on older icc.

manodeep commented 4 years ago

Yup - this works for me with intel 2018. Okay to go ahead and "squash + merge"?

lgarrison commented 4 years ago

Yes!

lgarrison commented 3 years ago

Just adding a note to document that @SandyYuan and I were able to get AVX-512 working on a Linux system that exhibited this bug by using conda-forge compilers. The steps were:

$ conda create -n forge -c conda-forge python=3
$ conda activate forge
$ conda install -c conda-forge gcc_linux-64 binutils_linux-64 gsl
$ git clone https://github.com/manodeep/Corrfunc.git
$ cd Corrfunc/
$ # edit common.mk with CC := x86_64-conda_cos6-linux-gnu-gcc
$ pip install -e ./

In theory, mainline conda also has binutils>=2.32, but that wasn't working for us. And there's probably a syntax to pass CC to pip without cloning the repo to edit common.mk, but we didn't test that.

manodeep commented 3 years ago

The (in)ability to pass the compiler at the command-line when installing with pip has always bothered me. I have now figured it out -- the solution is to use python -m pip install --verbose -e . --install-option="CC=/path/to/custom/compiler"

Relevant SO link

lgarrison commented 3 years ago

Forgot to mention maybe the most pernicious bug: one should install GSL from conda when using conda's compilers, because if it picks up the system GSL, it will add -I/usr/include to the CFLAGS, which will intercept non-GSL #includes, loading a set of headers incompatible with conda's compilers.

manodeep commented 3 years ago

Right - of course! You have to be entirely within the conda ecosystem

valerio-marra commented 2 years ago

conda install -c conda-forge gcc_linux-64 binutils_linux-64 gsl

hi @lgarrison, this solution worked for me. In case it is interesting, i report the improvement in performance:

Node details:

4x12 Core Skylake Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
512 GB DDR3 1333 MHz

AVX

Running corrfunc with 20000000 particles on 48 threads DDtheta_mocks took 519.1 seconds
Running corrfunc with 20000000 particles on 96 threads DDtheta_mocks took 409.5 seconds

AVX512

Running corrfunc with 20000000 particles on 48 threads DDtheta_mocks took 521.4 seconds
Running corrfunc with 20000000 particles on 96 threads DDtheta_mocks took 332.8 seconds

The odd thing is how much performance with AVX512 decreases when 48 threads are used.

lgarrison commented 2 years ago

Thanks for reporting back! I'm glad you did get a speedup after all. Definitely curious that you don't see a speedup until you use hyperthreading. Maybe the floating-point throughput is not the bottleneck initially, until you use enough threads to mask whatever memory/cache/backend latency is occurring.

You may also be interested in seeing if PR #258 give you extra performance, if your use-case allows for evenly-spaced bins.

manodeep / Corrfunc

GNU Assembler bug workaround #196