mpimd-csc / flexiblas

FlexiBLAS - A BLAS and LAPACK wrapper library with runtime exchangeable backends. This is only a mirror of https://gitlab.mpi-magdeburg.mpg.de/software/flexiblas-release
https://www.mpi-magdeburg.mpg.de/projects/flexiblas
GNU Lesser General Public License v3.0

flexiblas-openblas-openmp with OMP_PROC_BIND binds the program to a single core #19

Closed KoykL closed 2 years ago

KoykL commented 2 years ago

When flexiblas.c loads the openblas-openmp library, it uses __flexiblas_dlopen, which calls dlopen on openblas-openmp twice. The first call retrieves global variables such as "flexiblas_ld_global", and that value is then used for the second, "actual" dlopen call.

However, when OMP_PROC_BIND is set, the OpenMP runtime sets the CPU affinity mask of the current thread to just the first CPU core during the first dlopen call to openblas-openmp. During the second dlopen call, the OpenMP runtime picks up this affinity mask and concludes that the first CPU core is the only available CPU. As a result, only the first CPU is used when OMP_PROC_BIND is combined with openblas-openmp and flexiblas.
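One way to observe the narrowed mask from the Python side is to compare the CPUs the process may run on with the CPUs the machine has. This is a minimal sketch, assuming Linux (where Python provides `os.sched_getaffinity`). The helper names `visible_cpus` and `looks_pinned` are made up for illustration and are not part of FlexiBLAS or numpy.

```python
import os


def visible_cpus():
    """Return the set of CPU ids the caller is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: passing 0 selects the caller itself.
        return set(os.sched_getaffinity(0))
    # Assumption for other platforms: no affinity API, treat all CPUs as usable.
    return set(range(os.cpu_count() or 1))


def looks_pinned(cpus, total):
    """True if the mask has shrunk to one CPU on a multi-CPU machine."""
    return total > 1 and len(cpus) == 1


if __name__ == "__main__":
    cpus = visible_cpus()
    total = os.cpu_count() or 1
    print(f"allowed CPUs: {sorted(cpus)} of {total}")
    print("pinned to a single CPU:", looks_pinned(cpus, total))
```

Calling `visible_cpus()` once at startup and once after `import numpy` under `OMP_PROC_BIND=true` should show whether the mask shrinks in the way described above.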

Is there a way to use all cpu cores when cpu binding is used with openmp? Thank you!

grisuthedragon commented 2 years ago

Hi KoykL,

Thanks for the report. To reproduce the problem, can you give me some information about your system: CPU, OS, compiler, and OpenMP library?

KoykL commented 2 years ago

OS: Fedora 35
Compiler: gcc 11.2.1
OpenMP runtime: libgomp-11.2.1
FlexiBLAS: 3.0.4
CPU: EPYC 7502P

grisuthedragon commented 2 years ago

So I tried to reproduce the bug on the following system:

I compiled FlexiBLAS with

cmake3 ../ -DDEV=ON -DEXTRA="OB19" -DOB19_LIBRARY="/scratch/openblas-0.3.19/lib/libopenblas.so;gomp;pthread"
make -j 16

and then ran a DGEMM benchmark both through FlexiBLAS with the OpenBLAS backend and with OpenBLAS linked directly. (The benchmark tool can be found in the examples folder after the build.)

# DGEMM Benchmark on all 128 Cores ( 64 + 64 HT) for two 20000x20000 matrices using only the OB19 backend(OpenBLAS 0.3.19 from above)
$ OMP_NUM_THREADS=128 OMP_PROC_BIND=true ./benchmark -b DGEMM -d 20000 -r 1 -o OB19
# Dimension: 20000
# Runs: 1 
# Benchmark: DGEMM
# Only: OB19
#                         Name      Runtime          GFlops
                          OB19   1.70541580e+01          9.38187627e+02

# Benchmark with directly linked OpenBLAS 0.3.19:
$ OMP_NUM_THREADS=128 OMP_PROC_BIND=true ./benchmark.OB19 -b DGEMM -d 20000 -r 1 
# Dimension: 20000
# Runs: 1 
# Benchmark: DGEMM
#                         Name      Runtime          GFlops
                            OB19   1.70563250e+01          9.38068431e+02

The performance values are the same up to small disturbances. I also monitored CPU utilization in the background with top, and all 128 cores were working in both cases.

Thus, I could not reproduce your problem. Please provide a self-contained MWE that I can build and debug. Otherwise I will close this issue within the next few days.

Enchufa2 commented 2 years ago

@KoykL Are you using FlexiBLAS as provided in the official Fedora repos? Cannot reproduce here either.

KoykL commented 2 years ago

I am using FlexiBLAS from the official repo. However, I am calling BLAS from Python (numpy). Let me investigate a bit more to find out what is actually happening.

KoykL commented 2 years ago

I figured out why you cannot reproduce the problem.

The compiler arguments for building the benchmark include "-fopenmp" by default. Linking the OpenMP runtime into the main program probably prevents the OpenMP runtime from being restarted between the two dlopen calls in __flexiblas_dlopen.

To reproduce the behavior I reported, remove "-fopenmp" when building benchmark.c, e.g.:

Original command:
/usr/bin/cc -fPIC -std=c99 -D_FILE_OFFSET_BITS=64 -fopenmp -Wno-unused-parameter -O3 -DNDEBUG -rdynamic CMakeFiles/benchmark.dir/benchmark.c.o -o benchmark -Wl,-rpath,/tmp/flexiblas/build/lib -lm -ldl ../lib/libflexiblas.so.3.0 ../libcscutils/lib/libcscutils.a -lm -ldl -lgfortran -lm -lgfortran -lm -lquadmath -lm

Modified command:
/usr/bin/cc -fPIC -std=c99 -D_FILE_OFFSET_BITS=64 -Wno-unused-parameter -O3 -DNDEBUG -rdynamic CMakeFiles/benchmark.dir/benchmark.c.o -o benchmark -Wl,-rpath,/tmp/flexiblas/build/lib -lm -ldl ../lib/libflexiblas.so.3.0 ../libcscutils/lib/libcscutils.a -lm -ldl -lgfortran -lm -lgfortran -lm -lquadmath -lm

If successful, when you run the benchmark with the additional environment variable OMP_DISPLAY_ENV=VERBOSE, you will see "OPENMP DISPLAY ENVIRONMENT BEGIN" twice, indicating that the OpenMP runtime has been restarted, which triggers the bug I originally reported.

With the original compiler arguments, "OPENMP DISPLAY ENVIRONMENT BEGIN" will only be printed once.
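The banner check can also be scripted rather than read off the terminal. The sketch below assumes the runtime prints the banner text quoted above to stdout or stderr. The function names `count_runtime_starts` and `run_and_count` are hypothetical helpers, not part of FlexiBLAS.

```python
import os
import subprocess

# Banner text as quoted in this thread.
MARKER = "OPENMP DISPLAY ENVIRONMENT BEGIN"


def count_runtime_starts(output):
    """Count how many times the OpenMP runtime announced itself."""
    return output.count(MARKER)


def run_and_count(cmd):
    """Run `cmd` with OMP_DISPLAY_ENV=VERBOSE and count the banners."""
    env = dict(os.environ, OMP_DISPLAY_ENV="VERBOSE", OMP_PROC_BIND="true")
    proc = subprocess.run(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams so the banner is seen either way
        text=True,
        check=False,
    )
    return count_runtime_starts(proc.stdout)
```

For instance, `run_and_count(["./benchmark", "-b", "DGEMM", "-d", "20000", "-r", "1"])` should return 2 for the binary built without "-fopenmp" and 1 for the original build, if the behavior matches the description above.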

Enchufa2 commented 2 years ago

I confirm that the bug exists at least for the stock version of numpy. E.g.:

$ docker run --rm -it fedora:35
$ dnf -y install python3-numpy wget
$ wget https://raw.githubusercontent.com/dmytrov/benchmark-GEMM/master/numpy-benchmark.py
$ python3 numpy-benchmark.py

I see all my cores spinning. However,

$ OMP_PROC_BIND=true python3 numpy-benchmark.py

spins a single core. I tried GEMM benchmarking in R with and without OMP_PROC_BIND=true and there's no issue. Notably, R includes -fopenmp:

$ R CMD config --ldflags
-Wl,--export-dynamic -fopenmp -Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -L/usr/lib64/R/lib -lR -ltre -lpcre2-8 -llzma -lbz2 -lz -lrt -ldl -lm -licuuc -licui18n

Numpy does not.

grisuthedragon commented 2 years ago

I can now reproduce the error myself. While searching for the cause, I found that several other projects interfacing OpenMP from Python had similar problems. I found two different solutions (at the moment):

  1. Add the `RTLD_NODELETE` flag to the dlopen calls. The main advantage is that FlexiBLAS does not need to be linked against OpenMP (since it does not use any OpenMP routine). The disadvantage is that `RTLD_NODELETE` is not a POSIX flag, but it is available on my main targets, Linux and FreeBSD.
  2. Link FlexiBLAS against OpenMP. Advantage: OpenMP will be initialized properly. Disadvantage: OpenMP is initialized even if no application uses it.
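The first option above can be illustrated from Python, where `ctypes.CDLL` passes its `mode` argument through to dlopen. This is only a sketch of the flag's use, not FlexiBLAS's actual code (which is C and would pass the flag to dlopen directly). `open_resident` is a hypothetical helper, and the presence of libgomp on the system is an assumption.

```python
import ctypes
import ctypes.util
import os


def open_resident(path, extra_flags=0):
    """dlopen `path` so that a later dlclose does not unload it."""
    # RTLD_NODELETE is not POSIX; fall back to 0 where Python does not expose it.
    nodelete = getattr(os, "RTLD_NODELETE", 0)
    return ctypes.CDLL(path, mode=os.RTLD_NOW | nodelete | extra_flags)


if __name__ == "__main__":
    # Assumption: libgomp is installed and can be located by name.
    name = ctypes.util.find_library("gomp")
    if name is None:
        print("libgomp not found; nothing to demonstrate")
    else:
        try:
            gomp = open_resident(name)
            # omp_get_max_threads is a standard OpenMP API function.
            print("max OpenMP threads:", gomp.omp_get_max_threads())
        except OSError as exc:
            print("could not load", name, "->", exc)
```

Because the library stays resident, a second dlopen of the same file reuses the already initialized runtime instead of starting it again, which is the property the first option relies on.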

I would implement it in the following way:

If this behavior is fine, I will include it in the next release (@Enchufa2: scheduled for tomorrow).

KoykL commented 2 years ago

I think this is a good solution to the problem.

Enchufa2 commented 2 years ago

Sounds good to me. :)

grisuthedragon commented 2 years ago

Closed with 3.1.0