mpimd-csc / flexiblas

FlexiBLAS - A BLAS and LAPACK wrapper library with runtime exchangeable backends. This is only a mirror of https://gitlab.mpi-magdeburg.mpg.de/software/flexiblas-release
https://www.mpi-magdeburg.mpg.de/projects/flexiblas
GNU Lesser General Public License v3.0
37 stars 7 forks source link

LAPACK decompositions do not run in parallel #7

Closed Enchufa2 closed 3 years ago

Enchufa2 commented 3 years ago

Reported in #1889069. Reproducible example:

$ docker run --rm -it fedora:33
$ dnf install -y wget python3-numpy
$ wget https://gist.githubusercontent.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276/raw/660904cb770197c3c841ab9b7084657b1aea5f32/numpy-benchmark.py
$ python3 numpy-benchmark.py 
Dotted two 4096x4096 matrices in 0.62 s.
Dotted two vectors of length 524288 in 0.05 ms.
SVD of a 2048x1024 matrix in 7.41 s.
Cholesky decomposition of a 2048x2048 matrix in 1.18 s.
Eigendecomposition of a 2048x2048 matrix in 40.06 s.

Note the timings for SVD, Cholesky and Eigen decompositions. Now,

$ LD_PRELOAD=/lib64/libopenblaso.so.0 python3 numpy-benchmark.py 
Dotted two 4096x4096 matrices in 0.60 s.
Dotted two vectors of length 524288 in 0.04 ms.
SVD of a 2048x1024 matrix in 0.55 s.
Cholesky decomposition of a 2048x2048 matrix in 0.12 s.
Eigendecomposition of a 2048x2048 matrix in 5.89 s.
Enchufa2 commented 3 years ago

More findings about the issue:

So this seems to happen only when numpy is built against FlexiBLAS. But AFAICT, it is correctly linked and the output of ldd for the different numpy .so files seem sane to me. So I'm a bit lost here.

grisuthedragon commented 3 years ago

That sounds strange. I remind me that that setup.py procedure of numpy is not only searching for a BLAS library it also tries to search for some special symbols in the BLAS library to identify whether it is ATLAS, Goto, Open, MKL.. etc. I think me need to look in the build process of the numpy package. Are the build logs for the fc33 are somewhere available.

lupinix commented 3 years ago

Thats a good pointer. It looks like flexiblas is configured using openblas options in Fedoras numpy build: https://src.fedoraproject.org/rpms/numpy/blob/master/f/numpy.spec#_110

Enchufa2 commented 3 years ago

It is configured using the openblas key, because there's no specific key for flexiblas upstream and we thought it would be best if Numpy just thinks that it's using OpenBLAS, but this configuration points to libflexiblas (at the top of the spec, blaslib is set to flexiblas). In the build log, you can check that the flag being used is -lflexiblas, and libflexiblas.so.3()(64bit) is correctly listed as Requires at the end of the build.

Enchufa2 commented 3 years ago

Also, for Fedora 33, we have:

# ldd /usr/lib64/python3.9/site-packages/numpy/linalg/lapack_lite.cpython-39-x86_64-linux-gnu.so 
        linux-vdso.so.1 (0x00007ffcb4d80000)
        libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x00007f1291ceb000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f1291b1f000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f12919d9000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f12919d2000)
        libgfortran.so.5 => /lib64/libgfortran.so.5 (0x00007f1291717000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f12916f5000)
        libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007f12916a9000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f129209e000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f129168e000)

For Fedora 32 (linked against openblas-threads), we have

# ldd /usr/lib64/python3.8/site-packages/numpy/linalg/lapack_lite.cpython-38-x86_64-linux-gnu.so 
        linux-vdso.so.1 (0x00007fff9a9fa000)
        libopenblasp.so.0 => /lib64/libopenblasp.so.0 (0x00007f0ce9b92000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f0ce99c8000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f0ce9883000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0ce9861000)
        libgfortran.so.5 => /lib64/libgfortran.so.5 (0x00007f0ce9596000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f0cebfd4000)
        libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007f0ce954c000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f0ce952f000)

The presence of libdl.so.2 is the only difference, besides some reordering. I checked SciPy too, and we have the same problem for its SVD function (which was to be expected, because they have very similar configuration and build systems).

lupinix commented 3 years ago

Maybe using the buildtime openblas key changes something internally?

grisuthedragon commented 3 years ago

According to numpy's documentation, they use dgesdd from LAPACK to compute the SVD. In the benchmark, the SVD is called such that the singular values and the vectors are returned. I created a minimal working example to test this routines here: https://gist.githubusercontent.com/grisuthedragon/8956628ddde9af7505fb8c0d5c799c1e/raw/6d595f8006ef1129d16b60e82b4014e5e5a8d7ad/test_dgesdd.f90

Then I did the following on a FC33 VM / 4 Cores / Intel Sandybridge:

[root@fedora33 ~]# gfortran test_dgesdd.f90  -lflexiblas -o test_dgesdd.flexiblas
[root@fedora33 ~]# gfortran test_dgesdd.f90  -lopenblaso-r0.3.10 -o test_dgesdd.openblaso 
[root@fedora33 ~]# gfortran test_dgesdd.f90  -lopenblas-r0.3.10 -o test_dgesdd.openblass
[root@fedora33 ~]# gfortran test_dgesdd.f90  -llapack -lblas -o test_dgesdd.netlib

[root@fedora33 ~]# FLEXIBLAS=OPENBLAS-OPENMP ./test_dgesdd.flexiblas 
 Time =    1.4351626060000000     
[root@fedora33 ~]# FLEXIBLAS=OPENBLAS-SERIAL ./test_dgesdd.flexiblas 
 Time =    2.8791492609999998     
[root@fedora33 ~]# ./test_dgesdd.openblaso
 Time =    1.4242952959999999     
[root@fedora33 ~]# ./test_dgesdd.openblass
 Time =    2.9305672999999999     
[root@fedora33 ~]# ./test_dgesdd.netlib
 Time =    18.114536298000001

Regarding the runtimes we see that, as expected, FlexiBLAS-OpenBLAS and OpenBLAS naive are similar. But the Netlib time is high as expected. From this point of view it seems that in the Numpy case, FlexiBLAS falls back to its internal NETLIB fallback interface. I think it has something to do with the way Python and Numpy include their shared library addons. Using strange flags in dlopen could be the reason. In R, as @Enchufa2 tested, this isn't a problem. I remeber from my first developments (around version 1 in 2013) with FlexiBLAS, that a colleague produced sometimes a "symbol not found" error, when using FlexiBLAS with Numpy.

I will take a deeper look in this during this week. But If anybody has some ideas, especially how to monitor what python is doing under the hood, I am interested in.

grisuthedragon commented 3 years ago

I think I have the reason. Python loads all shared objects with RLTD_LAZY | RTLD_LOCAL and thus the first call to a LAPACK routine picks the first symbol in the GOT which fits the requirements. Since FlexiBLAS first integrates all Fallback symbols in the GLOBAL offset table during the initialization and the remaining ones(like everything from BLAS) it not looked up yet. The first call from Python/Numpy to LAPACK will pick the fallback we loaded in the GOT. I am currently developing the fix for this.

A short fix would be to replace the RTLD_GLOBAL flags in src/flexiblas.c / flexiblas_init by RTLD_LOCAL, but some first test show that this is only half of the truth.

grisuthedragon commented 3 years ago

Fixed in version 3.0.4