Closed Enchufa2 closed 3 years ago
More findings about the issue: with `LD_PRELOAD=/lib64/libflexiblas.so.3`, decompositions run in parallel correctly. So this seems to happen only when numpy is built against FlexiBLAS. But AFAICT, it is correctly linked, and the output of `ldd` for the different numpy `.so` files seems sane to me. So I'm a bit lost here.
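A quick way to cross-check which BLAS a given NumPy build thinks it is using is `numpy.show_config()` (a sketch; the exact output depends on the build, and on the Fedora package the FlexiBLAS library is expected to show up under OpenBLAS-style keys):

```python
import numpy as np

# Print the BLAS/LAPACK configuration this NumPy build was compiled
# against. On the Fedora 33 package this should point at libflexiblas,
# even though it is reported under openblas-style configuration keys.
np.show_config()
```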
That sounds strange. It reminds me that numpy's `setup.py` procedure is not only searching for a BLAS library; it also searches for some special symbols in the BLAS library to identify whether it is ATLAS, Goto, OpenBLAS, MKL, etc. I think we need to look into the build process of the numpy package. Are the build logs for fc33 available somewhere?
That's a good pointer. It looks like flexiblas is configured using the openblas options in Fedora's numpy build: https://src.fedoraproject.org/rpms/numpy/blob/master/f/numpy.spec#_110
It is configured using the `openblas` key, because there's no specific key for flexiblas upstream and we thought it would be best if numpy just thinks that it's using OpenBLAS. But this configuration points to libflexiblas (at the top of the spec, `blaslib` is set to `flexiblas`). In the build log, you can check that the flag being used is `-lflexiblas`, and `libflexiblas.so.3()(64bit)` is correctly listed as Requires at the end of the build.
Also, for Fedora 33, we have:
```
# ldd /usr/lib64/python3.9/site-packages/numpy/linalg/lapack_lite.cpython-39-x86_64-linux-gnu.so
	linux-vdso.so.1 (0x00007ffcb4d80000)
	libflexiblas.so.3 => /lib64/libflexiblas.so.3 (0x00007f1291ceb000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f1291b1f000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f12919d9000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f12919d2000)
	libgfortran.so.5 => /lib64/libgfortran.so.5 (0x00007f1291717000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f12916f5000)
	libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007f12916a9000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f129209e000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f129168e000)
```
For Fedora 32 (linked against openblas-threads), we have:

```
# ldd /usr/lib64/python3.8/site-packages/numpy/linalg/lapack_lite.cpython-38-x86_64-linux-gnu.so
	linux-vdso.so.1 (0x00007fff9a9fa000)
	libopenblasp.so.0 => /lib64/libopenblasp.so.0 (0x00007f0ce9b92000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f0ce99c8000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f0ce9883000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0ce9861000)
	libgfortran.so.5 => /lib64/libgfortran.so.5 (0x00007f0ce9596000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f0cebfd4000)
	libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007f0ce954c000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f0ce952f000)
```
The presence of `libdl.so.2` is the only difference, besides some reordering. I checked SciPy too, and we have the same problem for its SVD function (which was to be expected, because they have very similar configuration and build systems).
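Since `ldd` only shows link-time dependencies, it can also help to inspect which shared objects are actually mapped into the running Python process, which additionally catches anything pulled in at runtime via `dlopen()`. A Linux-only sketch (the filter keywords are illustrative):

```python
# Importing numpy (if available) pulls its BLAS backend into the process.
try:
    import numpy  # noqa: F401
except ImportError:
    pass

# List the shared objects actually mapped into this process; unlike ldd,
# this also shows libraries loaded at runtime via dlopen().
with open("/proc/self/maps") as f:
    libs = {line.split()[-1] for line in f if ".so" in line}

for lib in sorted(libs):
    if any(key in lib for key in ("blas", "lapack", "flexi")):
        print(lib)
```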
Maybe using the build-time `openblas` key changes something internally?
According to numpy's documentation, they use `dgesdd` from LAPACK to compute the SVD. In the benchmark, the SVD is called such that the singular values and the vectors are returned. I created a minimal working example to test this routine here: https://gist.githubusercontent.com/grisuthedragon/8956628ddde9af7505fb8c0d5c799c1e/raw/6d595f8006ef1129d16b60e82b4014e5e5a8d7ad/test_dgesdd.f90
Then I did the following on a FC33 VM / 4 Cores / Intel Sandybridge:
```
[root@fedora33 ~]# gfortran test_dgesdd.f90 -lflexiblas -o test_dgesdd.flexiblas
[root@fedora33 ~]# gfortran test_dgesdd.f90 -lopenblaso-r0.3.10 -o test_dgesdd.openblaso
[root@fedora33 ~]# gfortran test_dgesdd.f90 -lopenblas-r0.3.10 -o test_dgesdd.openblass
[root@fedora33 ~]# gfortran test_dgesdd.f90 -llapack -lblas -o test_dgesdd.netlib
[root@fedora33 ~]# FLEXIBLAS=OPENBLAS-OPENMP ./test_dgesdd.flexiblas
 Time =    1.4351626060000000
[root@fedora33 ~]# FLEXIBLAS=OPENBLAS-SERIAL ./test_dgesdd.flexiblas
 Time =    2.8791492609999998
[root@fedora33 ~]# ./test_dgesdd.openblaso
 Time =    1.4242952959999999
[root@fedora33 ~]# ./test_dgesdd.openblass
 Time =    2.9305672999999999
[root@fedora33 ~]# ./test_dgesdd.netlib
 Time =    18.114536298000001
```
Regarding the runtimes, we see that FlexiBLAS-with-OpenBLAS and plain OpenBLAS are similar, and the Netlib time is high, both as expected. From this point of view, it seems that in the numpy case FlexiBLAS falls back to its internal Netlib fallback interface. I think it has something to do with the way Python and numpy load their shared-library add-ons; strange flags passed to `dlopen` could be the reason. In R, as @Enchufa2 tested, this isn't a problem. I remember from my first developments with FlexiBLAS (around version 1, in 2013) that a colleague sometimes produced a "symbol not found" error when using FlexiBLAS with numpy.
I will take a deeper look into this during the week. But if anybody has ideas, especially on how to monitor what Python is doing under the hood, I am interested.
I think I have the reason. Python loads all shared objects with `RTLD_LAZY | RTLD_LOCAL`, and thus the first call to a LAPACK routine picks the first symbol in the GOT that fits the requirements. Since FlexiBLAS integrates all fallback symbols into the global offset table (GOT) during initialization, while the remaining ones (like everything from BLAS) are not looked up yet, the first call from Python/numpy to LAPACK will pick the fallback we loaded into the GOT. I am currently developing a fix for this.
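CPython's `dlopen` flags for extension modules can be inspected from Python itself, which helps verify this diagnosis (a sketch; the exact default flags vary by platform and CPython version, and this only works where `dlopen` is used at all):

```python
import os
import sys

# Flags CPython passes to dlopen() when importing C extension modules.
flags = sys.getdlopenflags()
print(flags)

# RTLD_GLOBAL is not set by default, so symbols loaded by one extension
# (e.g. numpy's lapack_lite) are resolved locally, not published globally.
print(bool(flags & os.RTLD_GLOBAL))

# One could opt in to global symbol visibility before importing numpy:
#   sys.setdlopenflags(os.RTLD_NOW | os.RTLD_GLOBAL)
```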
A short fix would be to replace the `RTLD_GLOBAL` flag in `src/flexiblas.c` / `flexiblas_init` by `RTLD_LOCAL`, but some first tests show that this is only half of the truth.
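The difference between the two loading modes can be sketched from Python with `ctypes` (using `libm` purely as a stand-in for a BLAS backend; this illustrates the flag semantics, not the actual FlexiBLAS fix):

```python
import ctypes
import ctypes.util
import os

# With RTLD_LOCAL, the library's symbols stay out of the global namespace,
# but remain reachable through this explicit handle.
name = ctypes.util.find_library("m") or "libm.so.6"
lib_local = ctypes.CDLL(name, mode=os.RTLD_LOCAL)
print(lib_local.cos is not None)

# With RTLD_GLOBAL, the same symbols would also satisfy lazy lookups made
# later by other shared objects, which is the behavior FlexiBLAS relies on
# when it resolves backend symbols at runtime.
lib_global = ctypes.CDLL(name, mode=os.RTLD_GLOBAL)
```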
Fixed in version 3.0.4.
Reported in #1889069. Reproducible example:

Note the timings for the SVD, Cholesky, and Eigen decompositions.