scipy / scipy

SciPy library main repository
https://scipy.org
BSD 3-Clause "New" or "Revised" License

BUG: Hang on Windows in scikit-learn with 1.13rc1 and 1.14.dev (maybe due to OpenBLAS 0.3.26 update?) #20294

Closed lesteve closed 6 months ago

lesteve commented 7 months ago

Describe your issue.

This was originally seen in MNE-Python CI roughly a week ago (March 13) using the scipy development wheel (scientific-python-nightly-wheel) and reported in scikit-learn; see https://github.com/scikit-learn/scikit-learn/issues/28625 for more details.

This can also be reproduced with scipy 1.13rc1. Could it be due to updating OpenBLAS to 0.3.26?

The scikit-learn code does nested parallelism, OpenBLAS within OpenMP. Setting OMP_NUM_THREADS=1 or OPENBLAS_NUM_THREADS=1 avoids the hang. A similar work-around can be used with threadpoolctl.
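
For reference, the threadpoolctl form of that workaround looks roughly like this (a sketch using the same call as the reproducer below):

from sklearn.metrics._pairwise_distances_reduction import ArgKmin
import numpy as np
import threadpoolctl

X = np.zeros((20, 14000))
# Same effect as OMP_NUM_THREADS=1 / OPENBLAS_NUM_THREADS=1: limit one of the
# two thread pools for the duration of the call.
with threadpoolctl.threadpool_limits(limits=1, user_api='blas'):
    ArgKmin.compute(X=X, Y=X, k=10, metric='euclidean')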

Insights or suggestions more than welcome!

Reproducing Code Example

from sklearn.metrics._pairwise_distances_reduction import ArgKmin
import numpy as np
import threadpoolctl

# Uncommenting either of these two lines avoids the hang
# threadpoolctl.threadpool_limits(limits=1, user_api='blas')
# threadpoolctl.threadpool_limits(limits=1, user_api='openmp')
X = np.zeros((20, 14000))
ArgKmin.compute(X=X, Y=X, k=10, metric='euclidean')

Error message

The script hangs, i.e. it never completes. With one of the workarounds it completes in less than one second.

SciPy/NumPy/Python version and system information

1.13.0rc1 1.26.4 sys.version_info(major=3, minor=12, micro=2, releaselevel='final', serial=0)
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: /c/opt/64/include
    lib directory: /c/opt/64/lib
    name: openblas
    openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= ZEN MAX_THREADS=24
    pc file directory: c:/opt/64/lib/pkgconfig
    version: 0.3.26
  lapack:
    detection method: pkgconfig
    found: true
    include directory: /c/opt/64/include
    lib directory: /c/opt/64/lib
    name: openblas
    openblas configuration: USE_64BITINT= DYNAMIC_ARCH=1 DYNAMIC_OLDER= NO_CBLAS=
      NO_LAPACK= NO_LAPACKE= NO_AFFINITY=1 USE_OPENMP= ZEN MAX_THREADS=24
    pc file directory: c:/opt/64/lib/pkgconfig
    version: 0.3.26
  pybind11:
    detection method: config-tool
    include directory: unknown
    name: pybind11
    version: 2.11.1
Compilers:
  c:
    commands: cc
    linker: ld.bfd
    name: gcc
    version: 10.3.0
  c++:
    commands: c++
    linker: ld.bfd
    name: gcc
    version: 10.3.0
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.9
  fortran:
    commands: gfortran
    linker: ld.bfd
    name: gcc
    version: 10.3.0
  pythran:
    include directory: C:\Users\runneradmin\AppData\Local\Temp\pip-build-env-t0bobn5y\overlay\Lib\site-packages/pythran
    version: 0.15.0
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: windows
  cross-compiled: false
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: windows
Python Information:
  path: C:\Users\runneradmin\AppData\Local\Temp\cibw-run-bxqwrdgx\cp312-win_amd64\build\venv\Scripts\python.exe
  version: '3.12'
lucascolley commented 7 months ago

cc @mattip

lesteve commented 7 months ago

Hmmm, actually, debugging further, this does seem like an issue in OpenBLAS 0.3.26. I can reproduce the hang with conda-forge packages and scipy 1.12. With OpenBLAS 0.3.25 the snippet runs fine.

Conda env created with:

mamba create -n lof-issue-conda-2 scikit-learn ipython openblas=0.3.26 blas=*=*openblas* -y

Problematic environment info with OpenBLAS 0.3.26 (OpenBLAS 0.3.26 does not seem to be mentioned anywhere in the output below, so I also added the sklearn.show_versions() output below ...)

1.12.0 1.26.4 sys.version_info(major=3, minor=12, micro=2, releaselevel='final', serial=0)
Build Dependencies:
  blas:
    detection method: pkgconfig
    found: true
    include directory: C:/Users/rjxQE/AppData/Local/miniforge3/envs/lof-issue-conda-2/Library/include
    lib directory: C:/Users/rjxQE/AppData/Local/miniforge3/envs/lof-issue-conda-2/Library/lib
    name: blas
    openblas configuration: unknown
    pc file directory: C:\bld\scipy-split_1706041678996\_h_env\Library\lib\pkgconfig
    version: 3.9.0
  lapack:
    detection method: pkgconfig
    found: true
    include directory: C:/Users/rjxQE/AppData/Local/miniforge3/envs/lof-issue-conda-2/Library/include
    lib directory: C:/Users/rjxQE/AppData/Local/miniforge3/envs/lof-issue-conda-2/Library/lib
    name: lapack
    openblas configuration: unknown
    pc file directory: C:\bld\scipy-split_1706041678996\_h_env\Library\lib\pkgconfig
    version: 3.9.0
  pybind11:
    detection method: pkgconfig
    include directory: C:/Users/rjxQE/AppData/Local/miniforge3/envs/lof-issue-conda-2/Library/include
    name: pybind11
    version: 2.11.1
Compilers:
  c:
    commands: clang-cl
    linker: lld-link
    linker args: --target=x86_64-pc-windows-msvc, -nostdlib, -Xclang, --dependent-lib=msvcrt,
      -fuse-ld=lld, -Wl,-defaultlib:C:\bld\scipy-split_1706041678996\_build_env/Library/lib/clang/17/lib/windows/clang_rt.builtins-x86_64.lib
    name: clang-cl
    version: 17.0.6
  c++:
    commands: clang-cl
    linker: lld-link
    linker args: --target=x86_64-pc-windows-msvc, -nostdlib, -Xclang, --dependent-lib=msvcrt,
      -fuse-ld=lld, -Wl,-defaultlib:C:\bld\scipy-split_1706041678996\_build_env/Library/lib/clang/17/lib/windows/clang_rt.builtins-x86_64.lib
    name: clang-cl
    version: 17.0.6
  cython:
    commands: cython
    linker: cython
    name: cython
    version: 3.0.7
  fortran:
    args: -D_CRT_SECURE_NO_WARNINGS, -D_MT, -D_DLL, --target=x86_64-pc-windows-msvc,
      -nostdlib
    commands: flang-new
    linker: lld-link
    linker args: --target=x86_64-pc-windows-msvc, -nostdlib, -Xclang, --dependent-lib=msvcrt,
      -fuse-ld=lld, -Wl,-defaultlib:C:\bld\scipy-split_1706041678996\_build_env/Library/lib/clang/17/lib/windows/clang_rt.builtins-x86_64.lib,
      -D_CRT_SECURE_NO_WARNINGS, -D_MT, -D_DLL, --target=x86_64-pc-windows-msvc, -nostdlib
    name: flang
    version: 17.0.6
  pythran:
    include directory: ..\..\_h_env\Lib\site-packages\pythran
    version: 0.15.0
Machine Information:
  build:
    cpu: x86_64
    endian: little
    family: x86_64
    system: windows
  cross-compiled: false
  host:
    cpu: x86_64
    endian: little
    family: x86_64
    system: windows
Python Information:
  path: C:\bld\scipy-split_1706041678996\_h_env\python.exe
  version: '3.12'

sklearn.show_versions()

System:
    python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:42:31) [MSC v.1937 64 bit (AMD64)]
executable: C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
      sklearn: 1.4.1.post1
          pip: 24.0
   setuptools: 69.2.0
        numpy: 1.26.4
        scipy: 1.12.0
       Cython: None
       pandas: None
   matplotlib: None
       joblib: 1.3.2
threadpoolctl: 3.3.0

Built with OpenMP: True

threadpoolctl info:
       user_api: openmp
   internal_api: openmp
    num_threads: 4
         prefix: libomp
       filepath: C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Library\bin\libomp.dll
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 4
         prefix: libblas
       filepath: C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Library\bin\libblas.dll
        version: 0.3.26
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 4
         prefix: vcomp
       filepath: C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\vcomp140.dll
        version: None
mattip commented 7 months ago

@martin-frbg does this look familiar?

The scikit-learn code does nested parallelism

@lesteve do you know how many threads/processes are opened?

martin-frbg commented 7 months ago

Not familiar, also not aware of any change in 0.3.26 that would be likely to cause it if the library was built with OpenMP support. (If not, there is a small chance that it might be related to https://github.com/OpenMathLib/OpenBLAS/pull/4359 - but again nothing of the sort known to date)

lesteve commented 7 months ago

Not familiar, also not aware of any change in 0.3.26 that would be likely to cause it if the library was built with OpenMP support.

I believe that both for wheels and conda-forge OpenBLAS is built with pthreads as the threading layer and not OpenMP. At least the scikit-learn output shows threading_layer: pthreads for both the wheel and conda-forge package.

I checked the threadpoolctl code and the threading_layer info is from openblas_get_parallel (following https://github.com/OpenMathLib/OpenBLAS/wiki/Faq#how-can-i-find-out-at-runtime-what-options-the-library-was-built-with-)
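
For anyone who wants to query this directly, here is a ctypes sketch of the same check (the DLL path is the one from the conda env above and would need adjusting to the local install):

import ctypes

libblas = ctypes.CDLL(r"C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Library\bin\libblas.dll")
# openblas_get_parallel returns 0 = sequential, 1 = platform threads (pthreads), 2 = OpenMP
print(libblas.openblas_get_parallel())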

martin-frbg commented 7 months ago

would need to revert 4359 then to test - or simply copy the 0.3.25 driver/others/blas_server_win32.c into 0.3.26

mseminatore commented 7 months ago

@lesteve can you please clarify 'The scikit-learn code does nested parallelism' for me to help narrow this down? Am I correct in interpreting that to mean that the code invokes exec_blas_async(), rather than exec_blas(), multiple times in a row to queue up several batches of work asynchronously?

I will take a look when I can free up from other tasks. It will probably be about a week before I can dig into this so having a clear repro scenario will help speed things up.

lesteve commented 7 months ago

@lesteve can you please clarify 'The scikit-learn code does nested parallelism' for me to help narrow this down?

There is an outer OpenMP parallel loop and inside the OpenMP parallel loop there are some OpenBLAS calls. This is coming from Python so I am not sure about your question about exec_blas_async vs exec_blas. I may try to have a simpler reproducer in Cython at one point, but this may take some time ...

Another angle of attack I have tried (and got stuck on) is confirming that https://github.com/OpenMathLib/OpenBLAS/pull/4359 is the issue. I am compiling OpenBLAS on Windows inside an MSYS2 terminal with make. At the end of the compilation I get a cygopenblas.dll. The command I use is:

make shared NOFORTRAN=1 NO_LAPACKE=1

I was naively hoping to copy this dll into my Python site-packages to replace libblas.dll (somewhere like C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-3\Library\bin\libblas.dll). If I do this, numpy fails to import, so I must be doing something wrong. If you have any insights, let me know!

The ImportError looks like this:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Lib\site-packages\numpy\__init__.py", line 149, in <module>
    from . import lib
  File "C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Lib\site-packages\numpy\lib\__init__.py", line 23, in <module>
    from . import index_tricks
  File "C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Lib\site-packages\numpy\lib\index_tricks.py", line 12, in <module>
    import numpy.matrixlib as matrixlib
  File "C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Lib\site-packages\numpy\matrixlib\__init__.py", line 4, in <module>
    from . import defmatrix
  File "C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Lib\site-packages\numpy\matrixlib\defmatrix.py", line 12, in <module>
    from numpy.linalg import matrix_power
  File "C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Lib\site-packages\numpy\linalg\__init__.py", line 73, in <module>
    from . import linalg
  File "C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Lib\site-packages\numpy\linalg\linalg.py", line 35, in <module>
    from numpy.linalg import _umath_linalg
ImportError: DLL load failed while importing _umath_linalg: The specified module could not be found.
martin-frbg commented 7 months ago

Problematic environment info with OpenBLAS 0.3.26 (OpenBLAS 0.3.26 does not seem to be mentioned anywhere in the output below, so I also added the sklearn.show_versions() output below ...)

1.12.0 1.26.4 sys.version_info(major=3, minor=12, micro=2, releaselevel='final', serial=0) Build Dependencies: blas: detection method: pkgconfig found: true include directory: C:/Users/rjxQE/AppData/Local/miniforge3/envs/lof-issue-conda-2/Library/include lib directory: C:/Users/rjxQE/AppData/Local/miniforge3/envs/lof-issue-conda-2/Library/lib name: blas openblas configuration: unknown pc file directory: C:\bld\scipy-split_1706041678996_h_env\Library\lib\pkgconfig version: 3.9.0

this looks strangely like it may have picked up a libblas from the "netlib" Reference-LAPACK project rather than OpenBLAS. Could it be that the various Python bits loaded multiple different BLAS implementations into the process?

ogrisel commented 7 months ago

@lesteve can you try mamba install "libopenblas=*=*openmp*" in your conda-forge env on Windows to see if that fixes the problem with your reproducer? This command should switch the threading layer from pthreads to openmp, and if the problem comes from nesting pthreads calls under OpenMP calls, that might fix it.

And if it does, it might reveal a thread-safety problem of openblas' handling of its native threading layer under Windows.

The code in scikit-learn calls OpenBLAS under an OpenMP parallel loop. To avoid oversubscription, it does call openblas_set_num_threads(1) (via threadpoolctl.threadpool_limits) prior to entering the OpenMP parallel loop.

Maybe the recent changes in https://github.com/OpenMathLib/OpenBLAS/pull/4425 introduced a thread-safety problem in this threading layer when globally changing the number of threads at runtime?
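
In other words, at the OpenBLAS API level the scikit-learn pattern boils down to something like the sketch below (raw ctypes calls, purely for illustration; threadpoolctl does this through its own wrappers, and the DLL path is the one from the report above):

import ctypes

libblas = ctypes.CDLL(r"C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Library\bin\libblas.dll")

n_threads = libblas.openblas_get_num_threads()
libblas.openblas_set_num_threads(1)           # before entering the OpenMP parallel loop
# ... OpenMP parallel region making OpenBLAS calls runs here ...
libblas.openblas_set_num_threads(n_threads)   # restored when the threadpool_limits scope exits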

lesteve commented 7 months ago

@ogrisel it looks like libopenblas with openmp is not available on conda-forge for Windows, only the pthreads variant is, for example:

$ mamba search 'libopenblas>=0.3.25'

Loading channels: done
# Name                       Version           Build  Channel
libopenblas                   0.3.25 pthreads_hc140b1d_0  conda-forge
libopenblas                   0.3.26 pthreads_hc140b1d_0  conda-forge
martin-frbg commented 7 months ago

Maybe the recent changes in

I'd consider that unlikely especially if you're not using an OpenMP build of OpenBLAS in the first place

lesteve commented 7 months ago

It does seem like https://github.com/OpenMathLib/OpenBLAS/pull/4359 is indeed the cause of this behaviour.

I managed to build OpenBLAS on Windows following https://github.com/OpenMathLib/OpenBLAS/wiki/How-to-use-OpenBLAS-in-Microsoft-Visual-Studio (Native (MSVC) ABI way) and replace the dll.

For completeness the way I used to setup the build:

cmake .. \
    -G "Ninja" -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_C_COMPILER=clang-cl \
    -DCMAKE_Fortran_COMPILER=flang -DCMAKE_MT=mt -DBUILD_WITHOUT_LAPACK=yes \
    -DNOFORTRAN=1 -DDYNAMIC_ARCH=OFF -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON
martin-frbg commented 7 months ago

weird as this should be a no-op if you are neither building with OpenMP support nor calling the newly introduced function

martin-frbg commented 7 months ago

wait, e60fb0f is the merge commit for Mark's 4359 (the windows thread server rewrite) not my 4425 - that's still unfortunate but a lot more plausible

lesteve commented 7 months ago

wait, e60fb0f is the merge commit for Mark's 4359 (the windows thread server rewrite) not my 4425 - that's still unfortunate but a lot more plausible

Indeed, my bad, I edited my previous message to reflect this (I missed the fact that two PRs were mentioned; I was focused on https://github.com/OpenMathLib/OpenBLAS/pull/4359 all along).

ogrisel commented 7 months ago

So it seems that the combined use of external OpenMP loops with calls to openblas_set_num_threads(1) is causing a deadlock with the new Windows thread server.

I am not familiar with Windows development, but is there a way to attach a debugger like GDB to a running Python process (e.g. by pid) and then do thread apply all bt to locate where/when the deadlock happens?

lesteve commented 7 months ago

Not a Windows expert either, but this is what I get with ProcessExplorer Threads view:

[screenshot: Process Explorer Threads view]

The two running threads have some stack traces like this:

Thread 10892:

0x0000000000000000
KERNEL32.DLL!SwitchToThread
libblas.dll!exec_blas+0x122
libblas.dll!blas_level1_thread_with_return_value+0x329
libblas.dll!ddot_k+0xee
libblas.dll!ddot_+0x50
cython_blas.cp312-win_amd64.pyd!PyInit_cython_blas+0x231f8
cython_blas.cp312-win_amd64.pyd!PyInit_cython_blas+0xa4a0
_cython_blas.cp312-win_amd64.pyd+0xfd97
_base.cp312-win_amd64.pyd!PyInit__base+0xbd28
VCOMP140.DLL!vcomp_atomic_div_r8+0x181a
VCOMP140.DLL!vcomp_fork+0x2dd
VCOMP140.DLL!vcomp_fork+0x29c
VCOMP140.DLL!vcomp_atomic_div_r8+0x1ea
KERNEL32.DLL!BaseThreadInitThunk+0x14
ntdll.dll!RtlUserThreadStart+0x21

Thread 7124:

0x0000000000000000
ntdll.dll!ZwYieldExecution+0x14
KERNELBASE.dll!SwitchToThread+0x24
libblas.dll!exec_blas+0x122
libblas.dll!blas_level1_thread_with_return_value+0x329
libblas.dll!ddot_k+0xee
libblas.dll!ddot_+0x50
cython_blas.cp312-win_amd64.pyd!PyInit_cython_blas+0x231f8
cython_blas.cp312-win_amd64.pyd!PyInit_cython_blas+0xa4a0
_cython_blas.cp312-win_amd64.pyd+0xfd97
_base.cp312-win_amd64.pyd!PyInit__base+0xbd28
VCOMP140.DLL!vcomp_atomic_div_r8+0x181a
VCOMP140.DLL!vcomp_fork+0x2dd
VCOMP140.DLL!vcomp_fork+0x29c
VCOMP140.DLL!vcomp_atomic_div_r8+0x1ea
KERNEL32.DLL!BaseThreadInitThunk+0x14
ntdll.dll!RtlUserThreadStart+0x21

Thread 7832 (Python.exe, so I guess the main thread?):

ntdll.dll!ZwWaitForSingleObject+0x14
KERNELBASE.dll!WaitForSingleObjectEx+0x8e
VCOMP140.DLL!vcomp_ordered_loop_end+0x1f94
VCOMP140.DLL!omp_get_wtick+0x263
VCOMP140.DLL!vcomp_barrier+0xe6
_base.cp312-win_amd64.pyd!PyInit__base+0xbd5d
VCOMP140.DLL!vcomp_atomic_div_r8+0x181a
VCOMP140.DLL!vcomp_fork+0x2dd
VCOMP140.DLL!vcomp_fork+0x29c
VCOMP140.DLL!vcomp_atomic_div_r8+0xb83
VCOMP140.DLL!vcomp_fork+0x1bc
_base.cp312-win_amd64.pyd+0xcabb
_base.cp312-win_amd64.pyd+0xf291
_argkmin.cp312-win_amd64.pyd+0x1211c
_argkmin.cp312-win_amd64.pyd+0x10d8f
python312.dll!PyType_Name+0x1b10
_argkmin.cp312-win_amd64.pyd!PyInit__argkmin+0x3ef5
_argkmin.cp312-win_amd64.pyd+0xcf8f
_argkmin.cp312-win_amd64.pyd+0xcb49
python312.dll!PyBytes_Repeat+0xf1
python312.dll!PyObject_Vectorcall+0x35
python312.dll!PyEval_EvalFrameDefault+0x7d0f
python312.dll!PyEval_EvalCode+0xe6
python312.dll!PyRun_FileExFlags+0x296
python312.dll!PyRun_FileExFlags+0x393
python312.dll!PyRun_StringFlags+0x178
python312.dll!PyRun_SimpleFileObject+0x2fb
python312.dll!Py_gitidentifier+0xb251
python312.dll!Py_gitidentifier+0xbe82
python312.dll!Py_RunMain+0x18
python312.dll!Py_Main+0x5c
python.exe!OPENSSL_Applink+0x380
KERNEL32.DLL!BaseThreadInitThunk+0x14
ntdll.dll!RtlUserThreadStart+0x21
ogrisel commented 7 months ago

Apparently it's possible to "Configure symbols" after installing "Debugging Tools for Windows", as explained here:

I have no idea what that entails, but maybe it will give more information (maybe even source code line numbers)?

ogrisel commented 7 months ago

Also, could you re-run your reproducer with faulthandler?

from sklearn.metrics._pairwise_distances_reduction import ArgKmin
import numpy as np
import threadpoolctl
import faulthandler
import sys

faulthandler.dump_traceback_later(timeout=10, repeat=False, file=sys.stderr, exit=False)

# Uncommenting either of these two lines avoids the hang
# threadpoolctl.threadpool_limits(limits=1, user_api='blas')
# threadpoolctl.threadpool_limits(limits=1, user_api='openmp')
X = np.zeros((20, 14000))
ArgKmin.compute(X=X, Y=X, k=10, metric='euclidean')

This should give us the Python-level tracebacks. We probably won't see the OpenMP and OpenBLAS managed threads but that might still be informative, in case the freeze happens when calling threadpoolctl.threadpool_limits(1, user_api="blas") in scikit-learn at lines:

martin-frbg commented 7 months ago

Hmm. My ignorant bet would have been on the new "shut down all surplus threads" loop somehow hanging on the openblas_set_num_threads(1), but it seems it made it to exec_blas() for the DDOT workload. Unfortunately I haven't the foggiest idea what could be wrong there, in particular as this does not appear to have been touched by the PR.

lesteve commented 7 months ago

The faulthandler info says:

Timeout (0:00:10)!
Thread 0x00000458 (most recent call first):
  File "C:\Users\rjxQE\AppData\Local\miniforge3\envs\lof-issue-conda-2\Lib\site-packages\sklearn\metrics\_pairwise_distances_reduction\_dispatcher.py", line 278 in compute
  File "C:\msys64\home\rjxQE\work\test-lof.py", line 13 in <module>

so this is probably there https://github.com/scikit-learn/scikit-learn/blob/719c0c6036ac99d45ffb45ed7c8b4e4b221384b5/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py#L278

Also, I added the stack trace of the Python process (probably the main thread, I guess) in https://github.com/scipy/scipy/issues/20294#issuecomment-2012295803

I did not manage to get more info by configuring symbols; I am guessing this is because we would need to compile scikit-learn with debug symbols on, and I am not sure I am ready to do that quite yet :wink:. I think spending some time to put together a simple reproducer in Cython may be more useful, but let me know if you disagree.

ogrisel commented 7 months ago

I updated my comment to point in the Cython file where I think this happens:

https://github.com/scipy/scipy/issues/20294#issuecomment-2012419439

The Python-level function is the direct parent. Unfortunately, ArgKmin64.compute is a compiled extension (Cython), hence we cannot see the callees without the debug symbols. But if the deadlock had happened in the openblas_set_num_threads call under threadpool_limits, I think we would have seen it in the Python traceback, because threadpool_limits is a Python function (called by a Cython/native caller).

So it's most probably happening in the OpenMP Cython code (deadlock of the OpenMP runtime) or in the OpenBLAS DGEMM call. But we need debug symbols.

I think spending some time to put together a simple reproducer in Cython may be more useful, but let me know if you disagree.

I agree that crafting a simpler reproducer in Cython or even pure C might be more productive.

martin-frbg commented 7 months ago

where does DGEMM come from now ? or did you mean DDOT as seen in the process traces ?

mseminatore commented 7 months ago

If you are able to build OpenBLAS, it would be helpful to build a version with -DSMP_DEBUG. That will produce a log on stderr from the thread server detailing threads, task management, queue state, etc. Having that log would give a peek at the execution paths.

If the call tree goes through exec_blas() rather than exec_blas_async() my initial thoughts on what this might be are altered. I had suspected that a rapid series of calls to exec_blas_async() could somehow corrupt the task queue. But exec_blas() calls are serialized on queue completion.

Another hypothesis would be a series of (rapid) calls to openblas_set_num_threads(1). Unless you've set OPENBLAS_NUM_THREADS=1, the library on startup always spins up one thread per CPU core (up to MAX_THREADS). On a subsequent call to openblas_set_num_threads() new threads may be added, but previously extra threads were not killed; they were left idle. I added code that trims the extra threads. If we have a simple repro, a quick test would be commenting out that code, since it is a smaller change/revert.

A clarifying question about the scenario: Is OMP making parallel calls into exec_blas()? I think there is a (possibly bad on my part) assumption that exec_blas() was not re-entrant since it blocks on the task queue completing. If you can confirm, I will review that code to see if that could be the problem.

lesteve commented 7 months ago

Here is the log with SMP_DEBUG

Full disclosure: I did not manage to pass -DSMP_DEBUG through cmake (although I did set CFLAGS=-DSMP_DEBUG and the ninja.build file looks like it is using the correct flag), so I ended up editing the code directly to get the debug print statements ...

I added code that trims the extra threads. If we have a simple repro, a quick test would be commenting out that code since it is a smaller change/revert.

Just to make sure I follow you, I am guessing you mean commenting out this code right? https://github.com/OpenMathLib/OpenBLAS/blob/3d2a9e4a6134a9033236360cd87fd0bab2f7a70f/driver/others/blas_server_win32.c#L550-L572

Here are my best guesses about your other questions (coming from the Python world, I am slightly out of my comfort zone here :grin:):

mseminatore commented 7 months ago

@lesteve Yes, those lines of code are for trimming extra threads. It would be my fault for making assumptions, but I don't think I was expecting a use-case that ping-ponged between N threads, 1 thread and then N threads again, potentially with re-entrant calls to exec_blas().

I will go over the code to see if I can find an issue with that pattern.

ilayn commented 7 months ago

Is there any detail about which SciPy function is called by the scikit-learn function that causes the hang? I am trying to find the SciPy part of the problem.

lesteve commented 7 months ago

So far I have not been able to reproduce the deadlock outside the scikit-learn code (i.e. with only SciPy and Cython), and I have found a workaround in scikit-learn, which is to protect an OpenBLAS call inside an OpenMP parallel loop by setting the OpenBLAS number of threads to 1 with threadpoolctl:

@@ -36,8 +37,9 @@ cdef float64_t[::1] _sqeuclidean_row_norms64_dense(
         intp_t d = X.shape[1]
         float64_t[::1] squared_row_norms = np.empty(n, dtype=np.float64)

-    for idx in prange(n, schedule='static', nogil=True, num_threads=num_threads):
-        squared_row_norms[idx] = _dot(d, X_ptr + idx * d, 1, X_ptr + idx * d, 1)
+    with threadpool_limits(limits=1, user_api='blas'):
+        for idx in prange(n, schedule='static', nogil=True, num_threads=num_threads):
+            squared_row_norms[idx] = _dot(d, X_ptr + idx * d, 1, X_ptr + idx * d, 1)

_dot is actually calling scipy.linalg.cython_blas.ddot, which I was naively expecting to be single-threaded, so I don't really know why calling openblas_set_num_threads(1) helps, but it does ...

ilayn commented 7 months ago

Can you please also try with dnrm2 (and square the result) to get the same number, so we can at least triangulate the issue to ddot?

mseminatore commented 7 months ago

The call stack above included this entry:

libblas.dll!blas_level1_thread_with_return_value+0x329

which says that threading of ddot was requested. I believe the code that calls this is in kernel/x86_64/ddot.c (this is the generic C CPU path, but assembly kernels should follow a similar pattern). If only one thread is requested, the ddot kernel is called directly; otherwise the kernel is handed to the level-1 threading dispatch.

#if defined(SMP)
    if (inc_x == 0 || inc_y == 0 || n <= 10000)
        nthreads = 1;
    else
        nthreads = num_cpu_avail(1);

    if (nthreads == 1) {
        dot = dot_compute(n, x, inc_x, y, inc_y);
    } else {
        int mode, i;
        char result[MAX_CPU_NUMBER * sizeof(double) * 2];
        RETURN_TYPE *ptr;

#if !defined(DOUBLE)
        mode = BLAS_SINGLE  | BLAS_REAL;
#else
        mode = BLAS_DOUBLE  | BLAS_REAL;
#endif
        blas_level1_thread_with_return_value(mode, n, 0, 0, &dummy_alpha,
                   x, inc_x, y, inc_y, result, 0,
                    (int (*)(void)) dot_thread_function, nthreads);
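
Note that the reproducers above use vectors of 14000 elements, which is above the n <= 10000 cutoff in this excerpt, so ddot does end up in the threaded dispatch.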
lesteve commented 7 months ago

Thanks for your input, I managed to reproduce with Cython and Scipy with this Cython code:

import numpy as np
from cython.parallel import prange
from scipy.linalg.cython_blas cimport ddot

cdef parallel_ddot(const double[::1] X, int num_threads):
    cdef:
        double *X_ptr = <double *> &X[0]
        int n = <int> X.shape[0]
        int inc = 1
        int idx

    for idx in prange(10, nogil=True, num_threads=num_threads):
        ddot(&n, X_ptr, &inc, X_ptr, &inc)

A = np.ones(14000)
parallel_ddot(A, 4)

and the setup.py used to compile it:

from setuptools import Extension, setup
from Cython.Build import cythonize

ext_modules = [
    Extension(
        "test_cython_sklearn",
        ["test_cython_sklearn.pyx"],
        extra_compile_args=['/openmp'],
        extra_link_args=['/openmp'],
    )
]

setup(
    name='test_cython_sklearn',
    ext_modules=cythonize(ext_modules),
)

@ilayn using dnrm2 rather than ddot on the same data did not reproduce the hang, I guess maybe dnrm2 does not use multi-threading?

ev-br commented 7 months ago

Can you try a fortran reproducer below?

$ cat ddot.f90 
implicit none

integer, parameter :: n = 140000
integer :: incx = 1
real*8 :: x(n), ddt

real*8, external :: ddot

x = 1.0d0

ddt = ddot(n, x, incx, x, incx)

print*, 'n =', n, " dot(x, x) =", ddt

end

$ gfortran ddot.f90 -lblas
$ ./a.out 
 n =      140000  dot(x, x) =   140000.00000000000 

I'm on linux so everything works fine for me:

$ ldd a.out 
    linux-vdso.so.1 (0x00007ffdf0953000)
    libopenblas.so.0 => /home/ev-br/.conda/envs/scipy-dev/lib/libopenblas.so.0 (0x00007fedc24a8000)
    libgfortran.so.5 => /home/ev-br/.conda/envs/scipy-dev/lib/libgfortran.so.5 (0x00007fedc22fd000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fedc21ff000)
    libgcc_s.so.1 => /home/ev-br/.conda/envs/scipy-dev/lib/libgcc_s.so.1 (0x00007fedc21e4000)
    libquadmath.so.0 => /home/ev-br/.conda/envs/scipy-dev/lib/libquadmath.so.0 (0x00007fedc21ab000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fedc1f82000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fedc1f7d000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fedc47a4000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fedc1f78000)

linking to -lblas -lgomp does not change anything either.

ilayn commented 7 months ago

@ilayn using dnrm2 rather than ddot on the same data did not reproduce the hang, I guess maybe dnrm2 does not use multi-threading?

Quite possible. But if you are not hit by any performance penalty, it is better to use dnrm2, which has better overflow/underflow protection, instead of a direct dot product for Euclidean norms.
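
A quick Python-level illustration of the equivalence (the scikit-learn code uses the Cython bindings; this is only to show the idea):

import numpy as np
from scipy.linalg.blas import ddot, dnrm2

x = np.ones(14000)
# dnrm2(x)**2 gives the same squared norm as ddot(x, x), with better
# overflow/underflow behaviour.
print(ddot(x, x), dnrm2(x) ** 2)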

martin-frbg commented 7 months ago

right, nrm2 on x86_64 is not multithreaded

lesteve commented 7 months ago

@ev-br sorry I have no idea how to compile a Fortran file on Windows (my development environment is Linux) ... a naive attempt with flang-new got me nowhere.

I am reasonably sure that the problem only happens on Windows and that you need OpenBLAS multi-threading within OpenMP multi-threading to trigger the deadlock.

ev-br commented 7 months ago

Sure, here's a C reproducer, might be easier on Windows:

$ cat ddot.c
#include<stdio.h>
#include<stdlib.h>
#include "cblas.h"

int main(){

int n = 14000;
int incx = 1;
double *x, ddt;

x = (double*)malloc(n*sizeof(double));

for (int i=0; i<n; i++){
     x[i] = 1.0;
}

ddt = cblas_ddot(n, x, incx, x, incx);
printf("dot(x, x) = %f\n", ddt);
free(x);
}

ISTM it would be useful to see if the problem is with OpenBLAS itself or with Cython.

rgommers commented 7 months ago

ISTM it would be useful to see if the problem is with OpenBLAS itself or with Cython.

This will be quite useful (with the note that it's mostly about taking SciPy out of the equation rather than Cython itself). The reproducer should use nested parallelism with an explicit OpenMP call, though; otherwise nothing will go wrong.

ev-br commented 7 months ago
$ cat ddot.c 
#include<stdio.h>
#include<stdlib.h>
#include "cblas.h"

int main(){

int n = 14000;
int incx = 1;
double *x, ddt;

x = (double*)malloc(n*sizeof(double));

for (int i=0; i<n; i++){
     x[i] = 1.0;
}

#pragma omp parallel
{
#pragma omp parallel for
for (int i=0; i<10; i++){
   ddt = cblas_ddot(n, x, incx, x, incx);
}
}
printf("dot(x, x) = %f", ddt);
free(x);
}
mseminatore commented 7 months ago

@ev-br I am curious (and will try it out myself), but does the non-OMP version reproduce the issue?

And is that code representative of how scipy is calling OpenBLAS, or is it just a simple repro? I am trying to understand the desired calling pattern. I assume that doing the same calculation on the same data (redundantly) in parallel is not the actual scenario.

ev-br commented 7 months ago

I cannot repro the issue with or without OpenMP on Linux. (Edited to remove autocorrection nonsense).

My repro above is indeed a simple example; I assume scikit-learn does something else, not just repeated computations in a loop. I suppose this example can be made closer to what the OP was doing, in an environment where the OP's issue is reproducible.

lesteve commented 7 months ago

For completeness' sake, below is a Cython snippet that is close to what scikit-learn is doing, i.e. computing squared norms for each row of a 2d array in parallel.

cdef sqeuclidean_row_norms(const float64_t[:, ::1] X, intp_t num_threads):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        float64_t * X_ptr = <float64_t *> &X[0, 0]
        intp_t idx = 0
        int inc = 1
        intp_t n = X.shape[0]
        intp_t d = X.shape[1]
        int id = <int> d
        intp_t thread_num;
        float64_t[::1] squared_row_norms = np.empty(n, dtype=np.float64)

    # with threadpool_limits(limits=1, user_api='blas'):
    # with threadpool_limits(limits=4, user_api='blas'):
    cdef int nb_threads = omp_get_max_threads()
    printf("number of threads: %d", nb_threads)
    for idx in prange(n, schedule='static', nogil=True, num_threads=num_threads):
        squared_row_norms[idx] = ddot(&id, X_ptr + idx * d, &inc, X_ptr + idx * d, &inc)

    return squared_row_norms
lesteve commented 7 months ago

I have managed to compile the C snippet from https://github.com/scipy/scipy/issues/20294#issuecomment-2018273698 on my Windows VM and indeed I can reproduce the observed behaviour:

That seems to indicate that the problem indeed lies in OpenBLAS and not SciPy, so I am going to close this issue. I have created https://github.com/OpenMathLib/OpenBLAS/issues/4582 to track the problem in the OpenBLAS project.

For completeness, in the meantime we have a workaround in scikit-learn: https://github.com/scikit-learn/scikit-learn/pull/28692.

tylerjereddy commented 7 months ago

If you don't mind, I'm going to leave this open until we take an action that fixes the problem in our wheels for the 1.13.0 release process. It is easy for me to forget about stuff otherwise, especially with the NumPy 2.0 support challenge coming up on top of it.

mseminatore commented 7 months ago

I am looking over the code now. From the provided log I think I know what is happening, so I am looking to see if I can identify a safe fix, and a test to prove it addresses the issue with the repro case.

Open questions that I have:

martin-frbg commented 6 months ago

@mseminatore there are no restrictions in the C reproducer, so it would be N threads both in the OpenMP parallel call and in the OpenBLAS instance(s) it creates. This is certainly not ideal as the two thread pools do not "know" of each other (an OpenBLAS built with OpenMP would restrict itself to a single thread by default when it detects that it is inside a parallel region, a pthreads-only OpenBLAS cannot know). However, this pattern used to work in practice up to your PR (and still works on non-Windows platforms), so should definitely remain supported.

tylerjereddy commented 6 months ago

Keep me in the loop here. With pybind11 2.12.0 out the door now, NumPy 2.0.0rc1 may be imminent, which then puts some pressure on us to get SciPy 1.13.0 finalized.

If OpenBLAS has an unreleased-but-fixed version I may ask Matti to help me get that into a form that we can consume for wheel builds.

mseminatore commented 6 months ago

@lesteve @tylerjereddy @martin-frbg I have been able to reliably reproduce this issue, believe I have determined the cause and am testing the fix now. Once testing confirms the fix I will send over a PR. ETA is today.

The good news is that if this addresses the issue, the change is very small: an unfortunate bug in task queue management that only shows up with nested parallelism.

@martin-frbg we should add a test case for this scenario - ideally one that fails now with the current code so that we can confirm the fix addresses that failing test case. I am not familiar with how to integrate a utest that uses OMP. Once I have a confirmed fix I will explore that but any pointers to accelerate me are appreciated.

martin-frbg commented 6 months ago

Thanks - converting the reproducer to a utest module should be no problem as long as it is guarded by an appropriate USE_OPENMP conditional in the Makefile/CMakeLists.txt

martin-frbg commented 6 months ago

Hmm... on second thought, not so easy, as we want an OpenMP outer loop with a non-OpenMP OpenBLAS, i.e. it is not even guaranteed at this point that the build environment is capable of OpenMP.