pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Segmentation error for torch==2.2.1 on MacOs #121101

Open CloseChoice opened 8 months ago

CloseChoice commented 8 months ago

🐛 Describe the bug

At shap, we have run into problems with our CI jobs on macOS, e.g. see here. I tracked this down to an issue with torch==2.2.1.

Here is code to reproduce the issue (this works on torch==2.2.0):

import time

import torch
from sklearn.datasets import fetch_california_housing

def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)

(execute with python -m pytest <filename>)

Stacktrace:

bash-3.2$ python -m pytest tests/explainers/test_segfault_minimal_example2.py                                                                                                                               
=========================================================================================== test session starts ============================================================================================
platform darwin -- Python 3.11.8, pytest-8.1.0, pluggy-1.4.0
Matplotlib: 3.8.3
Freetype: 2.6.1
rootdir: /Users/runner/work/shap/shap
configfile: pyproject.toml
plugins: cov-4.1.0, mpl-0.17.0
collected 1 item                                                                                                                                                                                           

tests/explainers/test_segfault_minimal_example2.py Fatal Python error: Segmentation fault

Thread 0x00000001140ad600 (most recent call first):
  File "/Users/runner/work/shap/shap/tests/explainers/test_segfault_minimal_example2.py", line 8 in test_something
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/python.py", line 1769 in runtest
Segmentation fault: 11

Versions

PyTorch version: 2.2.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 12.7.3 (x86_64)
GCC version: Could not collect
Clang version: 14.0.0 (clang-1400.0.29.202)
CMake version: version 3.28.3
Libc version: N/A

Python version: 3.11.8 (v3.11.8:db85d51d3e, Feb  6 2024, 18:02:37) [Clang 13.0.0 (clang-1300.0.29.30)] (64-bit runtime)
Python platform: macOS-12.7.3-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.1
[pip3] torchvision==0.17.0
[conda] No relevant packages

cc @malfet @albanD @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

malfet commented 8 months ago

Is this reproducible if one uses Apple Silicon M1 runners? (Though Torch-2.2 is the last release to support Intel Macs per https://github.com/pytorch/pytorch/issues/114602 )

At least I cannot reproduce it on M1. I also tried it in x86 Rosetta mode and cannot reproduce it there either:

arch -arch x86_64 "/Applications/Python 3.11//IDLE.app/Contents/MacOS/Python" -mpytest ~/test/bug-121101.py

Nor can I repro in GitHub CI: https://github.com/malfet/deleteme/actions/runs/8150940508/job/22278030319?pr=79

connortann commented 8 months ago

I can reproduce in GitHub CI (over in the shap repo) with a slightly different setup:

I'll see if I can identify what the relevant difference is between that job and your run above; perhaps it's related to having different dependencies installed.

connortann commented 8 months ago

Reproducing the issue on GitHub Actions

I can reproduce the minimal reproducible example above on GitHub Actions, with the environment below.

The test snippet passes in an environment created with pip install pytest torch scikit-learn, but fails if the env also includes lightgbm.

The examples below ran on GitHub Actions with macos-latest, python=3.11.8, torch 2.2.1.

Reproducible example

As above:

import time

import torch
from sklearn.datasets import fetch_california_housing

def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)

Passing run

Example passing run: https://github.com/shap/shap/actions/runs/8248044359/job/22557508223 Output of pip list:

```
Package           Version
----------------- -----------
certifi           2024.2.2
filelock          3.13.1
fsspec            2024.2.0
iniconfig         2.0.0
Jinja2            3.1.3
joblib            1.3.2
MarkupSafe        2.1.5
mpmath            1.3.0
networkx          3.2.1
numpy             1.26.4
packaging         24.0
pip               24.0
pluggy            1.4.0
pytest            8.1.1
scikit-learn      1.4.1.post1
scipy             1.12.0
setuptools        65.5.0
sympy             1.12
threadpoolctl     3.3.0
torch             2.2.1
typing_extensions 4.10.0
```

Failing run

Example failing run: https://github.com/shap/shap/actions/runs/8248015803/job/22557423230 Output of pip list (identical apart from lightgbm):

```
Package           Version
----------------- -----------
certifi           2024.2.2
filelock          3.13.1
fsspec            2024.2.0
iniconfig         2.0.0
Jinja2            3.1.3
joblib            1.3.2
lightgbm          4.3.0
MarkupSafe        2.1.5
mpmath            1.3.0
networkx          3.2.1
numpy             1.26.4
packaging         24.0
pip               24.0
pluggy            1.4.0
pytest            8.1.1
scikit-learn      1.4.1.post1
scipy             1.12.0
setuptools        65.5.0
sympy             1.12
threadpoolctl     3.3.0
torch             2.2.1
typing_extensions 4.10.0
```

MarcBresson commented 3 months ago

Any news on this issue? We are having the same problem.

connortann commented 3 months ago

Over at the "shap" project we are still seeing this issue on CI, and it's preventing us from testing against the latest pytorch on macOS. Example failing run here. We still see the issue with torch==2.4.0.

@malfet to help the investigation progress, here's a full minimal GitHub Actions workflow to reproduce the error:

# run_tests.yml
jobs:
  run_tests:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: 3.11
      - run: brew install libomp
      - run: pip install pytest torch scikit-learn lightgbm
      - run: pip list
      - run: pytest --noconftest test_bug.py

# test_bug.py
import time

import lightgbm
import torch
from sklearn.datasets import fetch_california_housing

def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)

Leads to Fatal Python error: Segmentation fault. Full output:

```
Run pytest --noconftest tests/test_bug121101.py
============================= test session starts ==============================
platform darwin -- Python 3.11.9, pytest-8.3.2, pluggy-1.5.0
rootdir: /Users/runner/work/shap/shap
configfile: pyproject.toml
collected 1 item

Fatal Python error: Segmentation fault

Thread 0x0000000204c1cc00 (most recent call first):
tests/test_bug121101.py
  File "/Users/runner/work/shap/shap/tests/test_bug121101.py", line 12 in test_something
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/python.py", line 159 in pytest_pyfunc_call
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/python.py", line 1627 in runtest
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 174 in pytest_runtest_call
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 242 in
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 341 in from_call
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 241 in call_and_report
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 132 in runtestprotocol
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/runner.py", line 113 in pytest_runtest_protocol
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line 362 in pytest_runtestloop
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line 337 in _main
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line 283 in wrap_session
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/main.py", line 330 in pytest_cmdline_main
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/config/__init__.py", line 175 in main
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/_pytest/config/__init__.py", line 201 in console_main
  File "/Users/runner/hostedtoolcache/Python/3.11.9/arm64/bin/pytest", line 8 in

Extension modules: numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, scipy._lib._ccallback_c, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, sklearn.__check_build._check_build, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.special.cython_special, scipy.stats._stats, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.stats._unuran.unuran_wrapper, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sklearn.utils._random, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, sklearn.datasets._svmlight_format_fast, sklearn.feature_extraction._hashing_fast (total: 130)

Fatal Python error: Segmentation fault

Extension modules: numpy._core._multiarray_umath, numpy._core._multiarray_tests, numpy.linalg._umath_linalg, scipy._lib._ccallback_c, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, sklearn.__check_build._check_build, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.special.cython_special, scipy.stats._stats, scipy.stats._biasedurn, scipy.stats._levy_stable.levyst, scipy.stats._stats_pythran, scipy._lib._uarray._uarray, scipy.stats._ansari_swilk_statistics, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._mvn, scipy.stats._rcont.rcont, scipy.stats._unuran.unuran_wrapper, scipy.ndimage._nd_image, _ni_label, scipy.ndimage._ni_label, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, sklearn.utils._random, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, sklearn.datasets._svmlight_format_fast, sklearn.feature_extraction._hashing_fast (total: 130)
/Users/runner/work/_temp/7013399c-b6ff-43a4-b289-cc08191dbadb.sh: line 1: 2783 Segmentation fault: 11 pytest --noconftest tests/test_bug121101.py
```

Result of pip list:

```
Package           Version
----------------- --------
certifi           2024.7.4
filelock          3.15.4
fsspec            2024.6.1
iniconfig         2.0.0
Jinja2            3.1.4
joblib            1.4.2
lightgbm          4.5.0
MarkupSafe        2.1.5
mpmath            1.3.0
networkx          3.3
numpy             2.0.1
packaging         24.1
pip               24.2
pluggy            1.5.0
pytest            8.3.2
scikit-learn      1.5.1
scipy             1.14.0
setuptools        65.5.0
sympy             1.13.1
threadpoolctl     3.5.0
torch             2.4.0
```

malfet commented 3 months ago

@connortann thank you for the reproducer. Crash is due to multiple OpenMP runtimes loaded into the process address space:

$ lldb -- python bug-121101.py
(lldb) r
Process 16319 launched: '/Users/malfet/py3.12-torch2.4/bin/python' (arm64)
Process 16319 stopped
* thread #2, stop reason = exec
    frame #0: 0x0000000100014b70 dyld`_dyld_start
dyld`_dyld_start:
->  0x100014b70 <+0>:  mov    x0, sp
    0x100014b74 <+4>:  and    sp, x0, #0xfffffffffffffff0
    0x100014b78 <+8>:  mov    x29, #0x0 ; =0 
    0x100014b7c <+12>: mov    x30, #0x0 ; =0 
(lldb) c
Process 16319 resuming
Process 16319 stopped
* thread #3, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
->  0x106428cf0 <+48>: ldr    x19, [x8, w0, sxtw #3]
    0x106428cf4 <+52>: mov    x0, x19
    0x106428cf8 <+56>: bl     0x106428434    ; __kmp_suspend_initialize_thread
    0x106428cfc <+60>: mov    x0, x19
  thread #4, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
    frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
->  0x106428cf0 <+48>: ldr    x19, [x8, w0, sxtw #3]
    0x106428cf4 <+52>: mov    x0, x19
    0x106428cf8 <+56>: bl     0x106428434    ; __kmp_suspend_initialize_thread
    0x106428cfc <+60>: mov    x0, x19
  thread #5, stop reason = EXC_BAD_ACCESS (code=1, address=0x18)
    frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
->  0x106428cf0 <+48>: ldr    x19, [x8, w0, sxtw #3]
    0x106428cf4 <+52>: mov    x0, x19
    0x106428cf8 <+56>: bl     0x106428434    ; __kmp_suspend_initialize_thread
    0x106428cfc <+60>: mov    x0, x19
  thread #6, stop reason = EXC_BAD_ACCESS (code=1, address=0x20)
    frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
->  0x106428cf0 <+48>: ldr    x19, [x8, w0, sxtw #3]
    0x106428cf4 <+52>: mov    x0, x19
    0x106428cf8 <+56>: bl     0x106428434    ; __kmp_suspend_initialize_thread
    0x106428cfc <+60>: mov    x0, x19
  thread #8, stop reason = EXC_BAD_ACCESS (code=1, address=0x30)
    frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
->  0x106428cf0 <+48>: ldr    x19, [x8, w0, sxtw #3]
    0x106428cf4 <+52>: mov    x0, x19
    0x106428cf8 <+56>: bl     0x106428434    ; __kmp_suspend_initialize_thread
    0x106428cfc <+60>: mov    x0, x19
(lldb) image list libomp.dylib
[  0] E3A31AB3-3AE5-3371-87D0-7FD870A41A0D 0x00000001034f4000 /Users/malfet/py3.12-torch2.4/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib 
[  1] ACB8253B-DF8F-36C8-8100-C896CD3382ED 0x00000001063d4000 /opt/homebrew/Cellar/libomp/18.1.4/lib/libomp.dylib 
[  2] F53B1E01-AF16-30FC-8690-F7B131EB6CE5 0x0000000106744000 /Users/malfet/py3.12-torch2.4/lib/python3.12/site-packages/torch/lib/libomp.dylib 
(lldb)

connortann commented 3 months ago

If I comment out the brew install libomp step on CI, we get a different error: Library not loaded: @rpath/libomp.dylib. According to this comment, https://github.com/microsoft/LightGBM/issues/6262#issuecomment-1885303539 , that error apparently comes from OpenMP not being installed.

Full traceback if brew install libomp is commented out:

```
Run pytest --noconftest tests/test_bug121101.py
============================= test session starts ==============================
platform darwin -- Python 3.11.9, pytest-8.3.2, pluggy-1.5.0
rootdir: /Users/runner/work/shap/shap
configfile: pyproject.toml
collected 0 items / 1 error

==================================== ERRORS ====================================
___________________ ERROR collecting tests/test_bug121101.py ___________________
tests/test_bug121101.py:5: in
    import lightgbm
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/__init__.py:9: in
    from .basic import Booster, Dataset, Sequence, register_logger
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/basic.py:281: in
    _LIB = _load_lib()
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/basic.py:265: in _load_lib
    lib = ctypes.cdll.LoadLibrary(lib_path[0])
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ctypes/__init__.py:454: in LoadLibrary
    return self._dlltype(name)
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/ctypes/__init__.py:376: in __init__
    self._handle = _dlopen(self._name, mode)
E   OSError: dlopen(/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.dylib, 0x0006): Library not loaded: @rpath/libomp.dylib
E   Referenced from: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.dylib
E   Reason: tried: '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/opt/local/lib/libomp/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/local/lib/libomp/libomp.dylib' (no such file), '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/opt/local/lib/libomp/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/local/lib/libomp/libomp.dylib' (no such file)
=========================== short test summary info ============================
ERROR tests/test_bug121101.py - OSError: dlopen(/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.dylib, 0x0006): Library not loaded: @rpath/libomp.dylib
  Referenced from: /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/lightgbm/lib/lib_lightgbm.dylib
  Reason: tried: '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/opt/local/lib/libomp/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/local/lib/libomp/libomp.dylib' (no such file), '/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/opt/libomp/lib/libomp.dylib' (no such file), '/opt/local/lib/libomp/libomp.dylib' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/opt/local/lib/libomp/libomp.dylib' (no such file)
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 0.95s ===============================
Error: Process completed with exit code 2.
```

malfet commented 3 months ago

To be frank, I'm unsure if the problem lies solely with PyTorch at this point, as two other runtimes are importing libomp, and there isn't much one can do short of disabling OpenMP (which one can do programmatically by calling torch.set_num_threads(1)).

malfet commented 3 months ago

@connortann can you please try adding torch.set_num_threads(1) at the start of your test and let me know whether or not it fixes the problem? (It works for me locally.)
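
The workaround suggested above can be sketched as a small helper. `limit_torch_threads` is a hypothetical name used only for illustration; `torch.set_num_threads` and `torch.get_num_threads` are the real PyTorch calls. This is a minimal sketch that avoids the crash by keeping PyTorch single-threaded; it does not remove the extra OpenMP runtimes from the process.

```python
def limit_torch_threads(num_threads=1):
    """Restrict PyTorch intra-op parallelism and return the effective thread count.

    Returns None when torch is not installed, so the helper can be imported
    in environments without PyTorch. (Hypothetical helper, not a PyTorch API.)
    """
    try:
        import torch
    except ImportError:
        return None
    # With one thread, PyTorch should not need to spawn OpenMP worker threads.
    torch.set_num_threads(num_threads)
    return torch.get_num_threads()
```

Calling `limit_torch_threads(1)` at the top of the test module, before any tensor is created, would mirror the suggestion in this comment. The cost is that CPU tensor operations then run without intra-op parallelism.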

connortann commented 3 months ago

Yep certainly: the tests do indeed pass with torch.set_num_threads(1).

> I'm unsure if problem lies solely with PyTorch at this point

Indeed, as the segfault only occurs when lightgbm is imported first. Possibly relevant: we had a separate segfault issue when torch is imported before lightgbm, as described in this comment: https://github.com/shap/shap/issues/3092#issuecomment-1636806906

I hope that collectively we can find a fix. torch and lightgbm are both extremely popular libraries, so it's quite common for them to be installed in the same environment.
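
Because the crash depends on import order, one way to compare orders without killing the test runner is to execute each variant in a fresh interpreter and inspect the exit status. The sketch below uses only the standard library; `run_isolated` and `IMPORT_ORDER_SNIPPETS` are hypothetical names, and the snippets assume torch, lightgbm and scikit-learn are installed. On POSIX systems, `subprocess` reports a child killed by signal N as return code -N.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=120):
    """Run `snippet` in a fresh Python process and describe how it exited."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    killed_by = None
    if proc.returncode < 0:
        # Negative return code: the child was terminated by a signal (POSIX only).
        killed_by = signal.Signals(-proc.returncode).name
    return {"returncode": proc.returncode, "signal": killed_by, "stderr": proc.stderr}


# Body of the reproducer from this thread, shared by both import orders.
_REPRO_BODY = (
    "import time\n"
    "from sklearn.datasets import fetch_california_housing\n"
    "X, y = fetch_california_housing(return_X_y=True)\n"
    "torch.tensor(X)\n"
    "time.sleep(3)\n"
)

# Hypothetical mapping of label -> code to run; not executed automatically.
IMPORT_ORDER_SNIPPETS = {
    "lightgbm_then_torch": "import lightgbm\nimport torch\n" + _REPRO_BODY,
    "torch_then_lightgbm": "import torch\nimport lightgbm\n" + _REPRO_BODY,
}
```

For example, `run_isolated(IMPORT_ORDER_SNIPPETS["lightgbm_then_torch"])["signal"]` would be expected to report `"SIGSEGV"` on an affected macOS machine, although this has not been verified here.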

connortann commented 3 months ago

I cross-posted to LightGBM, because as you say the problem doesn't seem to lie solely with pytorch: https://github.com/microsoft/LightGBM/issues/6595

yuygfgg commented 2 months ago

I'd like to add that this pytorch segmentation fault on macOS does not necessarily require LightGBM. Other libraries, such as vapoursynth, can cause the same problem.

lorentzenchr commented 1 month ago

As this issue requires a community effort, it is probably best to centralize the discussion. @malfet would you be willing to join https://github.com/microsoft/LightGBM/issues/6595#issuecomment-2351398026?

starteleport commented 1 month ago

I am having this problem as well.

My objective is to run https://github.com/black-forest-labs/flux demo with PyTorch 2.4.1 on Intel MacBook Pro's Radeon 5500M.

What I've done so far:

After all that the segfault wouldn't go away.

I'm ready to dig into the issue, but I need some guidance/fresh ideas to facilitate the investigation.

lorentzenchr commented 3 weeks ago

@gchanan @dzhulgakov @ezyang @malfet If you could have a look and participate in the discussion in https://github.com/microsoft/LightGBM/issues/6595, that would be highly appreciated. I consider those kinds of bugs among the worst for users.

This issue is mainly caused by pytorch. The short summary of https://github.com/microsoft/LightGBM/issues/6595#issuecomment-2351398026 is:

torch vendors a libomp.dylib (without library or symbol name mangling) and always prefers that vendored copy to a system installation.

lightgbm searches for a system installation.

As a result, if you've installed both these libraries via wheels on macOS, loading both will result in 2 copies of libomp.dylib being loaded. This may or may not show up as runtime issues... unpredictable, because symbol resolution is lazy by default and therefore depends on the code paths used.

Even if all copies of libomp.dylib loaded into the process are ABI-compatible with each other, there can still be runtime segfaults as a result of mixing symbols from libraries loaded at different memory addresses, I think.
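
To check whether a given process is in the situation described above, one can list the OpenMP runtimes it has loaded. The following is a minimal sketch, not an official diagnostic: `find_openmp_runtimes` and `loaded_openmp_runtimes` are hypothetical helpers, the prefix list is an assumption about common runtime file names, and the last entry of `SAMPLE_PATHS` is made up to show a non-matching library. The first three sample paths are taken from the `image list` output earlier in this thread. `threadpool_info` is the real threadpoolctl function (threadpoolctl appears in the `pip list` outputs above as a scikit-learn dependency).

```python
import os

# Assumed file-name prefixes of common OpenMP runtimes (LLVM, GNU, Intel).
OPENMP_PREFIXES = ("libomp", "libgomp", "libiomp")

# First three paths come from the lldb `image list` output in this thread;
# the last one is a made-up non-OpenMP library for contrast.
SAMPLE_PATHS = [
    "/Users/malfet/py3.12-torch2.4/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib",
    "/opt/homebrew/Cellar/libomp/18.1.4/lib/libomp.dylib",
    "/Users/malfet/py3.12-torch2.4/lib/python3.12/site-packages/torch/lib/libomp.dylib",
    "/usr/lib/libexample.dylib",
]


def find_openmp_runtimes(library_paths):
    """Return the paths whose file name looks like an OpenMP runtime."""
    return [
        path
        for path in library_paths
        if os.path.basename(path).startswith(OPENMP_PREFIXES)
    ]


def loaded_openmp_runtimes():
    """List OpenMP runtimes loaded in the current process via threadpoolctl.

    Returns an empty list when threadpoolctl is not installed.
    """
    try:
        from threadpoolctl import threadpool_info
    except ImportError:
        return []
    return [
        info["filepath"]
        for info in threadpool_info()
        if info.get("user_api") == "openmp"
    ]
```

Running `loaded_openmp_runtimes()` after importing torch, lightgbm and scikit-learn, and seeing more than one path, would indicate the multiple-runtime condition that the summary describes.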