CloseChoice opened this issue 8 months ago
Is this reproducible if one uses Apple Silicon M1 runners? (Though Torch-2.2 is the last release to support Intel Macs per https://github.com/pytorch/pytorch/issues/114602 )
At least I cannot reproduce it on M1; now trying it in x86 Rosetta mode. Cannot reproduce it in a Rosetta environment either:
arch -arch x86_64 "/Applications/Python 3.11//IDLE.app/Contents/MacOS/Python" -mpytest ~/test/bug-121101.py
Nor can I repro in GitHub CI: https://github.com/malfet/deleteme/actions/runs/8150940508/job/22278030319?pr=79
I can reproduce in GitHub CI (over in the shap repo) with a slightly different setup:
I'll see if I can identify what the relevant difference is between that job and your run above; perhaps it's related to having different dependencies installed.
I can reproduce the minimal reproducible example above on GitHub Actions, with the environment below.
The test snippet passes in an environment created with pip install pytest torch scikit-learn, but fails if the env also includes lightgbm.
The examples below ran on GitHub Actions with macos-latest, python 3.11.8, torch 2.2.1.
As above:
import time

import torch
from sklearn.datasets import fetch_california_housing


def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)
Example passing run: https://github.com/shap/shap/actions/runs/8248044359/job/22557508223
Output of pip list:
Example failing run: https://github.com/shap/shap/actions/runs/8248015803/job/22557423230
Output of pip list (identical apart from lightgbm):
Any news on this issue? We are having the same problem.
Over at the "shap" project we are still seeing issue on CI, and it's preventing us from testing against the latest pytorch on MacOS. Example failing run here. We still see the issue with torch==2.4.0
.
@malfet to help the investigation progress, here's a full minimal GitHub Actions workflow to reproduce the error:
# run_tests.yml
on: push  # trigger assumed; any event that runs the job will do
jobs:
  run_tests:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: 3.11
      - run: brew install libomp
      - run: pip install pytest torch scikit-learn lightgbm
      - run: pip list
      - run: pytest --noconftest test_bug.py
# test_bug.py
import time

import lightgbm  # noqa: F401 -- importing lightgbm alongside torch is what triggers the crash
import torch
from sklearn.datasets import fetch_california_housing


def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)
Leads to Fatal Python error: Segmentation fault. Full output:
Result of pip list:
@connortann thank you for the reproducer. The crash is due to multiple OpenMP runtimes being loaded into the process address space:
$ lldb -- python bug-121101.py
(lldb) r
Process 16319 launched: '/Users/malfet/py3.12-torch2.4/bin/python' (arm64)
Process 16319 stopped
* thread #2, stop reason = exec
frame #0: 0x0000000100014b70 dyld`_dyld_start
dyld`_dyld_start:
-> 0x100014b70 <+0>: mov x0, sp
0x100014b74 <+4>: and sp, x0, #0xfffffffffffffff0
0x100014b78 <+8>: mov x29, #0x0 ; =0
0x100014b7c <+12>: mov x30, #0x0 ; =0
(lldb) c
Process 16319 resuming
Process 16319 stopped
* thread #3, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
-> 0x106428cf0 <+48>: ldr x19, [x8, w0, sxtw #3]
0x106428cf4 <+52>: mov x0, x19
0x106428cf8 <+56>: bl 0x106428434 ; __kmp_suspend_initialize_thread
0x106428cfc <+60>: mov x0, x19
thread #4, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
-> 0x106428cf0 <+48>: ldr x19, [x8, w0, sxtw #3]
0x106428cf4 <+52>: mov x0, x19
0x106428cf8 <+56>: bl 0x106428434 ; __kmp_suspend_initialize_thread
0x106428cfc <+60>: mov x0, x19
thread #5, stop reason = EXC_BAD_ACCESS (code=1, address=0x18)
frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
-> 0x106428cf0 <+48>: ldr x19, [x8, w0, sxtw #3]
0x106428cf4 <+52>: mov x0, x19
0x106428cf8 <+56>: bl 0x106428434 ; __kmp_suspend_initialize_thread
0x106428cfc <+60>: mov x0, x19
thread #6, stop reason = EXC_BAD_ACCESS (code=1, address=0x20)
frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
-> 0x106428cf0 <+48>: ldr x19, [x8, w0, sxtw #3]
0x106428cf4 <+52>: mov x0, x19
0x106428cf8 <+56>: bl 0x106428434 ; __kmp_suspend_initialize_thread
0x106428cfc <+60>: mov x0, x19
thread #8, stop reason = EXC_BAD_ACCESS (code=1, address=0x30)
frame #0: 0x0000000106428cf0 libomp.dylib`void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) + 48
libomp.dylib`__kmp_suspend_64<false, true>:
-> 0x106428cf0 <+48>: ldr x19, [x8, w0, sxtw #3]
0x106428cf4 <+52>: mov x0, x19
0x106428cf8 <+56>: bl 0x106428434 ; __kmp_suspend_initialize_thread
0x106428cfc <+60>: mov x0, x19
(lldb) image list libomp.dylib
[ 0] E3A31AB3-3AE5-3371-87D0-7FD870A41A0D 0x00000001034f4000 /Users/malfet/py3.12-torch2.4/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib
[ 1] ACB8253B-DF8F-36C8-8100-C896CD3382ED 0x00000001063d4000 /opt/homebrew/Cellar/libomp/18.1.4/lib/libomp.dylib
[ 2] F53B1E01-AF16-30FC-8690-F7B131EB6CE5 0x0000000106744000 /Users/malfet/py3.12-torch2.4/lib/python3.12/site-packages/torch/lib/libomp.dylib
(lldb)
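For a quick check without lldb, a rough diagnostic sketch (macOS only; it enumerates loaded images through the dyld APIs via ctypes) can list every OpenMP runtime mapped into the process:

import ctypes

# dyld's image-enumeration functions are exported by libSystem on macOS
dyld = ctypes.CDLL("/usr/lib/libSystem.B.dylib")
dyld._dyld_image_count.restype = ctypes.c_uint32
dyld._dyld_get_image_name.restype = ctypes.c_char_p
dyld._dyld_get_image_name.argtypes = [ctypes.c_uint32]


def loaded_openmp_runtimes():
    """Return the paths of all libomp/libiomp copies mapped into this process."""
    images = (
        dyld._dyld_get_image_name(i).decode()
        for i in range(dyld._dyld_image_count())
    )
    return [path for path in images if "libomp" in path or "libiomp" in path]


import lightgbm  # noqa: F401 -- importing both packages is what produces the duplicates
import torch     # noqa: F401

print(loaded_openmp_runtimes())  # more than one path means duplicate OpenMP runtimes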
If I comment out the brew install libomp step on CI, we get a different error: Library not loaded: @rpath/libomp.dylib.
From this comment, https://github.com/microsoft/LightGBM/issues/6262#issuecomment-1885303539, the issue apparently stems from OpenMP not being installed.
Full traceback if brew install libomp is commented out:
To be frank, I'm unsure if the problem lies solely with PyTorch at this point, as two other runtimes are importing libomp, and there isn't much one can do short of disabling OpenMP (which one can do programmatically by calling torch.set_num_threads(1)).
@connortann can you please try adding torch.set_num_threads(1) at the start of your test and let me know whether or not it fixes the problem? (It works for me locally.)
Yep, certainly: the tests do indeed pass with torch.set_num_threads(1).
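For reference, here's the test_bug.py snippet from above with that workaround applied (only the set_num_threads call is new):

import time

import lightgbm  # noqa: F401
import torch
from sklearn.datasets import fetch_california_housing

torch.set_num_threads(1)  # workaround: keep torch's OpenMP pool to a single thread


def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)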
I'm unsure if the problem lies solely with PyTorch at this point
Indeed, as the segfault only occurs when lightgbm is imported first. Possibly relevant: we had a separate segfault issue when torch is imported before lightgbm, as described in this comment: https://github.com/shap/shap/issues/3092#issuecomment-1636806906
I hope that collectively we can find a fix: torch and lightgbm are both extremely popular libraries, so it's quite common for them to be installed in the same environment.
I cross-posted to LightGBM, because as you say the problem doesn't seem to lie solely with PyTorch: https://github.com/microsoft/LightGBM/issues/6595
I'll add that this PyTorch segmentation fault on macOS does not necessarily require LightGBM; other packages such as vapoursynth can cause the same problem.
As this issue requires a community effort, it is probably best to centralize the discussion. @malfet, would you be willing to join https://github.com/microsoft/LightGBM/issues/6595#issuecomment-2351398026?
I am having this problem as well.
My objective is to run the https://github.com/black-forest-labs/flux demo with PyTorch 2.4.1 on an Intel MacBook Pro's Radeon 5500M.
What I've done so far:
- built a torch wheel with pytorch/builder, called the way it used to be called from CircleCI before x64 was dropped
- checked that python -c "import torch; print(torch.backends.mps.is_available())" works
- tried to run the flux demo
- had to build torchvision myself as well, because it references the torch package and would otherwise conflict
- built torchvision and installed the wheel into the venv
- ran the flux script with DYLD_PRINT_LIBRARIES=1 and noticed that libiomp5.dylib is being imported both from torch and functorch (see the sketch after this list)
- rebuilt functorch with my torch wheel

After all that the segfault wouldn't go away.
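The DYLD_PRINT_LIBRARIES check mentioned in the list above can be scripted roughly like this (a sketch; run_flux.py is a hypothetical entry point, substitute your own, and it assumes a non-SIP-protected Python, e.g. a venv, so dyld honours the variable):

import os
import subprocess
import sys


def openmp_images_loaded_by(script):
    """Re-run `script` with DYLD_PRINT_LIBRARIES=1 and return the OpenMP images dyld loads."""
    env = dict(os.environ, DYLD_PRINT_LIBRARIES="1")
    proc = subprocess.run(
        [sys.executable, script], env=env, capture_output=True, text=True
    )
    # dyld prints one line per loaded image on stderr
    return [
        line for line in proc.stderr.splitlines()
        if "libomp" in line or "libiomp" in line
    ]


if __name__ == "__main__":
    for line in openmp_images_loaded_by("run_flux.py"):  # hypothetical script name
        print(line)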
I'm ready to dig into the issue, but I need some guidance/fresh ideas to facilitate the investigation.
@gchanan @dzhulgakov @ezyang @malfet If you could have a look and participate in the discussion in https://github.com/microsoft/LightGBM/issues/6595, that would be highly appreciated. I consider those kinds of bugs among the worst for users.
This issue is mainly caused by PyTorch. The short summary of https://github.com/microsoft/LightGBM/issues/6595#issuecomment-2351398026 is:
torch vendors a libomp.dylib (without library or symbol name mangling) and always prefers that vendored copy to a system installation.
lightgbm searches for a system installation.
As a result, if you've installed both these libraries via wheels on macOS, loading both will result in two copies of libomp.dylib being loaded. This may or may not show up as runtime issues; it's unpredictable, because symbol resolution is lazy by default and therefore depends on the code paths used.
Even if all copies of libomp.dylib loaded into the process are ABI-compatible with each other, there can still be runtime segfaults as a result of mixing symbols from libraries loaded at different memory addresses, I think.
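For test suites hit by this, one possible stopgap is to apply the torch.set_num_threads(1) workaround discussed above before any test runs; the conftest.py placement below is just a sketch of one way to do that:

# conftest.py -- sketch of a stopgap based on the torch.set_num_threads(1) workaround above
import torch


def pytest_configure(config):
    # Cap torch's intra-op parallelism at one thread so torch never spins up
    # its own OpenMP worker pool next to the copy loaded by other packages.
    torch.set_num_threads(1)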
🐛 Describe the bug
At shap, we have run into problems with our CI jobs on macOS, e.g. see here. I tracked this down to an issue with torch==2.2.1. Here is code to reproduce the issue (this works on torch==2.2.0); execute with python -m pytest <filename>:
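The reproducer is the snippet quoted near the top of this thread ("As above"):

import time

import torch
from sklearn.datasets import fetch_california_housing


def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)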
Stacktrace:
Versions
cc @malfet @albanD @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10