Thanks for the excellent report @connortann!
Since #6391, `import lightgbm` on macOS will try to use the already-loaded OpenMP if there is one. So it shouldn't be the case that `import lightgbm` can cause "multiple OpenMP runtimes being loaded" (assuming that was a typo in your original report and you really meant "OpenMP", not "OpenML").
Since you have `scikit-learn` in the environment, `import lightgbm` will import `sklearn`. I suspect that `scikit-learn` may be contributing to this problem. In the past, we've seen that library's handling of its OpenMP dependency contribute to this "multiple OpenMP runtimes being loaded" situation.
To narrow it down further, could you try 2 other tests?

1. `import sklearn` before / after `torch` (no `lightgbm` involved)
2. `pip uninstall --yes scikit-learn` and then testing `import lightgbm` before / after `torch`
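For reference, a rough sketch of those two checks as shell one-liners (hypothetical commands; the `torch.ones(...)` call is only there to force some OpenMP-using work, mirroring the reproducer later in this thread):

```sh
# Test 1: sklearn and torch only, in both import orders (no lightgbm involved)
python -c "import sklearn, torch; torch.ones(200_000)"
python -c "import torch, sklearn; torch.ones(200_000)"

# Test 2: remove scikit-learn, then try lightgbm and torch in both import orders
pip uninstall --yes scikit-learn
python -c "import lightgbm, torch; torch.ones(200_000)"
python -c "import torch, lightgbm; torch.ones(200_000)"
```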
I'm sorry to possibly involve yet a THIRD project in your investigation. I'm familiar with these topics and happy to help us all reach a resolution.
You may also find these relevant:
Thanks for the response! Yes, I think you're right about sklearn being relevant: the bug seems not to occur if sklearn is not imported.
Here's what I tried; the tests pass in all these situations:

1. `sklearn`, then `torch` (no `lightgbm` involved). Tests pass.
2. `torch`, then `sklearn` (no `lightgbm` involved). Tests pass.
3. `lightgbm`, then `torch`. Tests pass.
4. `torch`, then `lightgbm`. Tests pass.

So, I think the example above is the minimal reproducer: `lightgbm`, `torch` and `sklearn`!
Adding my two cents to this issue. I managed to reproduce the bug following the setup given by @connortann.
Running the following command raises the segfault:

`python -m pytest test_bug.py`

with

torch==2.2.2
scikit-learn==1.5.1
numpy==1.26.4
lightgbm==4.5.0

but prefixing the command with `OMP_NUM_THREADS=1` (forcing single-threaded operation) irons out the segfault.
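For clarity, the full workaround command would be:

```sh
# Forcing every OpenMP runtime to a single thread avoids the crash in this setup
OMP_NUM_THREADS=1 python -m pytest test_bug.py
```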
@lesteve ping, as scikit-learn is involved in the minimal reproducer (OpenMP related).
Honestly, @jeremiedbb may be a better person for this on the scikit-learn side. This is quite a tricky topic at the interface of different projects, which make different choices about how to handle OpenMP in their wheels, and OpenMP in itself is already tricky.
The root cause is generally having multiple OpenMP runtimes loaded, and using `threadpoolctl` can highlight this; see this doc and below.
One known workaround is to use conda-forge, which installs a single OpenMP runtime and avoids most of these issues. I wanted to mention it, even if I understand using conda rather than pip is a non-starter in some use cases.
In this particular case, I played a bit with the code and can reproduce without scikit-learn, i.e. only with LightGBM and PyTorch. To be honest, I have heard of cases that go wrong with PyTorch and scikit-learn for similar reasons, but it's generally a bit hard to get a reproducer ...
I put together a quick repo: https://github.com/lesteve/lightgbm-pytorch-macos-segfault.
In particular, see the build log which shows a segfault, the Python file, and the workflow YAML file. Importing pytorch before lightgbm works fine, see the build log.
Python file:
import pprint
import sys
import platform
import lightgbm
import torch
import threadpoolctl
print('version: ', sys.version, flush=True)
print('platform: ', platform.platform(), flush=True)
pprint.pprint(threadpoolctl.threadpool_info())
print('before torch tensor', flush=True)
t = torch.ones(200_000)
print('after torch tensor', flush=True)
Output:
version: 3.12.5 (v3.12.5:ff3bc82f7c9, Aug 7 2024, 05:32:06) [Clang 13.0.0 (clang-1300.0.29.30)]
platform: macOS-14.6.1-arm64-arm-64bit
[{'architecture': 'armv8',
'filepath': '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/numpy/.dylibs/libopenblas64_.0.dylib',
'internal_api': 'openblas',
'num_threads': 3,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.23.dev'},
{'filepath': '/opt/homebrew/Cellar/libomp/18.1.8/lib/libomp.dylib',
'internal_api': 'openmp',
'num_threads': 3,
'prefix': 'libomp',
'user_api': 'openmp',
'version': None},
{'filepath': '/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/torch/lib/libomp.dylib',
'internal_api': 'openmp',
'num_threads': 3,
'prefix': 'libomp',
'user_api': 'openmp',
'version': None}]
before torch tensor
/Users/runner/work/_temp/558d95ac-031b-4858-bfb0-b7bb4841e27b.sh: line 1: 1924 Segmentation fault: 11 python test.py
From the threadpoolctl info, you can tell that there are multiple OpenMP runtimes in use: the brew one (loaded by LightGBM) and the one bundled in the PyTorch wheel.
pip list
Package Version
----------------- ---------
certifi 2024.8.30
filelock 3.16.0
fsspec 2024.9.0
Jinja2 3.1.4
lightgbm 4.5.0
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.3
numpy 1.26.4
pip 24.2
scipy 1.14.1
setuptools 74.1.2
sympy 1.13.2
threadpoolctl 3.5.0
torch 2.4.1
typing_extensions 4.12.2
(Edit: sorry pinged the wrong Jérémie originally ...)
Thanks very much for that! Your example has helped to clarify the picture for me a lot.
- `torch` vendors a `libomp.dylib` (without library or symbol name mangling) and always prefers that vendored copy to a system installation.
- `lightgbm` searches for a system installation.

As a result, if you've installed both these libraries via wheels on macOS, loading both will result in 2 copies of `libomp.dylib` being loaded. This may or may not show up as runtime issues... it's unpredictable, because symbol resolution is lazy by default and therefore depends on the code paths used.
Even if all copies of `libomp.dylib` loaded into the process are ABI-compatible with each other, there can still be runtime segfaults as a result of mixing symbols from libraries loaded at different memory addresses, I think.
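One quick way to check for this situation (using `threadpoolctl`, the same tool as in the output above) is a sketch along these lines:

```python
import threadpoolctl

# Collect every OpenMP runtime visible in the current process; more than one
# filepath here means multiple libomp copies are loaded.
openmp_libs = [
    info["filepath"]
    for info in threadpoolctl.threadpool_info()
    if info["user_api"] == "openmp"
]
print(openmp_libs)
if len(openmp_libs) > 1:
    print("multiple OpenMP runtimes are loaded in this process")
```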
Not sure why @connortann was not able to reproduce this in https://github.com/microsoft/LightGBM/issues/6595#issuecomment-2273745921. That comment shows:

> Without sklearn installed; import `torch` then `lightgbm`. Tests pass
Probably because that example uses different code paths in `torch`. Many OpenMP symbols would be resolved only at the first call site (as described in this Stack Overflow answer and the macOS docs it links to), so different code paths can lead to different behavior in terms of which copies of `libomp.dylib` certain symbols are found in.
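As an aside, one way to watch which `libomp.dylib` copies dyld actually loads for a given script is dyld's own debug output (note that SIP can strip `DYLD_*` variables for some protected binaries):

```sh
# Print every dylib as it is loaded, keeping only the OpenMP runtimes
DYLD_PRINT_LIBRARIES=1 python test.py 2>&1 | grep libomp
```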
I think some mix of the following would make this better for users.

`torch` could more aggressively isolate its OpenMP dependency

If `torch` wants to vendor its own OpenMP in this way, it could further isolate that dependency to only `torch`'s own uses, by doing one of the following:
`lightgbm` could vendor OpenMP like `torch` does, but with the added strictness described above

I really do not want to do this, for the reasons mentioned in #6391 and the things linked to it.
`torch` could stop vendoring OpenMP and use the same LC_RPATH search order `lightgbm` does

I don't know if this would be palatable for `torch`. It comes with its own challenges.
`lightgbm` could add something like `@loader_path/../../torch/lib` earlier in its list of RPATHs

This only works as long as `torch` is vendoring a version of `libomp.dylib` that `lightgbm` is ABI-compatible with.
And it only helps for the narrow case of `lightgbm` and `torch` with no other OpenMP-using dependencies. Every other library depending on OpenMP (e.g. `xgboost`, `scikit-learn`) would need to do something similar for them all to reliably use that same copy of `libomp.dylib` at runtime.
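For illustration only (not a recommendation), here is roughly what inspecting and patching those RPATHs by hand would look like with standard macOS tools; the library path is a hypothetical placeholder:

```sh
# lib_lightgbm.dylib as shipped inside the installed lightgbm wheel
# (hypothetical location; find the real path in your site-packages)
LIB=.../site-packages/lightgbm/lib/lib_lightgbm.dylib

# Inspect linked libraries and the current LC_RPATH entries
otool -L "$LIB"
otool -l "$LIB" | grep -A2 LC_RPATH

# Add torch's bundled lib directory to the RPATH list
# (note: -add_rpath appends a new LC_RPATH; putting it *earlier* than existing
# entries would take additional rework of the load commands)
install_name_tool -add_rpath "@loader_path/../../torch/lib" "$LIB"
```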
OpenMP could be shipped as a single shared copy that everything links to

As described in https://pypackaging-native.github.io/key-issues/native-dependencies/blas_openmp/#potential-solutions-or-mitigations. This is the wheel-based equivalent of how `conda` handles this case, as @lesteve alluded to... you download a single copy of the library into the environment, and everything else dynamically links to it.
I personally would be willing to help with this community effort, though I don't feel qualified to lead it.
Some related discussions (about shared-library-only wheels, not OpenMP) that have been happening in RAPIDS libraries:
Description
A segmentation fault occurs on macOS when lightgbm and pytorch are both installed, depending on the order of imports.
Possibly related: #4229
Reproducible example
To reproduce the issue on GH actions:
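A minimal sketch of the kind of test file involved (hypothetical; it just combines the imports and the `torch.ones` call discussed in the comments above):

```python
# Hypothetical minimal reproducer; the key ingredients reported in this thread
# are importing lightgbm (and sklearn) before torch, then doing some torch work.
import lightgbm  # noqa: F401
import sklearn   # noqa: F401
import torch


def test_torch_tensor_after_lightgbm():
    # On macOS wheel installs this can die with "Fatal Python error: Segmentation fault"
    # when two copies of libomp.dylib end up loaded in the process.
    t = torch.ones(200_000)
    assert t.numel() == 200_000
```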
Leads to `Fatal Python error: Segmentation fault`. Full output:

Environment info
LightGBM version or commit hash: 4.5.0

Result of `pip list`:

Additional Comments
We came across this issue over at the `shap` repo, trying to run tests with the latest versions of both pytorch and lightgbm. We initially raised this issue on the pytorch issue tracker: https://github.com/pytorch/pytorch/issues/121101. However, the underlying issue doesn't seem to be specific to just pytorch or lightgbm, but rather relates to the mutual compatibility of the two. The issue seems to relate to multiple ~OpenML~ OpenMP runtimes being loaded.
So, I thought it would be worth raising the issue here too in the hope that it helps us collectively find a fix.