pybind / pybind11

Seamless operability between C++11 and Python
https://pybind11.readthedocs.io/

[BUG]: Segmentation Fault 11 w/ Conda + Pybind11 #3907

Closed by coreyjadams 2 years ago

coreyjadams commented 2 years ago

Problem description

I have a segmentation fault on macOS that only appears with the conda builds of Python. I haven't been able to solve this one myself, sorry.

In short: when I use the package I've built with pybind11, importing it from Python segfaults. I've verified this with Python 3.6, 3.9, and 3.10, using the latest version of pybind11. I have a standalone repository that reproduces this bug.

Here is the stack trace when running under lldb; it appears to be related to take_gil:

>>> import larcv
Process 24818 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
    frame #0: 0x00000001041afc17 libpython3.10.dylib`take_gil + 71
libpython3.10.dylib`take_gil:
->  0x1041afc17 <+71>: movq   0x10(%rax), %r13
    0x1041afc1b <+75>: leaq   0x1b0(%r13), %r12
    0x1041afc22 <+82>: movq   %r12, %rdi
    0x1041afc25 <+85>: callq  0x1042e1212               ; symbol stub for: pthread_mutex_lock
Target 0: (python) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
  * frame #0: 0x00000001041afc17 libpython3.10.dylib`take_gil + 71
    frame #1: 0x0000000104226230 libpython3.10.dylib`PyGILState_Ensure + 48
    frame #2: 0x0000000101e495df pylarcv.cpython-310-darwin.so`___lldb_unnamed_symbol1$$pylarcv.cpython-310-darwin.so + 63
    frame #3: 0x0000000101e490a6 pylarcv.cpython-310-darwin.so`PyInit_pylarcv + 118
    frame #4: 0x00000001001fd17e python`_imp_create_dynamic + 1486
    frame #5: 0x00000001000e75a5 python`cfunction_vectorcall_FASTCALL + 85
    frame #6: 0x00000001001b2b9a python`_PyEval_EvalFrameDefault + 2986
    frame #7: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #8: 0x00000001001c16ee python`call_function + 174
    frame #9: 0x00000001001b8fec python`_PyEval_EvalFrameDefault + 28668
    frame #10: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #11: 0x00000001001c16ee python`call_function + 174
    frame #12: 0x00000001001b79b2 python`_PyEval_EvalFrameDefault + 22978
    frame #13: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #14: 0x00000001001c16ee python`call_function + 174
    frame #15: 0x00000001001b6cd3 python`_PyEval_EvalFrameDefault + 19683
    frame #16: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #17: 0x00000001001c16ee python`call_function + 174
    frame #18: 0x00000001001b6cd3 python`_PyEval_EvalFrameDefault + 19683
    frame #19: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #20: 0x00000001001c16ee python`call_function + 174
    frame #21: 0x00000001001b6cd3 python`_PyEval_EvalFrameDefault + 19683
    frame #22: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #23: 0x000000010008577b python`object_vacall + 427
    frame #24: 0x0000000100085a29 python`_PyObject_CallMethodIdObjArgs + 249
    frame #25: 0x00000001001f8a64 python`PyImport_ImportModuleLevelObject + 3076
    frame #26: 0x00000001001b8410 python`_PyEval_EvalFrameDefault + 25632
    frame #27: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #28: 0x00000001001aa979 python`builtin_exec + 345
    frame #29: 0x00000001000e75a5 python`cfunction_vectorcall_FASTCALL + 85
    frame #30: 0x00000001001b2b9a python`_PyEval_EvalFrameDefault + 2986
    frame #31: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #32: 0x00000001001c16ee python`call_function + 174
    frame #33: 0x00000001001b8fec python`_PyEval_EvalFrameDefault + 28668
    frame #34: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #35: 0x00000001001c16ee python`call_function + 174
    frame #36: 0x00000001001b79b2 python`_PyEval_EvalFrameDefault + 22978
    frame #37: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #38: 0x00000001001c16ee python`call_function + 174
    frame #39: 0x00000001001b6cd3 python`_PyEval_EvalFrameDefault + 19683
    frame #40: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #41: 0x00000001001c16ee python`call_function + 174
    frame #42: 0x00000001001b6cd3 python`_PyEval_EvalFrameDefault + 19683
    frame #43: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #44: 0x000000010008577b python`object_vacall + 427
    frame #45: 0x0000000100085a29 python`_PyObject_CallMethodIdObjArgs + 249
    frame #46: 0x00000001001f8a64 python`PyImport_ImportModuleLevelObject + 3076
    frame #47: 0x00000001001b8410 python`_PyEval_EvalFrameDefault + 25632
    frame #48: 0x00000001001b0588 python`_PyEval_Vector + 376
    frame #49: 0x00000001002277a9 python`PyRun_InteractiveOneObjectEx + 1049
    frame #50: 0x000000010022640a python`_PyRun_InteractiveLoopObject + 122
    frame #51: 0x0000000100225cbf python`_PyRun_AnyFileObject + 63
    frame #52: 0x000000010022a106 python`PyRun_AnyFileExFlags + 118
    frame #53: 0x0000000100250f2f python`pymain_run_stdin + 175
    frame #54: 0x000000010025057d python`pymain_run_python + 509
    frame #55: 0x0000000100250335 python`Py_RunMain + 37
    frame #56: 0x0000000100251910 python`pymain_main + 64
    frame #57: 0x00000001000026d8 python`main + 56
    frame #58: 0x000000010049a51e dyld`start + 462
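In the backtrace, the interpreter loop (`_PyEval_EvalFrameDefault`, `Py_RunMain`) runs from the `python` executable itself, while `take_gil` and `PyGILState_Ensure` are resolved in `libpython3.10.dylib`. A hedged reading is that two copies of the CPython runtime end up in one process and the extension's init code calls into the copy that was never initialized; the faulting instruction `movq 0x10(%rax)` at address 0x10 is consistent with `%rax` being a null pointer. The hypothetical helper below, which is not part of lldb or pybind11, groups backtrace frames by image so this pattern is easy to spot:

```python
from __future__ import annotations

import re
from collections import defaultdict

# Matches lldb frame lines such as:
#   frame #0: 0x00000001041afc17 libpython3.10.dylib`take_gil + 71
FRAME_RE = re.compile(r"frame #(\d+): 0x[0-9a-fA-F]+ ([^`\s]+)`(\S+)")

# Symbols that belong to the CPython runtime itself.
RUNTIME_SYMBOLS = ("take_gil", "PyGILState_Ensure", "_PyEval_EvalFrameDefault", "Py_RunMain")


def images_by_symbol(backtrace: str) -> dict[str, set[str]]:
    """Map each symbol in an lldb backtrace to the binary images it appears in."""
    seen: dict[str, set[str]] = defaultdict(set)
    for match in FRAME_RE.finditer(backtrace):
        _frame_no, image, symbol = match.groups()
        seen[symbol].add(image)
    return dict(seen)


def runtime_images(backtrace: str) -> set[str]:
    """Return every image that hosts one of the CPython runtime symbols."""
    mapping = images_by_symbol(backtrace)
    return {image for sym in RUNTIME_SYMBOLS for image in mapping.get(sym, set())}


# A few frames copied from the backtrace above.
TRACE_EXCERPT = """
  * frame #0: 0x00000001041afc17 libpython3.10.dylib`take_gil + 71
    frame #1: 0x0000000104226230 libpython3.10.dylib`PyGILState_Ensure + 48
    frame #3: 0x0000000101e490a6 pylarcv.cpython-310-darwin.so`PyInit_pylarcv + 118
    frame #6: 0x00000001001b2b9a python`_PyEval_EvalFrameDefault + 2986
    frame #55: 0x0000000100250335 python`Py_RunMain + 37
"""

if __name__ == "__main__":
    # More than one image here suggests two copies of the runtime in one process.
    print(sorted(runtime_images(TRACE_EXCERPT)))
```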

Reproducible example code

This repository reproduces the bug. Sorry if you wanted something smaller; this is about as small as I can make it, and it is nearly standalone (obviously, you need conda to run it).

git@github.com:coreyjadams/larcv3-pybind11-example.git

To replicate the bug, you need to be on macOS (I am on Monterey, the latest release) and using Miniconda. I created an environment for each test I did:

conda create -n test-env-python-3.10 # Accept any questions, etc
conda activate test-env-python-3.10 # Activate the environment
conda install python=3.10 cmake scikit-build # The dependencies are just build systems.

Then, after cloning the repository I linked above, one can do:
```bash
git submodule update --init # pybind11 is a submodule here
python setup.py build # Trigger scikit-build to run CMake
python setup.py install
```

From a different directory (otherwise Python tries to import the larcv folder in the repository instead), run:

>>> import larcv

This should reproduce the crash.
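Because the crash kills the interpreter outright, a small driver that performs the import in a child process from a scratch directory (which sidesteps the local `larcv` folder mentioned above) makes the check repeatable across environments. This is a sketch: `import_crashes` is a hypothetical name, and it assumes a POSIX system, where `subprocess` reports a process killed by a signal as a negative return code:

```python
import signal
import subprocess
import sys
import tempfile


def import_crashes(module: str, python: str = sys.executable) -> bool:
    """Import `module` in a fresh interpreter and report whether it died from SIGSEGV.

    The child runs in an empty temporary directory, so a same-named source
    folder in the current checkout cannot shadow the installed package.
    """
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            [python, "-c", f"import {module}"],
            cwd=workdir,
            capture_output=True,
            text=True,
        )
    # On POSIX, a child killed by signal N has returncode == -N.
    return result.returncode == -signal.SIGSEGV


if __name__ == "__main__":
    if import_crashes("larcv"):
        print("import larcv crashed with SIGSEGV")
    else:
        print("import larcv did not segfault (it may still have raised an error)")
```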

henryiii commented 2 years ago

Conda doesn't support building from Python, only through conda-build. You are likely mixing the system compilers and the conda compilers, which would cause this crash. Try `conda install compilers`; that might get the build to use the conda compilers (make sure you first remove any cached build output, such as the `_skbuild` directory).
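Two things make this kind of compiler mix-up easy to miss: CMake reads the `CC` and `CXX` environment variables only when a build directory is first configured, and it then caches the chosen compilers, so a stale `_skbuild` tree keeps using whatever was detected earlier. The sketch below assumes that activating a conda environment with the compilers installed exports `CC`/`CXX` values that resolve to compilers inside `$CONDA_PREFIX` (as conda-forge's compiler activation scripts set them); both function names are hypothetical:

```python
from __future__ import annotations

import os
import shutil
from pathlib import Path
from typing import Mapping, Optional


def compilers_outside_conda(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return the CC/CXX settings that do not resolve inside $CONDA_PREFIX."""
    env = os.environ if env is None else env
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        raise RuntimeError("CONDA_PREFIX is not set; is a conda environment active?")
    prefix_path = Path(prefix).resolve()

    offenders: dict[str, str] = {}
    for var in ("CC", "CXX"):
        value = env.get(var)
        if not value:
            # Unset: CMake falls back to its own compiler search (e.g. the system clang).
            offenders[var] = "<unset>"
            continue
        # Bare names such as "clang" are looked up on the environment's PATH.
        resolved = value if os.path.isabs(value) else shutil.which(value, path=env.get("PATH", ""))
        if resolved is None or prefix_path not in Path(resolved).resolve().parents:
            offenders[var] = value
    return offenders


def clear_skbuild_cache(project_dir: str = ".") -> bool:
    """Delete scikit-build's _skbuild directory so CMake re-detects the compilers."""
    cache = Path(project_dir) / "_skbuild"
    if cache.is_dir():
        shutil.rmtree(cache)
        return True
    return False


if __name__ == "__main__":
    try:
        bad = compilers_outside_conda()
    except RuntimeError as err:
        print(err)
    else:
        print("compilers look conda-provided" if not bad else f"not from conda: {bad}")
```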

wolfv commented 2 years ago

I do see this issue as well on macOS x64, but I am pretty sure I am using the conda compilers :)

I tried adding -undefined dynamic_lookup, which has helped in the past, and removing the CMAKE_STRIP step, but neither has helped so far. Will investigate further.

It's failing for us with rclpy, which is a dependency of ROS, the Robot Operating System. Same exact error.
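For context on the `-undefined dynamic_lookup` flag: it tells the macOS linker to leave undefined symbols (here, the `Py*` C API) unresolved at link time so they are looked up at run time in the images already loaded into the process. CPython records the command it uses to link its own extension modules in `sysconfig` as `LDSHARED`, and on macOS builds that command typically contains this flag rather than a libpython link. The following sketch, with hypothetical helper names, reports those settings for the active interpreter:

```python
import shlex
import sys
import sysconfig


def uses_dynamic_lookup(link_command: str) -> bool:
    """True if a link command asks for dynamic lookup of undefined symbols."""
    args = shlex.split(link_command)
    # Accept both the compiler-driver form and the form passed through with -Wl.
    pair = any(a == "-undefined" and b == "dynamic_lookup" for a, b in zip(args, args[1:]))
    return pair or "-Wl,-undefined,dynamic_lookup" in args


def interpreter_link_settings() -> dict:
    """Report how the running CPython build links its own extension modules."""
    # LDSHARED is None on some platforms (e.g. Windows), so fall back to "".
    ldshared = sysconfig.get_config_var("LDSHARED") or ""
    return {
        "platform": sys.platform,
        "LDSHARED": ldshared,
        "dynamic_lookup": uses_dynamic_lookup(ldshared),
        # 1 when CPython was configured with --enable-shared (libpython built as a shared library).
        "Py_ENABLE_SHARED": sysconfig.get_config_var("Py_ENABLE_SHARED"),
    }


if __name__ == "__main__":
    for key, value in interpreter_link_settings().items():
        print(f"{key}: {value}")
```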

wolfv commented 2 years ago

Hm, I managed to replicate the issue with your example larcv code. The solution seems to boil down to not linking Python explicitly in the lower-level libraries (or anywhere) and relying on "-undefined dynamic_lookup" instead.

I've added

set_target_properties(larcv3 PROPERTIES
                      LINK_FLAGS "-undefined dynamic_lookup")

and removed every instance of linking to ${Python_LIBRARIES}, and things then seem to work. I think pybind11_add_module already sets that linker flag automatically.
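A quick way to check whether a build still links Python anywhere, including lower-level libraries like `larcv3`, is to list each binary's dynamic dependencies with `otool -L` (macOS only) and look for a libpython dylib or `Python.framework`. This is a sketch with hypothetical helper names; the parser works on captured `otool -L` text, so it can be exercised off macOS:

```python
from __future__ import annotations

import re
import subprocess
from pathlib import PurePosixPath

# Matches install names such as libpython3.10.dylib or libpython3.10d.dylib.
LIBPYTHON_RE = re.compile(r"^libpython\d+\.\d+[a-z]*\.dylib$")


def linked_libraries(otool_output: str) -> list[str]:
    """Extract install names from `otool -L` output."""
    libs = []
    for line in otool_output.splitlines():
        line = line.strip()
        # Skip blank lines and headers like "/path/to/file:" or "... (architecture x86_64):".
        if not line or line.endswith(":"):
            continue
        # Entries look like: <install name> (compatibility version ..., current version ...)
        libs.append(line.split(" (", 1)[0])
    return libs


def python_runtime_links(otool_output: str) -> list[str]:
    """Return the dependencies that would load a Python runtime into the process."""
    return [
        lib
        for lib in linked_libraries(otool_output)
        if LIBPYTHON_RE.match(PurePosixPath(lib).name) or "Python.framework" in lib
    ]


def check_binary(path: str) -> list[str]:
    """Run `otool -L` (macOS only) on an extension module or shared library."""
    result = subprocess.run(["otool", "-L", path], check=True, capture_output=True, text=True)
    return python_runtime_links(result.stdout)
```

An empty list from `check_binary` for the extension and every library it loads is what the fix above is aiming for.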

coreyjadams commented 2 years ago

@wolfv thanks for this tip! I will test it out tomorrow and get back to you; it would be awesome to have this resolved.

wolfv commented 2 years ago

In my case, pybind11_add_module(blabla SHARED ...) did not work; however, pybind11_add_module(blabla MODULE ...) does.
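The thread doesn't establish why `SHARED` failed where `MODULE` worked, but one concrete difference on macOS is the kind of binary produced: CMake links a `MODULE` library with `-bundle` (a Mach-O `MH_BUNDLE`) and a `SHARED` library with `-dynamiclib` (an `MH_DYLIB`). The sketch below reads the `filetype` field of a thin 64-bit Mach-O header to tell the two apart; the constants come from `<mach-o/loader.h>`, the function names are hypothetical, and universal (fat) binaries, which start with a different magic number, are rejected rather than parsed:

```python
import struct

# Constants from <mach-o/loader.h>.
MH_MAGIC_64 = 0xFEEDFACF  # 64-bit Mach-O magic
MH_CIGAM_64 = 0xCFFAEDFE  # the same magic, byte-swapped
FILETYPES = {0x2: "MH_EXECUTE", 0x6: "MH_DYLIB", 0x8: "MH_BUNDLE"}


def macho_filetype(header: bytes) -> str:
    """Classify a thin 64-bit Mach-O file from its first 16 header bytes."""
    if len(header) < 16:
        raise ValueError("need at least 16 bytes of the Mach-O header")
    (magic,) = struct.unpack_from("<I", header, 0)
    if magic == MH_MAGIC_64:
        order = "<"  # little-endian file (x86_64, arm64)
    elif magic == MH_CIGAM_64:
        order = ">"  # big-endian file
    else:
        raise ValueError(f"not a thin 64-bit Mach-O file (magic {magic:#x})")
    # mach_header_64 starts with: magic, cputype, cpusubtype, filetype (4 bytes each).
    (filetype,) = struct.unpack_from(order + "I", header, 12)
    return FILETYPES.get(filetype, f"other ({filetype:#x})")


def binary_kind(path: str) -> str:
    """Read just enough of a file on disk to classify it."""
    with open(path, "rb") as fh:
        return macho_filetype(fh.read(16))
```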