Closed hyanwong closed 3 years ago
Great, thanks @hyanwong. Can you run this through gdb please?
gdb python
run -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
bt
This should run until it segfaults, and then we want to print out the backtrace. (bt)
Or whatever equivalent debugger is present in the clang toolchain.
OS X needs to codesign gdb, or run under sudo. Once I do that, it's not super helpful with the default install of Python 3.9 (below). I guess I need to recompile Python 3.9 for this to work?
Reading symbols from /usr/local/bin/python3...
(No debugging symbols found in /usr/local/bin/python3)
(gdb) run -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
Starting program: /usr/local/bin/python3 -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
[New Thread 0x2603 of process 34225]
[New Thread 0x1903 of process 34225]
warning: unhandled dyld version (16)
[New Thread 0x2303 of process 34225]
Thread 3 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0x2303 of process 34225]
0x000000010000e000 in ?? ()
(gdb) bt
#0 0x000000010000e000 in ?? ()
#1 0x0000000100000000 in ?? ()
#2 0x0000000000000006 in ?? ()
#3 0x00007ffeefbffbc0 in ?? ()
#4 0x00007ffeefbffc39 in ?? ()
#5 0x00007ffeefbffc3c in ?? ()
#6 0x00007ffeefbffc43 in ?? ()
#7 0x00007ffeefbffc47 in ?? ()
#8 0x00007ffeefbffc4b in ?? ()
#9 0x0000000000000000 in ?? ()
Can you run
make clean
CFLAGS="-ggdb -O0" python3 setup.py build_ext --inplace
before the pytest?
Although if your stack is only 9 deep then we are probably still in Python-land, which is weird.
Oddly, that hangs when running.
I'll try to recompile Python 3.9 with the debug flags.
Hmm https://unconj.ca/blog/setting-up-gdb-for-debugging-python-on-os-x.html:
"Note that for some reason, Python always sends a SIGTRAP before it begins executing the script. Don’t worry, just continue. If you want to find out why your program segfaults, you can run bt:"
Ah didn't know that! Just do continue
then!
Can't get the damn thing to do anything other than hang now. Argh.
make clean
make
?
Yep. Tried all that
Where did you get your Python from @hyanwong? Is there a debug version? We'll need the debug symbols everywhere to get to the bottom of this.
Slightly weirdly, after make; make clean, I can't now run the tests normally:
/usr/local/bin/python3.9 -m pytest tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
================================================================ test session starts =================================================================
platform darwin -- Python 3.9.0, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /Users/yan/Documents/Research/BDI/tsinfer, configfile: pytest.ini
plugins: xdist-2.1.0, cov-2.10.1, forked-1.3.0
gw0 [0] / gw1 [0] / gw2 [0] / gw3 [0]
======================================================================= ERRORS =======================================================================
_________________________________________________________ ERROR collecting tests/test_cli.py _________________________________________________________
ImportError while importing test module '/Users/yan/Documents/Research/BDI/tsinfer/tests/test_cli.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/local/Cellar/python@3.9/3.9.0_2/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py:127: in import_module
return _bootstrap._gcd_import(name[level:], package, level)
tests/test_cli.py:35: in <module>
import tsinfer
tsinfer/__init__.py:38: in <module>
from .inference import * # NOQA
tsinfer/inference.py:36: in <module>
import _tsinfer
E ModuleNotFoundError: No module named '_tsinfer'
============================================================== short test summary info ===============================================================
ERROR tests/test_cli.py
Where did you get your Python from @hyanwong? Is there a debug version? We'll need the debug symbols everywhere to get to the bottom of this.
I'm trying it on 2 different python 3.9 versions, one from conda, another from the OS X installation repo "Homebrew". Neither seem to have debug symbols compiled in, so I guess I'll need to compile my own.
Yes, looks like compiling your own might be the simplest for getting debug symbols, as there's mac-specific technical issues with the conda distributed binaries.
I guess this is CPython, right?
From reading your link https://unconj.ca/blog/setting-up-gdb-for-debugging-python-on-os-x.html it seems they have symbols without recompiling python.
Well now. If I compile my own version of python3.9.1rc1, the tests all now pass.
Interesting/annoying. What version of Python and where does it come from are we using in CI @benjeffery? Is this some issue with Apple breaking things again?
I don't think this is Apple, as they don't provide Python 3.9, as far as I know. I suspect it's conda? Or maybe there's been a bug fix in Python 3.9.1?
I suspect it's conda?
You hit the problem via Homebrew as well though, right?
Both on 3.9.0.
I've just tried 3.9.1 compiled via the pyenv
package (https://github.com/pyenv/pyenv)
pyenv install -v 3.9-dev
# compiles python3.9d
python3.9d -m pytest
================================================================ test session starts =================================================================
platform darwin -- Python 3.9.1rc1+, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /Users/yan/Documents/Research/BDI/tsinfer, configfile: pytest.ini, testpaths: tests
plugins: xdist-2.1.0, cov-2.10.1, forked-1.3.0
gw0 ok / gw1 ok / gw2 ok / gw3 ok
gw0 [598] / gw1 [598] / gw2 [598] / gw3 [598]
Works.
So perhaps we should simply not support 3.9, but only 3.9.1 on OS X? Apparently 3.9.1 will be released in December.
I'd very much like to know what happened here - segfaults shouldn't happen, ever.
I can recompile cpython at 3.9.0 and see if it segfaults.
Yes please - including debug symbols if you can.
The CI python is from miniconda:
/usr/local/miniconda/envs/anaconda-client-env/bin/python
Python 3.9.0
Here are the flags it was built with:
CC = "clang"
CFLAGS = "-Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /usr/local/miniconda/envs/anaconda-client-env/include -fPIC -O2 -isystem /usr/local/miniconda/envs/anaconda-client-env/include "
Obtained with the awesome
python -m sysinfo
I'm running gdb
on CI now.
Proving tricky!
Reading symbols from python...
(No debugging symbols found in python)
warning: `/tmp/lto.o': can't open to read symbols: No such file or directory.
Starting program: /usr/local/miniconda/envs/anaconda-client-env/bin/python -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
Note: this version of macOS has System Integrity Protection.
Because `startup-with-shell' is enabled, gdb has worked around this by
caching a copy of your shell. The shell used by "run" is now:
/Users/runner/Library/Caches/gdb/bin/bash
Unable to find Mach task port for process-id 2464: (os/kern) failure (0x5).
(please check gdb is codesigned - see taskgated(8))
The program is not being run.
No stack.
Yes, that's why I was having problems!
You can run as sudo. Or code-sign the gdb binary.
Maybe the CLI is a red herring here, and it just happens to be the first test that provokes the segfault. Can we find a test that doesn't involve the CLI that segfaults?
If not, then maybe it's less worth worrying about and it might just be some bug in Python around subshells or something.
Maybe the CLI is a red herring here, and it just happens to be the first test that provokes the segfault. Can we find a test that doesn't involve the CLI that segfaults?
I gave a list of segfaulting tests at the top. Lots are not command-line, e.g. tests/test_inference.py::TestGeneratedAncestors::test_no_recombination
More oddities. This version of python, compiled from source by Pyenv, crashes:
(base) yan@Yans-New-Air tsinfer % /Users/yan/.pyenv/versions/3.9.0/bin/python3.9
Python 3.9.0 (default, Nov 30 2020, 17:02:19)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
>>>
Whereas this one, compiled by me from the GitHub repo, passes:
(base) yan@Yans-New-Air tsinfer % /usr/local/bin/python3.9
Python 3.9.0 (tags/v3.9.0:9cf6752276, Nov 30 2020, 20:02:19)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
loading with python3.9 -v
reveals that the only different library loaded is encodings.ascii
which isn't loaded in the version that crashes. But that can't be anything to do with it, I would have thought.
All very strange.
I'd be happy this isn't our fault if I could get a backtrace from gbd - did you have any success building these with debug symbols @hyanwong? Avoiding the CLI tests will get rid of a level of indirection, hopefully make things more straightforward.
Maybe the CLI is a red herring here, and it just happens to be the first test that provokes the segfault. Can we find a test that doesn't involve the CLI that segfaults?
tests/test_inference.py::TestUnfinalisedErrors::test_match_samples_unfinalised
I'm still digging. Can't get gdb working on CI as can't sign the binary so bisecting to find the cause.
Unfortunately the version with gdb is the one that passed. I'm getting intermittent segfault failures using 3.9.0rc1, all when calling evaluation.py
(see below), but not in a repeatable manner - seems like it could be a race condition. I don't (seem to) get them when calling evaluation.py directly though - I might just need to keep on rerunning. But I'm not sure whether to dig deeper as this might simply be an RC1 thing.
E subprocess.CalledProcessError: Command '['/usr/local/bin/python3.9', 'evaluation.py', 'ancestor-properties', '-n', '5', '-l', '0.5', '-d', '/var/folders/00/l36df9q14hv13hhw8wglncch0000gr/T/tsi_eval53fi_31d']' died with <Signals.SIGSEGV: 11>.
Yay! Got one (although this is using 3.9.0rc1, not the actual release:
(gdb) run evaluation.py ancestor-properties -n 5 -l 0.5
Starting program: /usr/local/bin/python3.9 evaluation.py ancestor-properties -n 5 -l 0.5
[New Thread 0x2607 of process 24849]
[New Thread 0x1b2f of process 24849]
warning: unhandled dyld version (16)
[New Thread 0x1e0b of process 24849]
[New Thread 0x1f03 of process 24849]
num_sites num_trees estimated_anc_num estimated_anc_mean_len ... recombination_rate mutation_rate Ne random_seed
length ...
50000.0 40.1 32.5 10.8 0.887581 ... 1.000000e-08 1.000000e-08 10000.0 617528857.4
100000.0 80.5 60.7 18.6 0.806291 ... 1.000000e-08 1.000000e-08 10000.0 680026157.5
150000.0 125.2 97.1 26.4 0.721979 ... 1.000000e-08 1.000000e-08 10000.0 618470870.1
200000.0 155.7 118.7 32.2 0.647576 ... 1.000000e-08 1.000000e-08 10000.0 441609977.5
250000.0 212.9 153.8 39.3 0.659095 ... 1.000000e-08 1.000000e-08 10000.0 514572925.7
300000.0 234.6 184.9 45.3 0.522309 ... 1.000000e-08 1.000000e-08 10000.0 557663632.2
350000.0 283.6 217.7 57.7 0.473193 ... 1.000000e-08 1.000000e-08 10000.0 581671619.9
400000.0 341.0 262.1 67.9 0.454860 ... 1.000000e-08 1.000000e-08 10000.0 442279087.5
450000.0 402.9 298.5 80.9 0.379989 ... 1.000000e-08 1.000000e-08 10000.0 446110263.2
500000.0 425.8 321.3 81.2 0.419182 ... 1.000000e-08 1.000000e-08 10000.0 621938852.6
[10 rows x 13 columns]
Thread 2 received signal SIGSEGV, Segmentation fault.
0x00000001001310bb in _Py_DECREF (filename=0x100400640 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
422 if (--op->ob_refcnt != 0) {
Ah, no, I get the same error with python 3.9.0 (the actual release, not the rc1 version), but it's not 100% repeatable. I'll see if it goes away with 3.9.1rc1
Thread 2 received signal SIGSEGV, Segmentation fault.
0x00000001001316eb in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
422 if (--op->ob_refcnt != 0) {
Can you get the full bt
?
I've added some logging and now the bug is gone.... ~looking like a race condition in TreeSequenceBuilder_add_mutations
~
It seems to disappear with python 3.9.1rc1. Stupidly I forgot to backtrace. Doh. Need to recompile 3.9.0 and do it again. Damn!
Here's the full bt
@benjeffery. Anecdotally it seems to be triggered only when I'm running a lot of threads at a time. I've left the GDB session running in case you need any more detail. Off to bed now!
Thread 2 received signal SIGSEGV, Segmentation fault.
0x00000001001316eb in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
422 if (--op->ob_refcnt != 0) {
(gdb) bt
#0 0x00000001001316eb in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
#1 0x00000001001316ac in _Py_XDECREF (op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:497
#2 0x0000000100130ebd in meth_dealloc (m=0x117ea29b0) at Objects/methodobject.c:169
#3 0x000000010013c5e6 in _Py_Dealloc (op=0x117ea29b0) at Objects/object.c:2209
#4 0x000000010011449e in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0x117ea29b0) at ./Include/object.h:430
#5 0x000000010011827c in _Py_XDECREF (op=0x117ea29b0) at ./Include/object.h:497
#6 0x000000010011ed49 in free_keys_object (keys=0x117525310) at Objects/dictobject.c:598
#7 0x00000001001164b9 in dictkeys_decref (dk=0x117525310) at Objects/dictobject.c:333
#8 0x000000010011ab06 in dict_dealloc (mp=0x117ea2f50) at Objects/dictobject.c:2026
#9 0x000000010013c5e6 in _Py_Dealloc (op=0x117ea2f50) at Objects/object.c:2209
#10 0x00000001002defbe in _Py_DECREF (filename=0x1003f67be "Python/pystate.c", lineno=768, op=0x117ea2f50) at ./Include/object.h:430
#11 0x00000001002e0624 in _PyInterpreterState_ClearModules (interp=0x102808210) at Python/pystate.c:768
#12 0x00000001002b6398 in _PyImport_Cleanup (tstate=0x100c05790) at Python/import.c:623
#13 0x00000001002d98bd in Py_FinalizeEx () at Python/pylifecycle.c:1431
#14 0x0000000100316d0d in Py_RunMain () at Modules/main.c:679
#15 0x00000001003170b9 in pymain_main (args=0x7ffeefbffab0) at Modules/main.c:707
#16 0x0000000100317107 in Py_BytesMain (argc=7, argv=0x7ffeefbffb20) at Modules/main.c:731
#17 0x00000001000013b2 in main ()
I think that confirms this is the same issue kind of issue as the pybind one you linked above. Will look into the reasoning there and how they fixed.
I'd be surprised if this turned out to be a problem with our code now - seems like a bug in Python like the pybind one? I guess we just disable the tests for now until 3.9.1 is released?
I don't think this is worth much more digging for now, so I agree we wait for 3.9.1.
I think we're satisfied that this isn't our bug, so I'm going to take it out of the next release and remove the "bug" label. We can keep it open until we've got CI running on 3.9.1, though.
The semi-standard OS X package manager "homebrew" has released python 3.9.1 and all the tests pass on my laptop with that installation. So hopefully when Github Actions & Conda have a 3.9.1 release this will all start working again. I'll run with 3.9 from now on, so I should spot any problems I hope.
I had a go at reproducing this using fastmac. I wasn't able to get a working environment in about an hour - I tried with homebrew/pip initially, and this failed because numcodecs/blosc seems to have some issue on recent llvm compilers. I tried with conda, but borked up the whole thing and had to give up. I'll have another go at some point when I get a chance.
Macs are such a horrid unix environment...
I doubt this will be the last time so certainly worth keeping what ever incantations you use to get the debugger working.
I've gotten a working environment using fastmac. Using python 3.9 from conda I can reproduce the segfault if I use the built-in clang 12 compiler. However, if I install clang11 from conda-forge (conda install -c conda-forge c-compiler
) and compile use this, there's no issue.
So, it does seem to be some connection with clang12, and maybe something to do with using different compilers for building Python and the extension module (which shouldn't matter, but hey). @hyanwong, did you get a version of this working where both Python and the extension module were build with clang12? I failed at this because of problems with numcodecs under clang12.
Huh, and now I can't reproduce even if I revert to using clang12 as the compiler. There must be something else in the toolchain that's causing the problem, which conda-forge c-compiler solves.
So, I'm inclined to say this is a weird mac problem caused by it being a fragmented crappily executed unix development environment. I think this is a wider problem (see the nodes here e.g.), so there's no much we can do. If people hit the problem we can advise them to conda install -c conda-forge c-compiler
, but we'll be shipping conda packages for tsinfer 0.2 soon, so that might no be necessary either. (Good motivation for shipping conda packages, too!)
I'll see if I can get CI working by changing the compilers.
Tests all pass with OS X on python 3.8, but on 3.9 there are multiple failures. The first I get, from running
python3 -m pytest -vs -n0
is intests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
:Running in parallel identifies at least these others as problems