tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0
56 stars 13 forks source link

Segfault in multiple tests on OS X, Python 3.9 #376

Closed hyanwong closed 3 years ago

hyanwong commented 3 years ago

Tests all pass with OS X on python 3.8, but on 3.9 there are multiple failures. The first I get, from running python3 -m pytest -vs -n0 is in tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors:

python3 -m pytest -vs -n0
================================================================ test session starts =================================================================
platform darwin -- Python 3.9.0, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /usr/local/anaconda3/envs/python39/bin/python3
cachedir: .pytest_cache
rootdir: /Users/yan/Documents/Research/BDI/tsinfer, configfile: pytest.ini, testpaths: tests
plugins: xdist-2.1.0, cov-2.10.1, forked-1.3.0
collected 598 items                                                                                                                                  

tests/test_cli.py::TestMain::test_cli_main PASSED
tests/test_cli.py::TestMain::test_main PASSED
tests/test_cli.py::TestDefaultPaths::test_get_default_path PASSED
tests/test_cli.py::TestDefaultPaths::test_get_ancestors_path PASSED
tests/test_cli.py::TestDefaultPaths::test_get_ancestors_trees_path PASSED
tests/test_cli.py::TestDefaultPaths::test_get_output_trees_path PASSED
tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors Fatal Python error: Segmentation fault

Current thread 0x0000000118bb1dc0 (most recent call first):
  File "/Users/yan/Documents/Research/BDI/tsinfer/tsinfer/inference.py", line 1095 in __complete_epoch
  File "/Users/yan/Documents/Research/BDI/tsinfer/tsinfer/inference.py", line 1122 in __match_ancestors_single_threaded
  File "/Users/yan/Documents/Research/BDI/tsinfer/tsinfer/inference.py", line 1169 in match_ancestors
  File "/Users/yan/Documents/Research/BDI/tsinfer/tsinfer/inference.py", line 370 in match_ancestors
  File "/Users/yan/Documents/Research/BDI/tsinfer/tests/test_cli.py", line 132 in setUp
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/unittest/case.py", line 547 in _callSetUp
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/unittest/case.py", line 589 in run
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/unittest/case.py", line 653 in __call__
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/unittest.py", line 278 in runtest
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 163 in pytest_runtest_call
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 256 in <lambda>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 310 in from_call
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 255 in call_runtest_hook
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 216 in call_and_report
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 127 in runtestprotocol
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/runner.py", line 110 in pytest_runtest_protocol
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/main.py", line 338 in pytest_runtestloop
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/main.py", line 313 in _main
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/main.py", line 257 in wrap_session
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/main.py", line 306 in pytest_cmdline_main
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/config/__init__.py", line 164 in main
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/_pytest/config/__init__.py", line 187 in console_main
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/site-packages/pytest/__main__.py", line 5 in <module>
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/runpy.py", line 87 in _run_code
  File "/usr/local/anaconda3/envs/python39/lib/python3.9/runpy.py", line 197 in _run_module_as_main
zsh: segmentation fault  python3 -m pytest -vs -n0

Running in parallel identifies at least these others as problems

maximum crashed workers reached: 16

====================================================================== FAILURES ======================================================================
_________________________________________________________________ tests/test_cli.py __________________________________________________________________
[gw1] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw1' crashed while running 'tests/test_cli.py::TestCommandsDefaults::test_infer_from_ts'
_________________________________________________________________ tests/test_cli.py __________________________________________________________________
[gw2] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw2' crashed while running 'tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors'
_________________________________________________________________ tests/test_cli.py __________________________________________________________________
[gw3] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw3' crashed while running 'tests/test_cli.py::TestCommandsDefaults::test_infer'
_________________________________________________________________ tests/test_cli.py __________________________________________________________________
[gw0] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw0' crashed while running 'tests/test_cli.py::TestCommandsDefaults::test_infer_bad_file'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw6] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw6' crashed while running 'tests/test_inference.py::TestAugmentedAncestorsRoundTrip::test_simple_example'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw7] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw7' crashed while running 'tests/test_inference.py::TestMissingDataRoundTrip::test_random_data_invariant_sites_ancestral_state'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw5] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw5' crashed while running 'tests/test_inference.py::TestUnfinalisedErrors::test_match_samples_unfinalised'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw8] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw8' crashed while running 'tests/test_inference.py::TestAlgorithmsExactlyEqualPrecision24::test_four_samples'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw9] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw9' crashed while running 'tests/test_inference.py::TestExtractAncestors::test_simple_simulation'
______________________________________________________________ tests/test_evaluation.py ______________________________________________________________
[gw10] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw10' crashed while running 'tests/test_evaluation.py::TestPerfectInference::test_single_tree_defaults'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw4] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw4' crashed while running 'tests/test_inference.py::TestAugmentedAncestorsRoundTrip::test_random_data_large_example'
______________________________________________________________ tests/test_evaluation.py ______________________________________________________________
[gw11] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw11' crashed while running 'tests/test_evaluation.py::TestPerfectInference::test_small_smc_path_compression'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw13] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw13' crashed while running 'tests/test_inference.py::TestSparseAncestorsRoundTrip::test_single_locus_two_samples'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw14] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw14' crashed while running 'tests/test_inference.py::TestGeneratedAncestors::test_no_recombination'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw12] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw12' crashed while running 'tests/test_inference.py::TestAugmentedAncestorsRoundTrip::test_single_locus'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw15] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw15' crashed while running 'tests/test_inference.py::TestMatchSamples::test_partial_bad_indexes'
______________________________________________________________ tests/test_inference.py _______________________________________________________________
[gw16] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw16' crashed while running 'tests/test_inference.py::TestPathCompressionSamplesCEngine::test_simulation_with_error'
______________________________________________________________ tests/test_evaluation.py ______________________________________________________________
[gw18] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw18' crashed while running 'tests/test_evaluation.py::TestMeanSampleAncestry::test_random_data_inferred_no_simplify'
______________________________________________________________ tests/test_evaluation.py ______________________________________________________________
[gw17] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw17' crashed while running 'tests/test_evaluation.py::TestCountSampleChildEdges::test_inferred_simplify'
______________________________________________________________ tests/test_evaluation.py ______________________________________________________________
[gw19] darwin -- Python 3.9.0 /usr/local/anaconda3/envs/python39/bin/python3
worker 'gw19' crashed while running 'tests/test_evaluation.py::TestNodeSpan::test_inferred'
jeromekelleher commented 3 years ago

Great, thanks @hyanwong. Can you run this through gdb please?

gdb python
run -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
bt

This should run until it segfaults, and then we want to print out the backtrace. (bt)

Or whatever equivalent debugger is present in the clang toolchain.

hyanwong commented 3 years ago

OS X needs to codesign gdb, or run under sudo. Once I do that, it's not super helpful with the default install of Python 3.9 (below). I guess I need to recompile Python 3.9 for this to work?

Reading symbols from /usr/local/bin/python3...
(No debugging symbols found in /usr/local/bin/python3)
(gdb) run -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
Starting program: /usr/local/bin/python3 -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
[New Thread 0x2603 of process 34225]
[New Thread 0x1903 of process 34225]
warning: unhandled dyld version (16)
[New Thread 0x2303 of process 34225]

Thread 3 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread 0x2303 of process 34225]
0x000000010000e000 in ?? ()
(gdb) bt
#0  0x000000010000e000 in ?? ()
#1  0x0000000100000000 in ?? ()
#2  0x0000000000000006 in ?? ()
#3  0x00007ffeefbffbc0 in ?? ()
#4  0x00007ffeefbffc39 in ?? ()
#5  0x00007ffeefbffc3c in ?? ()
#6  0x00007ffeefbffc43 in ?? ()
#7  0x00007ffeefbffc47 in ?? ()
#8  0x00007ffeefbffc4b in ?? ()
#9  0x0000000000000000 in ?? ()
benjeffery commented 3 years ago

Can you run

make clean
CFLAGS="-ggdb -O0" python3 setup.py build_ext --inplace

before the pytest?

Although if your stack is only 9 deep then we are probably still in Python-land, which is weird.

hyanwong commented 3 years ago

Oddly, that hangs when running.

I'll try to recompile Python 3.9 with the debug flags.

hyanwong commented 3 years ago

Hmm https://unconj.ca/blog/setting-up-gdb-for-debugging-python-on-os-x.html:

"Note that for some reason, Python always sends a SIGTRAP before it begins executing the script. Don’t worry, just continue. If you want to find out why your program segfaults, you can run bt:"

benjeffery commented 3 years ago

Ah didn't know that! Just do continue then!

hyanwong commented 3 years ago

Can't get the damn thing to do anything other than hang now. Argh.

benjeffery commented 3 years ago
make clean
make

?

hyanwong commented 3 years ago

Yep. Tried all that

jeromekelleher commented 3 years ago

Where did you get your Python from @hyanwong? Is there a debug version? We'll need the debug symbols everywhere to get to the bottom of this.

hyanwong commented 3 years ago

Slightly weirdly, after make; make clean, I can't now run the tests normally:

/usr/local/bin/python3.9 -m pytest tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
================================================================ test session starts =================================================================
platform darwin -- Python 3.9.0, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /Users/yan/Documents/Research/BDI/tsinfer, configfile: pytest.ini
plugins: xdist-2.1.0, cov-2.10.1, forked-1.3.0
gw0 [0] / gw1 [0] / gw2 [0] / gw3 [0]                                                                                                                

======================================================================= ERRORS =======================================================================
_________________________________________________________ ERROR collecting tests/test_cli.py _________________________________________________________
ImportError while importing test module '/Users/yan/Documents/Research/BDI/tsinfer/tests/test_cli.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/local/Cellar/python@3.9/3.9.0_2/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_cli.py:35: in <module>
    import tsinfer
tsinfer/__init__.py:38: in <module>
    from .inference import *  # NOQA
tsinfer/inference.py:36: in <module>
    import _tsinfer
E   ModuleNotFoundError: No module named '_tsinfer'
============================================================== short test summary info ===============================================================
ERROR tests/test_cli.py
hyanwong commented 3 years ago

Where did you get your Python from @hyanwong? Is there a debug version? We'll need the debug symbols everywhere to get to the bottom of this.

I'm trying it on 2 different python 3.9 versions, one from conda, another from the OS X installation repo "Homebrew". Neither seem to have debug symbols compiled in, so I guess I'll need to compile my own.

jeromekelleher commented 3 years ago

Yes, looks like compiling your own might be the simplest for getting debug symbols, as there's mac-specific technical issues with the conda distributed binaries.

hyanwong commented 3 years ago

I guess this is CPython, right?

benjeffery commented 3 years ago

From reading your link https://unconj.ca/blog/setting-up-gdb-for-debugging-python-on-os-x.html it seems they have symbols without recompiling python.

hyanwong commented 3 years ago

Well now. If I compile my own version of python3.9.1rc1, the tests all now pass.

jeromekelleher commented 3 years ago

Interesting/annoying. What version of Python and where does it come from are we using in CI @benjeffery? Is this some issue with Apple breaking things again?

hyanwong commented 3 years ago

I don't think this is Apple, as they don't provide Python 3.9, as far as I know. I suspect it's conda? Or maybe there's been a bug fix in Python 3.9.1?

jeromekelleher commented 3 years ago

I suspect it's conda?

You hit the problem via Homebrew as well though, right?

hyanwong commented 3 years ago

Both on 3.9.0.

I've just tried 3.9.1 compiled via the pyenv package (https://github.com/pyenv/pyenv)

pyenv install -v 3.9-dev
# compiles python3.9d
python3.9d -m pytest                      
================================================================ test session starts =================================================================
platform darwin -- Python 3.9.1rc1+, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
rootdir: /Users/yan/Documents/Research/BDI/tsinfer, configfile: pytest.ini, testpaths: tests
plugins: xdist-2.1.0, cov-2.10.1, forked-1.3.0
gw0 ok / gw1 ok / gw2 ok / gw3 ok
gw0 [598] / gw1 [598] / gw2 [598] / gw3 [598]

Works.

So perhaps we should simply not support 3.9, but only 3.9.1 on OS X? Apparently 3.9.1 will be released in December.

jeromekelleher commented 3 years ago

I'd very much like to know what happened here - segfaults shouldn't happen, ever.

hyanwong commented 3 years ago

I can recompile cpython at 3.9.0 and see if it segfaults.

jeromekelleher commented 3 years ago

Yes please - including debug symbols if you can.

benjeffery commented 3 years ago

The CI python is from miniconda: /usr/local/miniconda/envs/anaconda-client-env/bin/python Python 3.9.0

Here are the flags it was built with:

CC = "clang"
CFLAGS = "-Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -fwrapv -O2 -Wall -fPIC  -O2  -isystem /usr/local/miniconda/envs/anaconda-client-env/include -fPIC  -O2  -isystem /usr/local/miniconda/envs/anaconda-client-env/include "

Obtained with the awesome python -m sysinfo

benjeffery commented 3 years ago

I'm running gdb on CI now.

benjeffery commented 3 years ago

Proving tricky!

Reading symbols from python...
(No debugging symbols found in python)
warning: `/tmp/lto.o': can't open to read symbols: No such file or directory.
Starting program: /usr/local/miniconda/envs/anaconda-client-env/bin/python -m pytest -n0 -vs tests/test_cli.py::TestCommandsDefaults::test_augment_ancestors
Note: this version of macOS has System Integrity Protection.
Because `startup-with-shell' is enabled, gdb has worked around this by
caching a copy of your shell.  The shell used by "run" is now:
    /Users/runner/Library/Caches/gdb/bin/bash
Unable to find Mach task port for process-id 2464: (os/kern) failure (0x5).
 (please check gdb is codesigned - see taskgated(8))
The program is not being run.
No stack.
hyanwong commented 3 years ago

Yes, that's why I was having problems!

hyanwong commented 3 years ago

You can run as sudo. Or code-sign the gdb binary.

jeromekelleher commented 3 years ago

Maybe the CLI is a red herring here, and it just happens to be the first test that provokes the segfault. Can we find a test that doesn't involve the CLI that segfaults?

If not, then maybe it's less worth worrying about and it might just be some bug in Python around subshells or something.

hyanwong commented 3 years ago

Maybe the CLI is a red herring here, and it just happens to be the first test that provokes the segfault. Can we find a test that doesn't involve the CLI that segfaults?

I gave a list of segfaulting tests at the top. Lots are not command-line, e.g. tests/test_inference.py::TestGeneratedAncestors::test_no_recombination

hyanwong commented 3 years ago

More oddities. This version of python, compiled from source by Pyenv, crashes:

(base) yan@Yans-New-Air tsinfer % /Users/yan/.pyenv/versions/3.9.0/bin/python3.9          
Python 3.9.0 (default, Nov 30 2020, 17:02:19) 
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
>>> 

Whereas this one, compiled by me from the GitHub repo, passes:

(base) yan@Yans-New-Air tsinfer % /usr/local/bin/python3.9                      
Python 3.9.0 (tags/v3.9.0:9cf6752276, Nov 30 2020, 20:02:19) 
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin

loading with python3.9 -v reveals that the only different library loaded is encodings.ascii which isn't loaded in the version that crashes. But that can't be anything to do with it, I would have thought.

hyanwong commented 3 years ago

Could it be the same issue as pybind11 had, here?

jeromekelleher commented 3 years ago

All very strange.

jeromekelleher commented 3 years ago

I'd be happy this isn't our fault if I could get a backtrace from gbd - did you have any success building these with debug symbols @hyanwong? Avoiding the CLI tests will get rid of a level of indirection, hopefully make things more straightforward.

benjeffery commented 3 years ago

Maybe the CLI is a red herring here, and it just happens to be the first test that provokes the segfault. Can we find a test that doesn't involve the CLI that segfaults?

tests/test_inference.py::TestUnfinalisedErrors::test_match_samples_unfinalised

I'm still digging. Can't get gdb working on CI as can't sign the binary so bisecting to find the cause.

hyanwong commented 3 years ago

Unfortunately the version with gdb is the one that passed. I'm getting intermittent segfault failures using 3.9.0rc1, all when calling evaluation.py (see below), but not in a repeatable manner - seems like it could be a race condition. I don't (seem to) get them when calling evaluation.py directly though - I might just need to keep on rerunning. But I'm not sure whether to dig deeper as this might simply be an RC1 thing.

E               subprocess.CalledProcessError: Command '['/usr/local/bin/python3.9', 'evaluation.py', 'ancestor-properties', '-n', '5', '-l', '0.5', '-d', '/var/folders/00/l36df9q14hv13hhw8wglncch0000gr/T/tsi_eval53fi_31d']' died with <Signals.SIGSEGV: 11>.
hyanwong commented 3 years ago

Yay! Got one (although this is using 3.9.0rc1, not the actual release:

(gdb) run  evaluation.py ancestor-properties -n 5 -l 0.5
Starting program: /usr/local/bin/python3.9 evaluation.py ancestor-properties -n 5 -l 0.5
[New Thread 0x2607 of process 24849]
[New Thread 0x1b2f of process 24849]
warning: unhandled dyld version (16)
[New Thread 0x1e0b of process 24849]
[New Thread 0x1f03 of process 24849]
          num_sites  num_trees  estimated_anc_num  estimated_anc_mean_len  ...  recombination_rate  mutation_rate       Ne  random_seed
length                                                                     ...                                                         
50000.0        40.1       32.5               10.8                0.887581  ...        1.000000e-08   1.000000e-08  10000.0  617528857.4
100000.0       80.5       60.7               18.6                0.806291  ...        1.000000e-08   1.000000e-08  10000.0  680026157.5
150000.0      125.2       97.1               26.4                0.721979  ...        1.000000e-08   1.000000e-08  10000.0  618470870.1
200000.0      155.7      118.7               32.2                0.647576  ...        1.000000e-08   1.000000e-08  10000.0  441609977.5
250000.0      212.9      153.8               39.3                0.659095  ...        1.000000e-08   1.000000e-08  10000.0  514572925.7
300000.0      234.6      184.9               45.3                0.522309  ...        1.000000e-08   1.000000e-08  10000.0  557663632.2
350000.0      283.6      217.7               57.7                0.473193  ...        1.000000e-08   1.000000e-08  10000.0  581671619.9
400000.0      341.0      262.1               67.9                0.454860  ...        1.000000e-08   1.000000e-08  10000.0  442279087.5
450000.0      402.9      298.5               80.9                0.379989  ...        1.000000e-08   1.000000e-08  10000.0  446110263.2
500000.0      425.8      321.3               81.2                0.419182  ...        1.000000e-08   1.000000e-08  10000.0  621938852.6

[10 rows x 13 columns]

Thread 2 received signal SIGSEGV, Segmentation fault.
0x00000001001310bb in _Py_DECREF (filename=0x100400640 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
422     if (--op->ob_refcnt != 0) {
hyanwong commented 3 years ago

Ah, no, I get the same error with python 3.9.0 (the actual release, not the rc1 version), but it's not 100% repeatable. I'll see if it goes away with 3.9.1rc1

Thread 2 received signal SIGSEGV, Segmentation fault.
0x00000001001316eb in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
422     if (--op->ob_refcnt != 0) {
benjeffery commented 3 years ago

Can you get the full bt? I've added some logging and now the bug is gone.... ~looking like a race condition in TreeSequenceBuilder_add_mutations~

hyanwong commented 3 years ago

It seems to disappear with python 3.9.1rc1. Stupidly I forgot to backtrace. Doh. Need to recompile 3.9.0 and do it again. Damn!

hyanwong commented 3 years ago

Here's the full bt @benjeffery. Anecdotally it seems to be triggered only when I'm running a lot of threads at a time. I've left the GDB session running in case you need any more detail. Off to bed now!

Thread 2 received signal SIGSEGV, Segmentation fault.
0x00000001001316eb in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
422     if (--op->ob_refcnt != 0) {
(gdb) bt
#0  0x00000001001316eb in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:422
#1  0x00000001001316ac in _Py_XDECREF (op=0xfdfdfdfdfdfdfdfd) at ./Include/object.h:497
#2  0x0000000100130ebd in meth_dealloc (m=0x117ea29b0) at Objects/methodobject.c:169
#3  0x000000010013c5e6 in _Py_Dealloc (op=0x117ea29b0) at Objects/object.c:2209
#4  0x000000010011449e in _Py_DECREF (filename=0x1003fc650 "./Include/object.h", lineno=497, op=0x117ea29b0) at ./Include/object.h:430
#5  0x000000010011827c in _Py_XDECREF (op=0x117ea29b0) at ./Include/object.h:497
#6  0x000000010011ed49 in free_keys_object (keys=0x117525310) at Objects/dictobject.c:598
#7  0x00000001001164b9 in dictkeys_decref (dk=0x117525310) at Objects/dictobject.c:333
#8  0x000000010011ab06 in dict_dealloc (mp=0x117ea2f50) at Objects/dictobject.c:2026
#9  0x000000010013c5e6 in _Py_Dealloc (op=0x117ea2f50) at Objects/object.c:2209
#10 0x00000001002defbe in _Py_DECREF (filename=0x1003f67be "Python/pystate.c", lineno=768, op=0x117ea2f50) at ./Include/object.h:430
#11 0x00000001002e0624 in _PyInterpreterState_ClearModules (interp=0x102808210) at Python/pystate.c:768
#12 0x00000001002b6398 in _PyImport_Cleanup (tstate=0x100c05790) at Python/import.c:623
#13 0x00000001002d98bd in Py_FinalizeEx () at Python/pylifecycle.c:1431
#14 0x0000000100316d0d in Py_RunMain () at Modules/main.c:679
#15 0x00000001003170b9 in pymain_main (args=0x7ffeefbffab0) at Modules/main.c:707
#16 0x0000000100317107 in Py_BytesMain (argc=7, argv=0x7ffeefbffb20) at Modules/main.c:731
#17 0x00000001000013b2 in main ()
benjeffery commented 3 years ago

I think that confirms this is the same issue kind of issue as the pybind one you linked above. Will look into the reasoning there and how they fixed.

jeromekelleher commented 3 years ago

I'd be surprised if this turned out to be a problem with our code now - seems like a bug in Python like the pybind one? I guess we just disable the tests for now until 3.9.1 is released?

benjeffery commented 3 years ago

I don't think this is worth much more digging for now, so I agree we wait for 3.9.1.

jeromekelleher commented 3 years ago

I think we're satisfied that this isn't our bug, so I'm going to take it out of the next release and remove the "bug" label. We can keep it open until we've got CI running on 3.9.1, though.

hyanwong commented 3 years ago

The semi-standard OS X package manager "homebrew" has released python 3.9.1 and all the tests pass on my laptop with that installation. So hopefully when Github Actions & Conda have a 3.9.1 release this will all start working again. I'll run with 3.9 from now on, so I should spot any problems I hope.

jeromekelleher commented 3 years ago

I had a go at reproducing this using fastmac. I wasn't able to get a working environment in about an hour - I tried with homebrew/pip initially, and this failed because numcodecs/blosc seems to have some issue on recent llvm compilers. I tried with conda, but borked up the whole thing and had to give up. I'll have another go at some point when I get a chance.

Macs are such a horrid unix environment...

benjeffery commented 3 years ago

I doubt this will be the last time so certainly worth keeping what ever incantations you use to get the debugger working.

jeromekelleher commented 3 years ago

I've gotten a working environment using fastmac. Using python 3.9 from conda I can reproduce the segfault if I use the built-in clang 12 compiler. However, if I install clang11 from conda-forge (conda install -c conda-forge c-compiler) and compile use this, there's no issue.

So, it does seem to be some connection with clang12, and maybe something to do with using different compilers for building Python and the extension module (which shouldn't matter, but hey). @hyanwong, did you get a version of this working where both Python and the extension module were build with clang12? I failed at this because of problems with numcodecs under clang12.

jeromekelleher commented 3 years ago

Huh, and now I can't reproduce even if I revert to using clang12 as the compiler. There must be something else in the toolchain that's causing the problem, which conda-forge c-compiler solves.

So, I'm inclined to say this is a weird mac problem caused by it being a fragmented crappily executed unix development environment. I think this is a wider problem (see the nodes here e.g.), so there's no much we can do. If people hit the problem we can advise them to conda install -c conda-forge c-compiler, but we'll be shipping conda packages for tsinfer 0.2 soon, so that might no be necessary either. (Good motivation for shipping conda packages, too!)

I'll see if I can get CI working by changing the compilers.