Closed matthiaskoenig closed 5 months ago
Trying to figure this out and will provide updates along the way. Will try to provide a minimal example.
Edit: with the old commit from December I get the following segmentation faults:
*** SIGSEGV received at time=1644922989 on cpu 11 ***
PC: @ 0x7f220d45948b (unknown) rr::doublematrix_to_py()
@ 0x7f2262f47210 544876752 (unknown)
@ 0x7f220d3c838c (unknown) _wrap_RoadRunner__simulate
@ 0x5cdb220 149681456 (unknown)
@ 0x7f220d46ad30 (unknown) (unknown)
@ 0x10c08348086f8b48 (unknown) (unknown)
[2022-02-15 12:03:09,551 E 2086665 2086665] logging.cc:317: *** SIGSEGV received at time=1644922989 on cpu 11 ***
[2022-02-15 12:03:09,551 E 2086665 2086665] logging.cc:317: PC: @ 0x7f220d45948b (unknown) rr::doublematrix_to_py()
[2022-02-15 12:03:09,552 E 2086665 2086665] logging.cc:317: @ 0x7f2262f47210 544876752 (unknown)
[2022-02-15 12:03:09,552 E 2086665 2086665] logging.cc:317: @ 0x7f220d3c838c (unknown) _wrap_RoadRunner__simulate
[2022-02-15 12:03:09,553 E 2086665 2086665] logging.cc:317: @ 0x5cdb220 149681456 (unknown)
[2022-02-15 12:03:09,557 E 2086665 2086665] logging.cc:317: @ 0x7f220d46ad30 (unknown) (unknown)
[2022-02-15 12:03:09,558 E 2086665 2086665] logging.cc:317: @ 0x10c08348086f8b48 (unknown) (unknown)
Fatal Python error: Segmentation fault
Good news: the latest commit seems to work. So the issue seems to be fixed in develop already, but exists in the latest release ;)
I.e. the wheels from https://github.com/sys-bio/roadrunner/commit/05b9c7664e097c1b85e11a63a3fef754d22b1f4a https://github.com/sys-bio/roadrunner/runs/5165861615 worked
Could you make a bugfix release 2.2.1? The current version 2.2.0 is not working.
Edit: unfortunately, this is still not working.
Now I am getting the same segmentation faults with the latest develop version as well (see below). My feeling is that these issues are related to ray; removing ray from roadrunner would also allow the py3.10 release. This is a big issue, because I cannot get any roadrunner version after December working. Things are working with:
numpy==1.21.2
libroadrunner (version until Dec 2021)
All versions from January after the parallelization result in segmentation faults.
(pid=2110313) [2022-02-15 12:42:20,690 E 2110313 2110313] logging.cc:313: *** SIGSEGV received at time=1644925340 on cpu 29 ***
(pid=2110313) [2022-02-15 12:42:20,690 E 2110313 2110313] logging.cc:313: PC: @ 0x7fe0c91e3bd2 (unknown) std::default_delete<>::operator()()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe3b295a210 (unknown) (unknown)
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c91e2292 64 std::unique_ptr<>::~unique_ptr()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c9227cea 32 rrllvm::Jit::~Jit()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c923b904 32 rrllvm::MCJit::~MCJit()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c923b920 32 rrllvm::MCJit::~MCJit()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c91b81ca 32 std::default_delete<>::operator()()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c91b7972 64 std::unique_ptr<>::~unique_ptr()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c9201644 848 rrllvm::ModelResources::~ModelResources()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c9197ddc 48 std::_Sp_counted_ptr<>::_M_dispose()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c909414d 128 std::_Sp_counted_base<>::_M_release()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c908a91b 32 std::__shared_count<>::~__shared_count()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c9191710 32 std::__shared_ptr<>::~__shared_ptr()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c919172c 32 std::shared_ptr<>::~shared_ptr()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c918348b 432 rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c91834ee 32 rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c91071f5 32 std::default_delete<>::operator()()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c9103660 64 std::unique_ptr<>::~unique_ptr()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c91019a9 448 rr::RoadRunnerImpl::~RoadRunnerImpl()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c90d1142 48 rr::RoadRunner::~RoadRunner()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c90d116e 32 rr::RoadRunner::~RoadRunner()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c9037faa 112 _wrap_delete_RoadRunner
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x7fe0c90038fb 144 SwigPyObject_dealloc
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313: @ 0x5d1e78 (unknown) (unknown)
(pid=2110313) [2022-02-15 12:42:20,692 E 2110313 2110313] logging.cc:313: @ 0x90bf00 (unknown) (unknown)
(pid=2110313) Fatal Python error: Segmentation fault
(pid=2110313)
(pid=2110313) Stack (most recent call first):
(pid=2110313) File "/home/mkoenig/git/sbmlsim/src/sbmlsim/simulator/simulation_ray.py", line 47 in set_model
(pid=2110313) File "/home/mkoenig/envs/pkdb_models/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 451 in _resume_span
(pid=2110313) File "/home/mkoenig/envs/pkdb_models/lib/python3.8/site-packages/ray/_private/function_manager.py", line 576 in actor_method_executor
(pid=2110313) File "/home/mkoenig/envs/pkdb_models/lib/python3.8/site-packages/ray/worker.py", line 425 in main_loop
(pid=2110313) File "/home/mkoenig/envs/pkdb_models/lib/python3.8/site-packages/ray/workers/default_worker.py", line 218 in <module>
(pid=2110308) @ 0x7fd86db9891b 32 std::__shared_count<>::~__shared_count()
(pid=2110308) @ 0x7fd86dc9f710 32 std::__shared_ptr<>::~__shared_ptr()
(pid=2110308) @ 0x7fd86dc9f72c 32 std::shared_ptr<>::~shared_ptr()
(pid=2110308) @ 0x7fd86dc9148b 432 rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
(pid=2110308) @ 0x7fd86dc914ee 32 rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
(pid=2110308) @ 0x7fd86dc151f5 32 std::default_delete<>::operator()()
(pid=2110308) @ 0x7fd86dc11660 64 std::unique_ptr<>::~unique_ptr()
(pid=2110308) @ 0x7fd86dc0f9a9 448 rr::RoadRunnerImpl::~RoadRunnerImpl()
(pid=2110308) @ 0x7fd86dbdf142 48 rr::RoadRunner::~RoadRunner()
(pid=2110308) @ 0x7fd86dbdf16e 32 rr::RoadRunner::~RoadRunner()
(pid=2110308) @ 0x7fd86db45faa 112 _wrap_delete_RoadRunner
(pid=2110308) @ 0x7fd86db118fb 144 SwigPyObject_dealloc
(pid=2110308) @ 0x5d1e78 (unknown) (unknown)
(pid=2110308) @ 0x90bf00 (unknown) (unknown)
Ciaran, I think we need to remove the multiprocessing support until the code has been tested more.
See the latest comment from Matthias; it seems the latest develop is more stable.
I doubt the crash has anything to do with ray, as it's not used internally at all, and is not even installed with the distribution; it's just used for internal testing. (i.e. it's not in requirements.txt; it's just in test-requirements.txt) However, since we're not using it for much, we might as well remove it; it's just that I don't think it'll help the crashes.
I think instead the problem is likely to be that Arch Linux is incompatible with manylinux2014. This would explain why it refuses to install, and why there are mysterious crashes when you try anyway. Given that the various flavors of linux have long been binary incompatible, this isn't so much a surprise as the surprise that 'old CentOS' was as cross-compatible as it was for so long. It certainly used to be the case that if you used linux, you were going to have to compile your own programs; we may be returning to those days.
I'll see if we can set up a build for arch linux directly and do some research to find out if there's a new cross-compatible linux format we can switch to in general.
Out of interest, do we still have the same problems with the LLJit compiler?
from roadrunner import Config
Config.setValue(Config.LLVM_BACKEND, Config.LLJIT)
If this is an Arch Linux problem, we should consider adding a docker image for this platform in azure.
Some feedback on ray: it was just a thought that if different versions of ray are installed, the plasma store used by ray could be incompatible. But yes, I think this is not the case.
The LLJit compiler did not work; it just gave other segmentation faults.
Still trying to create a minimal example, but all small test cases work. So it could be a problem with how I handle the state and bring it to the other cores (some race conditions or similar issues). It could be that the speedup in model loading on the SBML side caused the bug to only now show up on my side (e.g. the SBML loading was so slow that there was never a problem before, but now it is so fast that some race conditions create an issue).
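The "files not written/closed yet" hypothesis can be illustrated generically (this is a plain-Python sketch with hypothetical helper names, not roadrunner API): a worker that reads a state file while the writer is still flushing can see a truncated file, whereas writing to a temporary file and atomically renaming it means readers only ever see a complete file.

```python
import json
import os
import tempfile


def write_state_atomically(path: str, state: dict) -> None:
    """Write state to a temp file in the target directory, then rename.

    os.replace is atomic on POSIX, so a concurrent reader either sees the
    old complete file or the new complete file, never a partial one.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())  # make sure bytes hit disk before the rename
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise


def read_state(path: str) -> dict:
    with open(path) as fh:
        return json.load(fh)
```

With a naive `open(path, "w")` in the writer, speeding up the producer (as the faster SBML loading does) shrinks the window in which readers see a half-written file on some runs and widens it on others, which matches the "worked before, flaky now" symptom.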
Sorry you are having these issues Matthias. We released because all of our tests passed, though admittedly we should not have jumped the gun and skipped the experimental release. I'll wait on the MWE and then will take a look at this issue (guessing before we've seen this won't be all that fruitful). Best,
Ciaran
Would it be worth attempting to compile roadrunner on your machine directly? I believe llvm13 is available via apt, and everything else is available via libroadrunner-deps. If it doesn't solve the problem, that's at least one possible cause we could eliminate. I think I would only try this if it proved particularly difficult to come up with an example, since that example failing on a different machine would also prove the same thing.
I could reproduce the issue on my laptop, i.e. different computer. So this is definitely a real bug.
I am working on narrowing things down. The segmentation faults seem to be due to model loading via the state on the multiple cores. I think the underlying cause is that model loading from SBML is now much faster (without reading the SBML multiple times), so that probably some race conditions appear (such as files not written/closed yet).
Nothing to do for now; I have the feeling this could be on my side, in how I distribute the state over multiple cores.
Well, definitely keep us in the loop! We're more than happy to fix anything on our end if it turns out to be us (or even if we can make things easier on you with some sort of change in the new code).
I think the problem will be fixed by putting another lock in somewhere around some critical code. I suspect there is a data race somewhere… These are hard to find except with a thread sanitizer.
Ciaran can you work on this with Matthias?
Yep, absolutely. @matthiaskoenig do you have some code that we can run to try to reproduce the problem? Don’t worry if it’s not a super “minimal working example” or anything - at this point we just need to see what kind of situation causes the problem you’re seeing (even if it’s on your end and not ours). Thanks
@matthiaskoenig : Did this issue get solved?
No, this is still a major issue for us, so we have pinned roadrunner to 2.1.3 for many workflows. We wanted to completely redesign our workflows around ray to see if this can get fixed. This happens when we run larger real-world workflows with larger models, but we could not get a minimal example for this so far. I think this is probably some race condition when we bring large models onto many cores; we just don't see this with a few cores and small examples. I will have a look again next week.
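For context, the pin we use in the meantime looks like this (both versions are the ones mentioned earlier in this thread; adjust to your environment):

```shell
pip install "libroadrunner==2.1.3" "numpy==1.21.2"
```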
Duplicate of #1210, which creates a minimal example for the issue.
Not sure what happened since around mid-December, but the latest
libroadrunner==2.2.0
and libroadrunner-experimental==2.2.0
result in segmentation faults and dying Linux kernels on most of my workflows/simulations. I.e., I get things such as:
I am pretty sure it is related to the code from the following: https://github.com/sys-bio/roadrunner/issues/925 (merged beginning of January), i.e. the internal parallelization. Please, please provide a roadrunner without any internal parallelization, i.e. a single Python thread on a single core! Internal threading will create issues in any multiprocessing on clusters. The current libroadrunner==2.2.0 is not working for me at all. This is a big issue, because it breaks the scripts of all my students at the moment.
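The cluster workflows described here assume exactly one single-threaded simulation per worker process. A generic sketch of that pattern (plain Python multiprocessing; `simulate` is a hypothetical stand-in, not roadrunner API):

```python
from multiprocessing import Pool


def simulate(model_id: int) -> int:
    # Hypothetical stand-in for one single-threaded simulation; the real
    # workflow would build and run one model per call here instead.
    return sum(k * model_id for k in range(100))


def run_all(n_models: int = 8, n_procs: int = 4):
    # Process-level parallelism only: each worker runs one simulation at a
    # time, so any threading *inside* simulate() would oversubscribe cores
    # and can conflict with the pool's own process management.
    with Pool(processes=n_procs) as pool:
        return pool.map(simulate, range(n_models))


if __name__ == "__main__":
    print(run_all())
```

This is why library-internal parallelization is problematic for such setups: with N pool workers each spawning their own threads, the scheduler is oversubscribed and any thread-unsafe state in the library is exercised from many processes and threads at once.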