sys-bio / roadrunner

libRoadRunner: A high-performance SBML simulator
http://libroadrunner.org/

libroadrunner==2.2.0 results in segmentation faults on almost all my models/workflows #963

Closed matthiaskoenig closed 5 months ago

matthiaskoenig commented 2 years ago

Not sure what happened since around mid-December, but the latest libroadrunner==2.2.0 and libroadrunner-experimental==2.2.0 result in segmentation faults and dying Linux kernels on most of my workflows/simulations.

For example, I get output such as:

*** SIGSEGV received at time=1644920070 on cpu 10 ***
PC: @     0x7fcad4f2851e  (unknown)  std::default_delete<>::operator()()
    @     0x7fcb388d93c0  1075403472  (unknown)
    @     0x7fcad4f26bde         64  std::unique_ptr<>::~unique_ptr()
    @     0x7fcad4f6c636         32  rrllvm::Jit::~Jit()
    @     0x7fcad4f80250         32  rrllvm::MCJit::~MCJit()
    @     0x7fcad4f8026c         32  rrllvm::MCJit::~MCJit()
    @     0x7fcad4efcb16         32  std::default_delete<>::operator()()
    @     0x7fcad4efc2be         64  std::unique_ptr<>::~unique_ptr()
    @     0x7fcad4f45f90        848  rrllvm::ModelResources::~ModelResources()
    @     0x7fcad4edc728         48  std::_Sp_counted_ptr<>::_M_dispose()
    @     0x7fcad4dd8c2d        128  std::_Sp_counted_base<>::_M_release()
    @     0x7fcad4dcf3fb         32  std::__shared_count<>::~__shared_count()
    @     0x7fcad4ed605c         32  std::__shared_ptr<>::~__shared_ptr()
    @     0x7fcad4ed6078         32  std::shared_ptr<>::~shared_ptr()
    @     0x7fcad4ec7f23        432  rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
    @     0x7fcad4ec7f86         32  rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
    @     0x7fcad4e4bc9f         32  std::default_delete<>::operator()()
    @     0x7fcad4e4810a         64  std::unique_ptr<>::~unique_ptr()
    @     0x7fcad4e46453        448  rr::RoadRunnerImpl::~RoadRunnerImpl()
    @     0x7fcad4e15dce         48  rr::RoadRunner::~RoadRunner()
    @     0x7fcad4e15dfa         32  rr::RoadRunner::~RoadRunner()
    @     0x7fcad4d7c962        112  _wrap_delete_RoadRunner
    @     0x7fcad4d498c0        144  SwigPyObject_dealloc
    @           0x532b95  (unknown)  (unknown)
    @           0x8feca0  (unknown)  (unknown)
[2022-02-15 11:14:30,802 E 2063541 2063541] logging.cc:317: *** SIGSEGV received at time=1644920070 on cpu 10 ***
[2022-02-15 11:14:30,802 E 2063541 2063541] logging.cc:317: PC: @     0x7fcad4f2851e  (unknown)  std::default_delete<>::operator()()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcb388d93c0  1075403472  (unknown)
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4f26bde         64  std::unique_ptr<>::~unique_ptr()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4f6c636         32  rrllvm::Jit::~Jit()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4f80250         32  rrllvm::MCJit::~MCJit()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4f8026c         32  rrllvm::MCJit::~MCJit()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4efcb16         32  std::default_delete<>::operator()()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4efc2be         64  std::unique_ptr<>::~unique_ptr()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4f45f90        848  rrllvm::ModelResources::~ModelResources()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4edc728         48  std::_Sp_counted_ptr<>::_M_dispose()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4dd8c2d        128  std::_Sp_counted_base<>::_M_release()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4dcf3fb         32  std::__shared_count<>::~__shared_count()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4ed605c         32  std::__shared_ptr<>::~__shared_ptr()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4ed6078         32  std::shared_ptr<>::~shared_ptr()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4ec7f23        432  rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4ec7f86         32  rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4e4bc9f         32  std::default_delete<>::operator()()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4e4810a         64  std::unique_ptr<>::~unique_ptr()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4e46453        448  rr::RoadRunnerImpl::~RoadRunnerImpl()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4e15dce         48  rr::RoadRunner::~RoadRunner()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4e15dfa         32  rr::RoadRunner::~RoadRunner()
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4d7c962        112  _wrap_delete_RoadRunner
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @     0x7fcad4d498c0        144  SwigPyObject_dealloc
[2022-02-15 11:14:30,803 E 2063541 2063541] logging.cc:317:     @           0x532b95  (unknown)  (unknown)
[2022-02-15 11:14:30,804 E 2063541 2063541] logging.cc:317:     @           0x8feca0  (unknown)  (unknown)
Fatal Python error: Segmentation fault

I am pretty sure it is related to the following: https://github.com/sys-bio/roadrunner/issues/925 (merged at the beginning of January), i.e. the internal parallelization. Please, please provide a roadrunner without any internal parallelization, i.e. a single Python thread on a single core! Internal parallelization will create issues with any multiprocessing on clusters. The current libroadrunner==2.2.0 is not working for me at all. This is a big issue, because it breaks the scripts of all my students at the moment.

matthiaskoenig commented 2 years ago

Trying to figure this out and will provide updates along the way. Will try to provide a minimal example.

Edit: With the old commit from December I get the following segmentation faults:

*** SIGSEGV received at time=1644922989 on cpu 11 ***
PC: @     0x7f220d45948b  (unknown)  rr::doublematrix_to_py()
    @     0x7f2262f47210  544876752  (unknown)
    @     0x7f220d3c838c  (unknown)  _wrap_RoadRunner__simulate
    @          0x5cdb220  149681456  (unknown)
    @     0x7f220d46ad30  (unknown)  (unknown)
    @ 0x10c08348086f8b48  (unknown)  (unknown)
[2022-02-15 12:03:09,551 E 2086665 2086665] logging.cc:317: *** SIGSEGV received at time=1644922989 on cpu 11 ***
[2022-02-15 12:03:09,551 E 2086665 2086665] logging.cc:317: PC: @     0x7f220d45948b  (unknown)  rr::doublematrix_to_py()
[2022-02-15 12:03:09,552 E 2086665 2086665] logging.cc:317:     @     0x7f2262f47210  544876752  (unknown)
[2022-02-15 12:03:09,552 E 2086665 2086665] logging.cc:317:     @     0x7f220d3c838c  (unknown)  _wrap_RoadRunner__simulate
[2022-02-15 12:03:09,553 E 2086665 2086665] logging.cc:317:     @          0x5cdb220  149681456  (unknown)
[2022-02-15 12:03:09,557 E 2086665 2086665] logging.cc:317:     @     0x7f220d46ad30  (unknown)  (unknown)
[2022-02-15 12:03:09,558 E 2086665 2086665] logging.cc:317:     @ 0x10c08348086f8b48  (unknown)  (unknown)
Fatal Python error: Segmentation fault

matthiaskoenig commented 2 years ago

Good news, the latest commit seems to work. So the issue seems to be already fixed in develop, but exists in the latest release ;)

I.e. the wheels from commit https://github.com/sys-bio/roadrunner/commit/05b9c7664e097c1b85e11a63a3fef754d22b1f4a (CI run https://github.com/sys-bio/roadrunner/runs/5165861615) worked.

Could you make a bugfix release 2.2.1? The current version 2.2.0 is not working.

Edit: Unfortunately, still not working.

matthiaskoenig commented 2 years ago

Now getting the same segmentation faults with the latest develop version as well (see below). My feeling is that these issues are related to:

  1. including ray dependencies with roadrunner and then running roadrunner with multiprocessing/ray. I could imagine that there are clashes somehow. So it would be great to remove ray from roadrunner, which would also allow the py3.10 release
  2. new multiprocessing/threading code in roadrunner. This could create issues when running workflows with multiprocessing/ray, which clashes with the internal roadrunner parallelization. roadrunner should just be a single-threaded program, or there must be an option for that. I did not see any issues before the parallelization updates in January, so this is most likely the problem.

This is a big issue, because I cannot get any roadrunner version after December working. Things are working with:

numpy==1.21.2
libroadrunner (version until Dec 2021)

All versions from January onward, i.e. after the parallelization, result in segmentation faults:

(pid=2110313) [2022-02-15 12:42:20,690 E 2110313 2110313] logging.cc:313: *** SIGSEGV received at time=1644925340 on cpu 29 ***
(pid=2110313) [2022-02-15 12:42:20,690 E 2110313 2110313] logging.cc:313: PC: @     0x7fe0c91e3bd2  (unknown)  std::default_delete<>::operator()()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe3b295a210  (unknown)  (unknown)
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c91e2292         64  std::unique_ptr<>::~unique_ptr()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c9227cea         32  rrllvm::Jit::~Jit()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c923b904         32  rrllvm::MCJit::~MCJit()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c923b920         32  rrllvm::MCJit::~MCJit()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c91b81ca         32  std::default_delete<>::operator()()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c91b7972         64  std::unique_ptr<>::~unique_ptr()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c9201644        848  rrllvm::ModelResources::~ModelResources()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c9197ddc         48  std::_Sp_counted_ptr<>::_M_dispose()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c909414d        128  std::_Sp_counted_base<>::_M_release()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c908a91b         32  std::__shared_count<>::~__shared_count()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c9191710         32  std::__shared_ptr<>::~__shared_ptr()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c919172c         32  std::shared_ptr<>::~shared_ptr()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c918348b        432  rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c91834ee         32  rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c91071f5         32  std::default_delete<>::operator()()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c9103660         64  std::unique_ptr<>::~unique_ptr()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c91019a9        448  rr::RoadRunnerImpl::~RoadRunnerImpl()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c90d1142         48  rr::RoadRunner::~RoadRunner()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c90d116e         32  rr::RoadRunner::~RoadRunner()
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c9037faa        112  _wrap_delete_RoadRunner
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @     0x7fe0c90038fb        144  SwigPyObject_dealloc
(pid=2110313) [2022-02-15 12:42:20,691 E 2110313 2110313] logging.cc:313:     @           0x5d1e78  (unknown)  (unknown)
(pid=2110313) [2022-02-15 12:42:20,692 E 2110313 2110313] logging.cc:313:     @           0x90bf00  (unknown)  (unknown)
(pid=2110313) Fatal Python error: Segmentation fault
(pid=2110313) 
(pid=2110313) Stack (most recent call first):
(pid=2110313)   File "/home/mkoenig/git/sbmlsim/src/sbmlsim/simulator/simulation_ray.py", line 47 in set_model
(pid=2110313)   File "/home/mkoenig/envs/pkdb_models/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 451 in _resume_span
(pid=2110313)   File "/home/mkoenig/envs/pkdb_models/lib/python3.8/site-packages/ray/_private/function_manager.py", line 576 in actor_method_executor
(pid=2110313)   File "/home/mkoenig/envs/pkdb_models/lib/python3.8/site-packages/ray/worker.py", line 425 in main_loop
(pid=2110313)   File "/home/mkoenig/envs/pkdb_models/lib/python3.8/site-packages/ray/workers/default_worker.py", line 218 in <module>
(pid=2110308)     @     0x7fd86db9891b         32  std::__shared_count<>::~__shared_count()
(pid=2110308)     @     0x7fd86dc9f710         32  std::__shared_ptr<>::~__shared_ptr()
(pid=2110308)     @     0x7fd86dc9f72c         32  std::shared_ptr<>::~shared_ptr()
(pid=2110308)     @     0x7fd86dc9148b        432  rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
(pid=2110308)     @     0x7fd86dc914ee         32  rrllvm::LLVMExecutableModel::~LLVMExecutableModel()
(pid=2110308)     @     0x7fd86dc151f5         32  std::default_delete<>::operator()()
(pid=2110308)     @     0x7fd86dc11660         64  std::unique_ptr<>::~unique_ptr()
(pid=2110308)     @     0x7fd86dc0f9a9        448  rr::RoadRunnerImpl::~RoadRunnerImpl()
(pid=2110308)     @     0x7fd86dbdf142         48  rr::RoadRunner::~RoadRunner()
(pid=2110308)     @     0x7fd86dbdf16e         32  rr::RoadRunner::~RoadRunner()
(pid=2110308)     @     0x7fd86db45faa        112  _wrap_delete_RoadRunner
(pid=2110308)     @     0x7fd86db118fb        144  SwigPyObject_dealloc
(pid=2110308)     @           0x5d1e78  (unknown)  (unknown)
(pid=2110308)     @           0x90bf00  (unknown)  (unknown)

hsauro commented 2 years ago

Ciaran, I think we need to remove the multiprocessing support until the code has been tested more.

hsauro commented 2 years ago

See latest issue from Matthias, seems latest develop is more stable.

luciansmith commented 2 years ago

I doubt the crash has anything to do with ray, as it's not used internally at all, and is not even installed with the distribution; it's just used for internal testing. (i.e. it's not in requirements.txt; it's just in test-requirements.txt) However, since we're not using it for much, we might as well remove it; it's just that I don't think it'll help the crashes.

I think instead the problem is likely to be that Arch Linux is incompatible with manylinux2014. This would explain why it refuses to install, and why there are mysterious crashes when you try anyway. Given that the various flavors of Linux have long been binary incompatible, this isn't much of a surprise; the real surprise is that 'old CentOS' was as cross-compatible as it was for so long. It certainly used to be the case that if you used Linux, you were going to have to compile your own programs; we may be returning to those days.

I'll see if we can set up a build for Arch Linux directly and do some research to find out if there's a new cross-compatible Linux format we can switch to in general.

CiaranWelsh commented 2 years ago

Out of interest, do we still have the same problems with the LLJit compiler?

from roadrunner import Config
Config.setValue(Config.LLVM_BACKEND, Config.LLJIT)

CiaranWelsh commented 2 years ago

If this is an Arch Linux problem, we should consider adding a docker image for this platform in azure.

matthiaskoenig commented 2 years ago

Some feedback:

matthiaskoenig commented 2 years ago

The LLJit compiler did not work, just gave other segmentation faults.

Still trying to create a minimal example, but all small test cases work. So it could be a problem with how I handle the state and bring it to the other cores (some race conditions or similar issues). It could be that the speedup in model loading on the SBML side caused the bug to show up only now on my side (e.g. the SBML loading was so slow that there was never a problem before, but now it is so fast that some race conditions create an issue).

CiaranWelsh commented 2 years ago

Sorry you are having these issues Matthias. We released because all of our tests passed, though admittedly we should not have jumped the gun and skipped the experimental release. I'll wait on the MWE and then will take a look at this issue (guessing before we've seen this won't be all that fruitful). Best,

Ciaran

luciansmith commented 2 years ago

Would it be worth attempting to compile roadrunner on your machine directly? I believe llvm13 is available via apt, and everything else is available via libroadrunner-deps. If it doesn't solve the problem, that's at least one possible cause we could eliminate. I think I would only try this if it proved particularly difficult to come up with an example, since that example failing on a different machine would also prove the same thing.

matthiaskoenig commented 2 years ago

I could reproduce the issue on my laptop, i.e. a different computer. So this is definitely a real bug.

I am working on narrowing things down. The segmentation faults seem to be due to model loading via the state on the multiple cores. I think the underlying cause is that the model loading from SBML is now much faster (without reading the SBML multiple times), so that probably some race conditions appear (such as files not written/closed yet).

Nothing to do for now, I have the feeling this could be on my side, in how I distribute the state across multiple cores.
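
For readers following the "files not written/closed yet" hypothesis: a minimal sketch of one common way to rule it out, assuming the saved state is handed to workers as a file on disk. The helper name `write_state_atomically` is hypothetical, and the thread does not show how the workflow actually distributes the state. The data is written to a temporary file in the same directory and then renamed into place, so a reader sees either the old file or the complete new one.

```python
import os
import tempfile


def write_state_atomically(data: bytes, path: str) -> None:
    """Write `data` to `path` so that readers never observe a partial file.

    Hypothetical helper for illustration; not part of roadrunner or sbmlsim.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so that os.replace
    # stays on one filesystem and can be an atomic rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap the complete file into place
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If crashes persist even with writes like this, a half-written state file is unlikely to be the cause.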

luciansmith commented 2 years ago

Well, definitely keep us in the loop! We're more than happy to fix anything on our end if it turns out to be us (or even if we can make things easier on you with some sort of change in the new code).

CiaranWelsh commented 2 years ago

I think the problem will be fixed by putting another lock in somewhere around some critical code. I suspect there is a data race somewhere… These are hard to find except with thread sanitizer.
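
To illustrate the idea of guarding critical code with a lock, here is a small Python-side sketch. The names `_critical_lock` and `run_serialized` are hypothetical; the lock discussed above would live in roadrunner's C++ code, and a `threading.Lock` only serializes threads within one process, not separate worker processes such as ray actors.

```python
import threading

# Hypothetical module-level lock; every caller must go through it
# for the serialization to have any effect.
_critical_lock = threading.Lock()


def run_serialized(fn, *args, **kwargs):
    """Call `fn` while holding the lock, so only one thread runs it at a time."""
    with _critical_lock:
        return fn(*args, **kwargs)
```

As a diagnostic, wrapping model creation and destruction this way and checking whether the crashes disappear would hint at a data race in those code paths.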

hsauro commented 2 years ago

Ciaran can you work on this with Matthias?

CiaranWelsh commented 2 years ago

Yep, absolutely. @matthiaskoenig do you have some code that we can run to try to reproduce the problem? Don’t worry if it’s not a polished “minimum working example” or anything - at this point we just need to see what kind of situation causes the problem you’re seeing (even if it’s on your end and not ours). Thanks
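
When hunting for a reproducer, it can help to run each candidate script in its own interpreter so that a segmentation fault kills only the child process. The sketch below uses only the Python standard library; `run_isolated` and `died_from_segfault` are hypothetical helper names, and the signal check assumes a POSIX system, where a negative return code means the child was killed by that signal.

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run `code` in a fresh interpreter with faulthandler enabled.

    With faulthandler on, a crashing child prints a Python traceback
    to stderr before it dies.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def died_from_segfault(result: subprocess.CompletedProcess) -> bool:
    """True if the child was terminated by SIGSEGV (POSIX convention)."""
    return result.returncode == -signal.SIGSEGV
```

Looping `run_isolated` over variations of a workflow (model size, number of workers) and recording which ones satisfy `died_from_segfault` is one way to narrow down a failing configuration.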

luciansmith commented 10 months ago

@matthiaskoenig : Did this issue get solved?

matthiaskoenig commented 10 months ago

No, this is still a major issue for us, so we have pinned roadrunner to 2.1.3 for many workflows. We wanted to completely redesign our workflows around ray to see if this can get fixed. This happens when we run larger real-world workflows with larger models, but we could not get a minimal example for this so far. I think this is probably some race condition when we bring large models onto many cores, and we just don't see this with a few cores and small examples. I will have a look again next week.
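
A small sketch of how a workflow could verify such a pin at startup, using only the standard library. The package name `libroadrunner` and the version `2.1.3` are taken from this thread; the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

REQUIRED_VERSION = "2.1.3"  # version reported as working in this thread


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(package: str = "libroadrunner", required: str = REQUIRED_VERSION) -> bool:
    """True only if `package` is installed at exactly the pinned version."""
    return installed_version(package) == required
```

The pin itself would normally live in the environment definition (for example `libroadrunner==2.1.3` in a requirements file); a check like this only guards against running in the wrong environment.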

matthiaskoenig commented 5 months ago

Duplicate of #1210, which creates a minimal example for the issue.