luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
302 stars 38 forks source link

Octopus revision 6f1869d suffers from intermittent segmentation fault issue #93

Closed ahstram closed 4 years ago

ahstram commented 4 years ago

Hi there,

I am running Octopus revision 6f1869d with the following parameters:

--threads 16
-R /reference/env/dev/content/hs37d5/raw/hs37d5.fa 
-I tumor.bam normal.bam 
--bamout /path/to/bamout
--bamout-type FULL
--normal-sample normal_name 
-t /path/to/eval.bed
--fast
--working-directory /path/to/workdir
--annotations forest
--disable-call-filtering 
-o /path/to/out.vcf
--config /path/to/octopus.cfg

I am experiencing intermittent segmentation fault issues, which I've posted the gdb backtrace below:

#0  0x000000340533eaa3 in __memcmp_sse4_1 () from /lib64/libc.so.6
#1  0x00000000008822f8 in bool octopus::operator==<octopus::GenomicRegion>(octopus::BasicAllele<octopus::GenomicRegion> const&, octopus::BasicAllele<octopus::GenomicRegion> const&) ()
#2  0x0000000000823698 in octopus::VcfRecordFactory::make(std::vector<octopus::CallWrapper, std::allocator<octopus::CallWrapper> >&&) const ()
#3  0x000000000090474a in octopus::Caller::call(octopus::GenomicRegion const&, octopus::ProgressMeter&) const ()
#4  0x000000000063b6cb in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<octopus::(anonymous namespace)::CompletedTask>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<octopus::(anonymous namespace)::run(octopus::(anonymous namespace)::Task, octopus::ContigCallingComponents, octopus::(anonymous namespace)::CallerSyncPacket&)::{lambda()#1}> >, octopus::(anonymous namespace)::CompletedTask> >::_M_invoke(std::_Any_data const&) ()
#5  0x00000000008c87eb in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) ()
#6  0x000000340560ce03 in pthread_once () from /lib64/libpthread.so.0
#7  0x00000000005e3c2b in std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) [clone .constprop.1] ()
#8  0x000000000063643a in _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZNSt13__future_base17_Async_state_implINS1_IS2_IJZN7octopus12_GLOBAL__N_13runENS6_4TaskENS5_23ContigCallingComponentsERNS6_16CallerSyncPacketEEUlvE_EEEENS6_13CompletedTaskEEC4EOSD_EUlvE_EEEEE6_M_runEv.lto_priv.0 ()
#9  0x00002ac8fcd6dbd0 in execute_native_thread_routine () at ../../../../../gcc-9.1.0-src/libstdc++-v3/src/c++11/thread.cc:80
#10 0x0000003405607aa1 in start_thread () from /lib64/libpthread.so.0
#11 0x00000034052e8c4d in lseek64 () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()

After a segmentation fault, if I simply rerun the command which caused a seg-fault, Octopus will run successfully, which makes it hard to pinpoint this issue.

I can post the actual core dump somewhere if necessary, but it is ~9GB in size, so I'm hoping the above backtrace will suffice.

Is the above seg-fault a known issue? If so, has it been fixed in any revision since 6f1869d?

Since this has been an intermittent issue for me, I am consulting with you before doing a "brute force" approach to figuring out if the issue has been resolved in newer revisions. It is not immediately clear to me based on the git log and existing issue reports.

Thank you!

Alex

dancooke commented 4 years ago

Hi Alex,

Thanks for the bug report. It's difficult to say whether this issue has been fixed downstream - I've not made any modifications to the code that's failing since 6f1869dc79d45898ac96dedc9b730a804427162e, but it's possible that the real problem is in some upstream code that breaks a precondition of the failing code that has now been fixed.

Would you mind trying the latest develop version? It may be worth rebuilding with sanitizers (just add the --sanitize option to install.py, you'll get the binary octopus-sanitize instead of octopus).

Also, how did you install Octopus? These types of intermittent bugs can be a sign of compiler bugs (of which I have found numerous over the course of developing Octopus).

Cheers Dan

ahstram commented 4 years ago

@dancooke Dan,

Yes, sure thing, I've already deployed commit 8994904 to my development environment, so it's not an issue to test.

Earlier today I submitted 10 runs of octopus 6f1869d using the same input data, and 1 of them has already seg-faulted (none of the remaining 9 have finished running yet, so any of those 9 could also seg-fault), so I suspect this issue is prevalent enough that I can reasonably detect via brute-force methods.

I'll submit another 20 tests using octopus 8994904, which may be a high enough number to conclude with some level of confidence that the bug has been resolved.

Ah yes, compiler issues... Something I've been avoiding. I had a heck of a time getting Octopus 0.6.7-beta deployed on our archaic CentOs 6.9 system, so I had to do a bit of hacking...

here's what the heart of our current build.sh looks like:

find . -name 'CMakeLists.txt' -exec sed -i 's/-Werror//g' {} \;

cmake -DCMAKE_C_COMPILER=/path/to/gcc-9.1.0/bin/gcc \
    -DCMAKE_CXX_COMPILER=/path/to/bin/g++ \
    -DCMAKE_C_FLAGS="-I/path/to/gmp-6.1.2/include/" \
    -DCMAKE_CXX_FLAGS="-I/path/to/gmp-6.1.2/include/" \
    -DCMAKE_BUILD_TYPE=Release \
    -DGMPlib_ROOT=/path/to/gmp-6.1.2/ \
    -DBOOST_ROOT=/path/to/boost_1_70_0 \
    -DHTSLIB_ROOT=/path/to/htslib-1.9/

make || cd "$DIR/src" && $(< CMakeFiles/octopus.dir/link.txt \
    sed 's/libboost_iostreams.so/libboost_iostreams.a/g' \
    | sed 's/$/\/reference\/common\/apps\/zlib-1.2.8\/lib\/libz.a/')

mkdir -pv "$DIR/bin"
install -vm555 "$DIR/src/octopus" "$DIR/bin/octopus"

So, I'm making some modifications:

First, removing -Werror flags from the "add_compile_options" function in each CMakeLists.txt file, since I was running into "deprecated-copy" compile errors like:

/reference/common/apps/boost_1_70_0/include/boost/phoenix/core/actor.hpp: In copy constructor ‘boost::phoenix::actor<boost::log::v2_mt_posix::expressions::aux::unary_function_terminal<boost::log::v2_mt_posix::expressions::has_attribute<void> > >::actor(const boost::phoenix::actor<boost::log::v2_mt_posix::expressions::aux::unary_function_terminal<boost::log::v2_mt_posix::expressions::has_attribute<void> > >&)’:
/reference/common/apps/boost_1_70_0/include/boost/phoenix/core/actor.hpp:133:12: error: implicitly-declared ‘boost::proto::exprns_::expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<boost::log::v2_mt_posix::expressions::aux::unary_function_terminal<boost::log::v2_mt_posix::expressions::has_attribute<void> > >, 0>::expr(const boost::proto::exprns_::expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<boost::log::v2_mt_posix::expressions::aux::unary_function_terminal<boost::log::v2_mt_posix::expressions::has_attribute<void> > >, 0>&)’ is deprecated [-Werror=deprecated-copy]

Then, running cmake with a number of variables set.

Finally, I expect "make" to fail at the very last step due to some linking issues, so I run the linking command directly from CMakeFiles/octopus.dir/link.txt, with some modifications.

I'll let you know about the status of the 8994904 tests. I will go with my current build, to make sure the problem still exists, before trying to sanitize the build.

Please let me know if you think my non-standard build process may be causing issues. I have not had issues with the non-standard build process on previous revisions of Octopus.

Thanks,

Alex

dancooke commented 4 years ago

Your build process looks fine, although there are newer version of GCC and Boost available so that's one thing to check. The install.py installation script should work for CentOS 6 now. You'll need to make sure that you have Git 1.8+ installed somewhere and in your path before system Git (which is < 1.8 pre-installed on CentOS 6), plus Python 3 installed. Then hopefully you can just do:

$ ./scripts/install.py --install-dependencies

This will install all dependencies into octopus/build via Homebrew, which usually has the most up-to-date GCC and Boost available.

ahstram commented 4 years ago

Hi @dancooke,

Looks like the issue is also present in commit 8994904, here's a backtrace from one of my test runs with 8994904 that is identical to the backtraces from commit 6f1869d.

#0  0x0000003ea193eaa3 in __memcmp_sse4_1 () from /lib64/libc.so.6
#1  0x0000000000b8f458 in bool octopus::operator==<octopus::GenomicRegion>(octopus::BasicAllele<octopus::GenomicRegion> const&, octopus::BasicAllele<octopus::GenomicRegion> const&) ()
#2  0x000000000089b163 in octopus::VcfRecordFactory::make(std::vector<octopus::CallWrapper, std::allocator<octopus::CallWrapper> >&&) const ()
#3  0x0000000000cac962 in octopus::Caller::call(octopus::GenomicRegion const&, octopus::ProgressMeter&) const ()
#4  0x00000000006af49b in _ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultIN7octopus12_GLOBAL__N_113CompletedTaskEEES3_ENSt6thread8_InvokerISt5tupleIJZNS9_3runENS9_4TaskENS8_23ContigCallingComponentsERNS9_16CallerSyncPacketEEUlvE_EEEESA_EEE9_M_invokeERKSt9_Any_data ()
#5  0x0000000000c6dc8b in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()()>*, bool*) ()
#6  0x0000003ea1c0ce03 in pthread_once () from /lib64/libpthread.so.0
#7  0x000000000061ebcb in std::__future_base::_State_baseV2::_M_set_result ()
#8  0x00000000006a897a in _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZNSt13__future_base17_Async_state_implINS1_IS2_IJZN7octopus12_GLOBAL__N_13runENS6_4TaskENS5_23ContigCallingComponentsERNS6_16CallerSyncPacketEEUlvE_EEEENS6_13CompletedTaskEEC4EOSD_EUlvE_EEEEE6_M_runEv.lto_priv.0 ()
#9  0x00002b4153e1bbd0 in execute_native_thread_routine (__p=0x3de63490) at ../../../../../gcc-9.1.0-src/libstdc++-v3/src/c++11/thread.cc:80
#10 0x0000003ea1c07aa1 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003ea18e8c4d in lseek64 () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()

I will try getting commit 6f1869d compiled with the "install.py" script & retest, to ensure that this isn't a compiler-related issue.

I should have word on the re-compile test within a day or so.

Thanks,

Alex

ahstram commented 4 years ago

Hi @dancooke,

I recompiled 6f1869d using GCC 9.2.0 (latest version) and Boost 1.71.0 (latest version). I wasn't able to use the "install.py" script, but these are the commands I did use:

cmake .. -DCMAKE_C_COMPILER=/path/to/gcc-9.2.0/bin/gcc -DCMAKE_CXX_COMPILER=/path/to/gcc-9.2.0/bin/g++ -DCMAKE_C_FLAGS="-I/path/to/gmp-6.1.2/include/" -DCMAKE_CXX_FLAGS="-I/path/to/gmp-6.1.2/include/ -Wno-deprecated-copy -Wno-pessimizing-move" -DCMAKE_BUILD_TYPE=Release -DGMPlib_ROOT=/path/to/gmp-6.1.2/ -DBOOST_ROOT=/path/to/boost_1_71_0 -DHTSLIB_ROOT=/path/to/htslib-1.9/
make || $(< CMakeFiles/octopus.dir/link.txt \
 sed 's/libboost_iostreams.so/libboost_iostreams.a/g' \
 | sed 's/$/\/reference\/common\/apps\/zlib-1.2.8\/lib\/libz.a/')

The main difference, in addition to using newer versions of GCC and Boost, is that I'm now explicitly using "-Wno-deprecated-copy -Wno-pessimizing-move" to call out warnings to ignore, rather than editing the CMakeLists.txt file to remove "-Werror".

It does appear that this intermittent segfault issue is appearing much less frequently now, although it still does occur on occasion. Also, I am getting two different backtraces now:

This is a new backtrace message (4 occurrences over past 36 hours of testing):

#0  0x0000003622f3eaa0 in __memcmp_sse4_1 () from /lib64/libc.so.6
#1  0x0000000000823d8f in octopus::VcfRecordFactory::make(std::vector<octopus::CallWrapper, std::allocator<octopus::CallWrapper> >&&) const ()
#2  0x00000000009064aa in octopus::Caller::call(octopus::GenomicRegion const&, octopus::ProgressMeter&) const ()
#3  0x000000000063dd4b in _ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultIN7octopus12_GLOBAL__N_113CompletedTaskEEES3_ENSt6thread8_InvokerISt5tupleIJZNS9_3runENS9_4TaskENS8_23ContigCallingComponentsERNS9_16CallerSyncPacketEEUlvE_EEEESA_EEE9_M_invokeERKSt9_Any_data ()
#4  0x00000000008c736b in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()()>*, bool*) ()
#5  0x000000362320ce03 in pthread_once () from /lib64/libpthread.so.0
#6  0x00000000005e527b in std::__future_base::_State_baseV2::_M_set_result ()
#7  0x0000000000638b0a in _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZNSt13__future_base17_Async_state_implINS1_IS2_IJZN7octopus12_GLOBAL__N_13runENS6_4TaskENS5_23ContigCallingComponentsERNS6_16CallerSyncPacketEEUlvE_EEEENS6_13CompletedTaskEEC4EOSD_EUlvE_EEEEE6_M_runEv.lto_priv.0 ()
#8  0x00002b344f346bd0 in execute_native_thread_routine (__p=0x691bd00) at ../../../../../gcc-9.1.0-src/libstdc++-v3/src/c++11/thread.cc:80
#9  0x0000003623207aa1 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003622ee8c4d in lseek64 () from /lib64/libc.so.6

And here's the backtrace as previously observed, which still does occur, but rarely (1 occurrence over past 36 hours of testing):

#0  0x0000003d71d3eaa3 in __memcmp_sse4_1 () from /lib64/libc.so.6
#1  0x0000000000883148 in bool octopus::operator==<octopus::GenomicRegion>(octopus::BasicAllele<octopus::GenomicRegion> const&, octopus::BasicAllele<octopus::GenomicRegion> const&) ()
#2  0x0000000000823d08 in octopus::VcfRecordFactory::make(std::vector<octopus::CallWrapper, std::allocator<octopus::CallWrapper> >&&) const ()
#3  0x00000000009064aa in octopus::Caller::call(octopus::GenomicRegion const&, octopus::ProgressMeter&) const ()
#4  0x000000000063dd4b in _ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultIN7octopus12_GLOBAL__N_113CompletedTaskEEES3_ENSt6thread8_InvokerISt5tupleIJZNS9_3runENS9_4TaskENS8_23ContigCallingComponentsERNS9_16CallerSyncPacketEEUlvE_EEEESA_EEE9_M_invokeERKSt9_Any_data ()
#5  0x00000000008c736b in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()()>*, bool*) ()
#6  0x0000003d7200ce03 in pthread_once () from /lib64/libpthread.so.0
#7  0x00000000005e527b in std::__future_base::_State_baseV2::_M_set_result ()
#8  0x0000000000638b0a in _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZNSt13__future_base17_Async_state_implINS1_IS2_IJZN7octopus12_GLOBAL__N_13runENS6_4TaskENS5_23ContigCallingComponentsERNS6_16CallerSyncPacketEEUlvE_EEEENS6_13CompletedTaskEEC4EOSD_EUlvE_EEEEE6_M_runEv.lto_priv.0 ()
#9  0x00002b21cf468bd0 in execute_native_thread_routine (__p=0x7aa3ee0) at ../../../../../gcc-9.1.0-src/libstdc++-v3/src/c++11/thread.cc:80
#10 0x0000003d72007aa1 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003d71ce8c4d in lseek64 () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()

So, both segfaults seem follow similar paths until:

octopus::VcfRecordFactory::make(std::vector<octopus::CallWrapper, std::allocator<octopus::CallWrapper> >&&) const ()

then, sometimes they have:

bool octopus::operator==<octopus::GenomicRegion>(octopus::BasicAllele<octopus::GenomicRegion> const&, octopus::BasicAllele<octopus::GenomicRegion> const&) ()

but most times they don't-- but both end with:

__memcmp_sse4_1 () from /lib64/libc.so.6

I agree with you that this seems compiler related, and possibly a libc issue.

Would you mind posting the exact version numbers you are using for the following?

My version of libc is 2.12

Thanks!

Alex

dancooke commented 4 years ago

Hi @ahstram, thanks for reporting back.

Looks like this is going to be a tricky one! Are you able to share input data that's triggering this bug? If I can reproduce on my end then it will be a lot easier to track the root cause down. If not, can you try rebuilding but using DCMAKE_BUILD_TYPE=RelWithDebInfo? If successful, you'll get the binary octopus-sanitize that you can use in place of octopus. This will hopefully give a much more detailed stacktrace. If this doesn't help then I'll probably need to create a new branch with some additional logging for you to try.

Thanks Dan

ahstram commented 4 years ago

@dancooke

I got a sanitized build of Octopus done using -DCMAKE_BUILD_TYPE=RelWithDebInfo, while also having to add -Wno-uninitialized to the CXX_FLAGs...

What version of GCC/Boost/libc are you using? It seems odd to me that I need to explicitly turn off some of these warnings.

I've launched another round of 50 tests using a dataset known to cause this issue, I'll report back to you with the stack trace if/when one fails. It may be a few days though, as I need to keep these tests behind some higher priority stuff in our compute queue.

It may be a bit tricky to get you an example dataset-- the current dataset I have is private. In the past, I've got around this by subsetting the data to a small enough region that we can release publicly. Unfortunately, this problem is intermittent enough that subsetting the problematic dataset to a smaller problematic dataset may be difficult to do.

Also, I'm not sure if this issue is just randomly occurring across all input datasets, or only intermittently occurs on certain datasets. I'll try to replicate the issue using a public data set, but that will come after I get you the sanitized stack trace.

Alex

ahstram commented 4 years ago

@dancooke Yikes, one of my test runs did get through on the queue, but failed out immediately with:

==193777==ERROR: AddressSanitizer failed to allocate 0xdfff0001000 (15392894357504) bytes at address 2008fff7000 (errno: 12)
==193777==ReserveShadowMemoryRange failed while trying to map 0xdfff0001000 bytes. Perhaps you're using ulimit -v

Doing a little searching, I found this: https://github.com/google/sanitizers/wiki/AddressSanitizer

Q: On Linux I am seeings a crash at startup with something like this ERROR: AddressSanitizer failed to allocate 0x400000000 (17179869184) bytes at address 67fff8000 (errno: 12) A: Make sure you don't have 2 in /proc/sys/vm/overcommit_memory

It looks like our HPC environment does have "2" set in /proc/sys/vm/overcommit_memory, so it may not be possible to run the octopus-sanitize binary for this round of testing.

dancooke commented 4 years ago

@ahstram

Ok thanks for trying.

I've created a new branch fix/issue-93 with some additional logging information. The new debug logging will output to stdout, so please can you pipe the output to a file. If you can also request a standard debug log --debug=[log filename] for each run then that would be helpful.

Dan

ahstram commented 4 years ago

@dancooke I appreciate your help very much on this matter, but I suspect that the best course of action is to just close the issue for now. I have not experienced the segmentation fault since November 25th, and it may have gone away for good when I fixed an LD_LIBRARY_PATH issue with my runtime environment.

Thanks for your efforts!