luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Cancer caller hangs waiting on a futex #204

Open lordkev opened 3 years ago

lordkev commented 3 years ago

Describe the bug

Cancer caller hangs in a reproducible manner. The caller is started with 24 threads. Slowly the number of active threads appears to dwindle as I can see the CPU usage go down. Eventually only one thread appears to be running as CPU usage is down to ~100% compared to ~2400% at the beginning. After some hours the one remaining thread appears to no longer be active and when connecting to the process via strace it's waiting on a futex.
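A stall like this can be checked per thread before reaching for strace. The following is a minimal sketch, assuming a Linux host with /proc mounted; `thread_states` is a hypothetical helper name, not part of octopus, and the `pgrep` lookup in the commented usage is only an illustration.

```shell
# Print one line per thread of a process: thread ID, a tab, then the kernel
# scheduler state (e.g. "R (running)" or "S (sleeping)").
# Linux-only: reads /proc/<pid>/task/<tid>/status.
thread_states() {
    pid="$1"
    for task in /proc/"$pid"/task/*; do
        [ -r "$task/status" ] || continue
        # The status file contains a line such as "State:  S (sleeping)".
        state=$(awk '/^State:/ { sub(/^State:[ \t]+/, ""); print }' "$task/status")
        printf '%s\t%s\n' "${task##*/}" "$state"
    done
}

# Hypothetical usage against the stalled process:
#   thread_states "$(pgrep -o octopus)"
# Attaching strace to all threads then shows which syscall each is blocked in:
#   strace -f -p "$(pgrep -o octopus)"
```

If every thread reports a sleeping state and strace shows only futex waits, the process is most likely blocked on a lock or condition variable rather than still computing.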

Version

$ octopus --version
octopus version 0.7.4 (ed012a6e)
Target: x86_64 Linux 5.4.0-73-generic
SIMD extension: AVX2
Compiler: GNU 11.2.0
Boost: 1_76

Command

Command line to install octopus:

$ git clone https://github.com/luntergroup/octopus.git
$ octopus/scripts/install.py --dependencies --forests

Command line to run octopus:

/bio/tools/octopus/bin/octopus-0.7.4 -R /bio/ref/hs37d5.fa -I tumor_chr19.bam normal_chr19.bam -o octopus-somatic-4hapmax.chr19.vcf.gz --normal-sample normal --bamout oct_0.7.4-bams-chr19 --forest /bio/tools/octopus/resources/forests/germline.v0.7.4.forest --somatic-forest /bio/tools/octopus/resources/forests/somatic.v0.7.4.forest --max-somatic-haplotypes 4 --normal-contamination-risk LOW --regions 19 --debug octopus_0.7.4_chr19.log --threads 24

Additional context

The tumor BAM in this case is a synthetic mixture of data that I created by subsetting reads from BAMs of two germline samples. However, all other chromosomes aside from chr19 complete without issue. I believe this might be related to the report by @jbedo at the end of #150, as they mention very similar behavior of waiting on a futex.

dancooke commented 3 years ago

Hi, thanks for the bug report. Please can you provide the output of stdlog?

lordkev commented 3 years ago

Sure, may be a bit as I'll have to run it again. I didn't notice much of interest that looked any different than the rest of the run.

lordkev commented 3 years ago

Hi Dan,

Here is the log file. By this point there was no CPU activity and it appeared to have just deadlocked again.

stdout.log

dancooke commented 3 years ago

Thanks - you're hitting an error/bug:

[2021-08-31 19:14:40] <EROR> Encountered a problem whilst calling 19:23213199-23954187(HaplotypeTree::prune_unique called with matching Haplotype not in tree)

There's also a lot of haplotype skipping, which suggests more general problems. Could you provide the BAMs?

lordkev commented 3 years ago

Ah, sorry I totally missed that - was looking at the bottom of the log. I just emailed you a GDrive link to the BAMs and further explanation.
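Since the error was not at the bottom of the log, searching for the level tag is a quick way to find it. This is a minimal sketch, assuming error lines carry the literal tag `<EROR>` as in the excerpt quoted above; `find_errors` is a hypothetical helper name.

```shell
# List every error-level line together with its line number, so the
# surrounding context can be located in a long log.
# -F treats the tag as a fixed string rather than a regular expression.
find_errors() {
    grep -n -F '<EROR>' "$1"
}

# Usage with the attached log:
#   find_errors stdout.log
```

grep exits with status 1 when nothing matches, so the helper can also be used as a check in a wrapper script.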

dancooke commented 3 years ago

Thanks for the BAMs - I'm working on replicating the bug. Could you give a bit more information on how these reads were generated? I'm seeing a lot of spliced reads in both the normal and tumour samples that appear like deletions (mostly 168bp). Here's an IGV pileup showing this, where I've also realigned reads from the normal to called haplotypes:

igv_realigned

This is the cause of the many skipped regions and probably why the bug is being triggered.
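One way to check how common these gapped alignments are is to tally long deletion and skip operations in the CIGAR strings. The sketch below is an assumption about how one might do this, not the method used to produce the pileup: `long_gap_lengths` is a hypothetical helper, the 50bp threshold is arbitrary, and the commented usage assumes an indexed BAM and reuses the region from the error message.

```shell
# Tally the lengths of long deletion (D) and skipped-region (N) CIGAR
# operations in SAM records read from stdin. Column 6 of a SAM record is
# the CIGAR string. Output: "<count> <length><op>", most frequent first.
long_gap_lengths() {
    min_len="${1:-50}"   # ignore gaps shorter than this many bases
    awk -F'\t' -v min="$min_len" '
        !/^@/ {
            cigar = $6
            # Walk the CIGAR one operation at a time, e.g. 50M, 168D, 50M.
            while (match(cigar, /[0-9]+[MIDNSHP=X]/)) {
                op = substr(cigar, RSTART + RLENGTH - 1, 1)
                n  = substr(cigar, RSTART, RLENGTH - 1) + 0
                if ((op == "D" || op == "N") && n >= min) count[n op]++
                cigar = substr(cigar, RSTART + RLENGTH)
            }
        }
        END { for (k in count) print count[k], k }
    ' | sort -k1,1nr
}

# Hypothetical usage:
#   samtools view normal_chr19.bam 19:23213199-23954187 | long_gap_lengths 50
```

A single length dominating the tally, such as 168D, would point to a systematic artefact in how the reads were prepared or aligned rather than to real variation.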

lordkev commented 3 years ago

Hmm, interesting. The normal sample is the NA12892 platinum genome and it was mapped using BWA-MEM. The only thing a bit out of the norm that I can think of is that read merging was enabled during adapter trimming, though only ~4% of reads were merged.

jbedo commented 2 years ago

Late to this thread but I am also running into this. I'm working on the devel branch atm, consistently have threads getting stuck, and am seeing error messages in my logs:

[2022-01-20 22:45:25] <EROR> Encountered a problem whilst calling chr5:1341945-1516474()