luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
302 stars 38 forks source link

Possible stalling after "_Map_base::at" errors #193

Closed sivico26 closed 3 years ago

sivico26 commented 3 years ago

Hi there,

Describe the bug I am running octopus in a small genome (43 Mb) with a fairly small amount of data (details below). I think Octopus has stalled after some errors of the like Encountered a problem whilst calling JH711484.1:2037663-2641344(_Map_base::at) (see log for full details). A clear and concise description of what the bug is.

Version

$ octopus --version
octopus version 0.7.4
Target: x86_64 Linux 5.4.0-72-generic
SIMD extension: AVX2
Compiler: GNU 9.3.0
Boost: 1_74

$ conda list
octopus                   0.7.4                h40a150c_0    bioconda

Command Command-line to install octopus:

$ conda install -n Octopus -c bioconda octopus samtools bcftools bwa

Command line to run octopus:

$ octopus --threads $threads -R $ref -I $mapped_reads --bamout $folder/output/realigned_mapped.bam --data-profile variation.prof -C individual -z 0.018 --forest-model $forest -o $vars

Where:

Additional context Octopus has been running one day and 18 hours by now. I do not expect Octopus to take this long for such a small dataset. Looking at the log most of the errors occurred early on the execution. When I enter the cluster the workload is barely above 0, which means that octopus is not computing.

What do you think is happening? Is there a way I can avoid the issue?

Thanks a lot for your thoughts, Regards

result_compute-0-10_1635127.err.txt

dancooke commented 3 years ago

Hi, I think the problem is that your data consists of 4 samples:

[2021-06-25 23:24:08] <INFO> Detected 4 samples: "Glu1" "Glu1u" "Glu2" "Glu2u"

but you're explicitly invoking the individual calling model and not specifying which sample to call. You should either specify which sample to call with the --samples option, or use the population calling model if you want to call all 4 samples jointly.

sivico26 commented 3 years ago

All right, I tested both options:

Individual

First, I added the samples to samples.txt, which looked like this:

Glu1
Glu1u
Glu2
Glu2u

And modified the command line to this:

octopus --threads $threads -R $ref -I $mapped_reads -s $folder/samples.txt --bamout $folder/output/realigned_mapped.bam --data-profile variation.prof -C individual -z 0.018 --forest-model $forest -o $vars

This specification ended up like the first one. I monitored it and took around 5 min to find all the errors and finally stalled. It is a zombie since then (workload at ~0 but the different processes are still running). Here is the log:

log_indiv_w_samples.txt

Population I changed to the population model and modified the command line to:

octopus --threads $threads -R $ref -I $mapped_reads -s $samples --bamout $folder/output/realigned_mapped.bam --data-profile variation.prof -C population -z 0.018 --forest-model $forest -o $vars

This one is running without problems so far (1.5 hours at the time of writing) and is at 58% of completion. There are no problems of the sort "_Map_base::at" like on the other runs.

Thoughts I am glad we could make it run with the population model. However, in my mind, it is not the appropriate one. These samples are different runs of the same individual. The differences between samples should be biases or errors associated with the different libraries. I am worried that these biases can be considered natural variations of "different" individuals in the population model. In this sense, I see the individual model as more strict and more appropriate.

Is there a way to trick Octopus to use all the samples with the individual model? I thought of putting the same tag to all the libraries as an option, but I fear that missing this information could result in Octopus seeing the biases as natural variation as I described for the population model.

What do you think? Is my understanding of the different models and how they use the samples' information correct or I am missing something?

dancooke commented 3 years ago

What are you trying to achieve with the replicates - are you actually interested in identifying differences between libraries or are you just looking to increase calling power by adding more data? If the latter, then you should use the same SM SAM tag for each replicate (the RG IDs can still be unique). Note that Octopus merges read-groups with the same SM tag, and currently, all read groups are treated identically.

sivico26 commented 3 years ago

Hi again,

Indeed, the latter is the case I am interested in. Changing the SM tags made the trick, and Octopus ran perfectly. Thank you for your help.

As final remarks, I think it might be useful to mention the SM tag on the documentation and to see why Octopus was stalling silently instead of crashing after reporting the errors.

Keep the good work. Regards, Sivico