Closed sivico26 closed 3 years ago
Hi, I think the problem is that your data consists of 4 samples:
[2021-06-25 23:24:08] <INFO> Detected 4 samples: "Glu1" "Glu1u" "Glu2" "Glu2u"
but you're explicitly invoking the individual
calling model and not specifying which sample to call. You should either specify which sample to call with the --samples
option, or use the population
calling model if you want to call all 4 samples jointly.
All right, I tested both options:
Individual
First, I added the samples to samples.txt
, which looked like this:
Glu1
Glu1u
Glu2
Glu2u
And modified the command line to this:
octopus --threads $threads -R $ref -I $mapped_reads -s $folder/samples.txt --bamout $folder/output/realigned_mapped.bam --data-profile variation.prof -C individual -z 0.018 --forest-model $forest -o $vars
This specification ended up like the first one. I monitored it and took around 5 min to find all the errors and finally stalled. It is a zombie since then (workload at ~0 but the different processes are still running). Here is the log:
Population I changed to the population model and modified the command line to:
octopus --threads $threads -R $ref -I $mapped_reads -s $samples --bamout $folder/output/realigned_mapped.bam --data-profile variation.prof -C population -z 0.018 --forest-model $forest -o $vars
This one is running without problems so far (1.5 hours at the time of writing) and is at 58% of completion. There are no problems of the sort "_Map_base::at" like on the other runs.
Thoughts I am glad we could make it run with the population model. However, in my mind, it is not the appropriate one. These samples are different runs of the same individual. The differences between samples should be biases or errors associated with the different libraries. I am worried that these biases can be considered natural variations of "different" individuals in the population model. In this sense, I see the individual model as more strict and more appropriate.
Is there a way to trick Octopus to use all the samples with the individual model? I thought of putting the same tag to all the libraries as an option, but I fear that missing this information could result in Octopus seeing the biases as natural variation as I described for the population model.
What do you think? Is my understanding of the different models and how they use the samples' information correct or I am missing something?
What are you trying to achieve with the replicates - are you actually interested in identifying differences between libraries or are you just looking to increase calling power by adding more data? If the latter, then you should use the same SM
SAM tag for each replicate (the RG
ID
s can still be unique). Note that Octopus merges read-groups with the same SM
tag, and currently, all read groups are treated identically.
Hi again,
Indeed, the latter is the case I am interested in. Changing the SM
tags made the trick, and Octopus ran perfectly. Thank you for your help.
As final remarks, I think it might be useful to mention the SM
tag on the documentation and to see why Octopus was stalling silently instead of crashing after reporting the errors.
Keep the good work. Regards, Sivico
Hi there,
Describe the bug I am running octopus in a small genome (43 Mb) with a fairly small amount of data (details below). I think Octopus has stalled after some errors of the like
Encountered a problem whilst calling JH711484.1:2037663-2641344(_Map_base::at)
(see log for full details). A clear and concise description of what the bug is.Version
Command Command-line to install octopus:
Command line to run octopus:
Where:
$threads
-> 32.$ref
-> Strain G.260125-1 of the fungus Ganoderma lucidum.$mapped_reads
-> a 1.1Gb sorted bam aligned to reference with BWA with a couple of libraries of Illumina ISeq.$forest
-> germline.v0.7.4.forest from GoogleCloud.Additional context Octopus has been running one day and 18 hours by now. I do not expect Octopus to take this long for such a small dataset. Looking at the log most of the errors occurred early on the execution. When I enter the cluster the workload is barely above 0, which means that octopus is not computing.
What do you think is happening? Is there a way I can avoid the issue?
Thanks a lot for your thoughts, Regards
result_compute-0-10_1635127.err.txt