luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Polyclone: change default --organism-ploidy to 1 #108

Closed maciejmotyka closed 4 years ago

maciejmotyka commented 4 years ago

Request

The default value of --organism-ploidy is 2 for all callers. However, since

"The polyclone calling model is designed for calling variants in a mixed haploid sample"

it would make sense to use a default of 1 for this mode, or at least to include a reminder in the description of the polyclone mode.

Bonus question

How does --organism-ploidy affect the polyclone mode?

I expected it to be silently ignored, but I noticed some AFB filter flags in my VCF, so the filtering still seems to use it to check the allele frequencies:

##FILTER=<ID=AFB,Description="The called allele frequencies are not as expected for the given ploidy">

Are there any other consequences if it's left at the default value?

Version

$ octopus --version
octopus version 0.7.0 (develop 5bf28f9c)
Target: x86_64 Linux 5.3.0-29-generic
SIMD extension: AVX2
Compiler: GNU 9.2.0
Boost: 1_72
dancooke commented 4 years ago

The --organism-ploidy option is silently ignored for the polyclone calling model. You can, however, use --max-clones to set the maximum "ploidy" of the sample.

Filtering is another issue. The polyclone model uses the expression given by --filter-expression, which defaults to "QUAL < 10 | MQ < 10 | MP < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1". The default filter expressions don't change according to the calling model (although different calling models may use different filter expressions). So any ALT allele with an empirical frequency < 0.05 is filtered. You can of course set a different --filter-expression for threshold filtering, or even use a random forest filter if you have suitable training data.
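To make the behaviour of such a threshold expression concrete, here is a minimal Python sketch. It is not Octopus's implementation: the parser, the function names and the example measure values are all hypothetical, and it only assumes that the expression is a set of "measure < threshold" or "measure > threshold" clauses joined by "|", where a call is filtered if any clause is true.

```python
import operator
import re

# Default expression quoted in the comment above.
DEFAULT_EXPRESSION = (
    "QUAL < 10 | MQ < 10 | MP < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1"
)

_OPS = {"<": operator.lt, ">": operator.gt}
_CLAUSE = re.compile(r"^\s*(\w+)\s*([<>])\s*([0-9.]+)\s*$")


def parse_expression(expression):
    """Split an OR-joined expression into (measure, comparator, threshold) triples."""
    clauses = []
    for part in expression.split("|"):
        match = _CLAUSE.match(part)
        if match is None:
            raise ValueError(f"cannot parse clause: {part!r}")
        name, op, value = match.groups()
        clauses.append((name, _OPS[op], float(value)))
    return clauses


def failed_measures(measures, expression=DEFAULT_EXPRESSION):
    """Return the measures whose clause is true, i.e. the reasons a call is filtered."""
    return [
        name
        for name, compare, threshold in parse_expression(expression)
        if name in measures and compare(measures[name], threshold)
    ]


# Hypothetical measure values for one call; only AF is below its threshold.
call = {"QUAL": 250.0, "MQ": 60.0, "MP": 40.0, "AF": 0.03,
        "SB": 0.5, "BQ": 30.0, "DP": 120.0}
print(failed_measures(call))  # ['AF']
```

Note that nothing in this kind of check takes a ploidy as input: the AF clause compares the empirical frequency against a fixed threshold only.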

maciejmotyka commented 4 years ago

Thank you for the quick reply. I searched through the code and it is starting to make sense now. Please check if my understanding is correct:

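The AFB flags are produced by the AF filter and mean that the allele frequencies are < 0.05. The description in the header line

##FILTER=<ID=AFB,Description="The called allele frequencies are not as expected for the given ploidy">

suggests that it takes --organism-ploidy into account, but it doesn't, and thus changing that parameter will not affect the filtering.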

dancooke commented 4 years ago

You're right - the description is misleading. It would probably be better to have a new measure (e.g. AFB) that computes the deviation of the AF from the expected allele frequency given the ploidy.
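One possible reading of such a measure, sketched in Python purely for illustration: the function name af_deviation and the choice of expected frequencies are assumptions, not anything that exists in Octopus. The sketch assumes that a called ALT allele in a sample of ploidy p is expected at one of the frequencies k / p for k = 1..p, and reports the distance from the observed AF to the nearest of them.

```python
def af_deviation(allele_frequency, ploidy):
    """Distance from the observed AF to the nearest expected frequency k / ploidy.

    Hypothetical measure: assumes a called allele is carried by k of the
    `ploidy` haplotypes, with k between 1 and `ploidy`.
    """
    if ploidy < 1:
        raise ValueError("ploidy must be at least 1")
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    expected = [k / ploidy for k in range(1, ploidy + 1)]
    return min(abs(allele_frequency - e) for e in expected)


# The same observed frequency is close to expectation for a diploid sample
# but far from it for a haploid one.
print(round(af_deviation(0.45, 2), 2))  # 0.05
print(round(af_deviation(0.45, 1), 2))  # 0.55
```

Unlike a fixed AF < 0.05 threshold, a measure of this shape would change with the ploidy setting, which is what the current AFB description implies.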

maciejmotyka commented 4 years ago

I also think it might be useful. Thank you for clarifying everything. Closing.