luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License
301 stars 37 forks source link

Polyclone mode and mono-nucleotide repeats #153

Closed bredelings closed 3 years ago

bredelings commented 3 years ago

Describe the bug

I was interested in using polyclone mode to determine the true number of alleles (and their frequency) in malaria samples.

However, it seems that octopus sometimes finds too many alleles because illumina has problems with mono-nucleotide repeats. For example, in samples that are not actually mixtures, I see calls like:

LT635612        183312  .       T       TA,*    1177.5  AF      AC=1,1;AN=4;DP=400;MQ=59;NS=1   GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      2|0|1|0:1177:400:59:183312:100:48,120.87,131.548,169.223:0.1,0.26,0.28,0.36:1315,233,77:AF
LT635612        183312  .       TA      T,*     1177.5  AF      AC=1,1;AN=4;DP=402;MQ=59;NS=1   GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      2|1|0|0:1177:402:59:183312:100:48,120.87,131.548,169.223:0.1,0.26,0.28,0.36:1173,304,77:AF
LT635612        183312  .       TAA     T,*     1177.5  AF      AC=1,1;AN=4;DP=402;MQ=59;NS=1   GT:GQ:DP:MQ:PS:PQ:HPC:MAP_HF:AD:FT      1|2|0|0:1177:402:59:183312:100:48,120.87,131.548,169.223:0.1,0.26,0.28,0.36

Here the reference has 14 A's beginning at 183312.

I am curious if it would be possible for octopus to implement a second pass that attempts to merge haplotypes that differ only because of the number of mononucleotide repeats.

So far I have limited the number of haplotypes to 2, on the theory that having more than 2 alleles is unlikely. This avoids inferring 3 or 4 haplotypes, but allows inferring 2 when the real number is 1.

I suppose it would also be possible to implement a post-processing past outside of octopus that attempts to model the illumina error process for mononucleotide repeats, and attempt to merge haplotypes.

Version

This file was made a while ago, with version=0.7.0_develop_b83ce113

Command line to run octopus:

$ octopus -I ERR2679006.bam -R /home/br51/malaria/reference/plasmo-combined.fasta -T LT635612 -o chr1.poly4.vcf.gz --bamout chr1.poly4.bam --bamout-type FULL --fi
lter-expression QUAL < 10 | MQ < 10 | MP < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1 | AD < 1 --annotations AD -C polyclone --max-clones 4 --threads 10

chr1.poly2.vcf.gz chr1.poly3.vcf.gz chr1.poly4.vcf.gz

dancooke commented 3 years ago

You can do a second pass using only passing variants from the first pass, like:

$ octopus -C polyclone -R ref.fa -I reads.bam -o pass1.vcf.gz
$ octopus -C polyclone -R ref.fa -I reads.bam --disable-denovo-variant-discovery --source-candidates pass1.vcf.gz -o pass2.vcf.gz

Also, are you using a PCR-based library prep? If so you should specify the sequence error model, e.g., --sequence-error-model PCR (in both passes). This should reduce the number of false positive homopolymer indel errors.

Additionally, you might want to try random forest filtering using the pre-trained germline forest. This hasn't been trained on microbial data but it may still be better than your hard-coded filter.

bredelings commented 3 years ago

I just managed to get around to trying --sequence-error-model=pcr last weekend. It seems like it basically fixed the problems -- very nice! Now, if I use a maximum of 4 clones (which might be overkill) it is still possible to get large regions with one 1 haplotype in cases where is only 1 strain. For the cases where there are two or more strains I do indeed see two haplotypes.

As you say, Illumina isn't supposed to have issues with the lengths of mononucleotide repeats -- I guess I was confusing that with another sequencing technology like Ion Torrent or something.