Closed bredelings closed 3 years ago
You can do a second pass using only passing variants from the first pass, like:
$ octopus -C polyclone -R ref.fa -I reads.bam -o pass1.vcf.gz
$ octopus -C polyclone -R ref.fa -I reads.bam --disable-denovo-variant-discovery --source-candidates pass1.vcf.gz -o pass2.vcf.gz
Also, are you using a PCR-based library prep? If so you should specify the sequence error model, e.g., --sequence-error-model PCR
(in both passes). This should reduce the number of false positive homopolymer indel errors.
Additionally, you might want to try random forest filtering using the pre-trained germline forest. This hasn't been trained on microbial data but it may still be better than your hard-coded filter.
I just managed to get around to trying --sequence-error-model=pcr
last weekend. It seems like it basically fixed the problems -- very nice! Now, if I use a maximum of 4 clones (which might be overkill) it is still possible to get large regions with one 1 haplotype in cases where is only 1 strain. For the cases where there are two or more strains I do indeed see two haplotypes.
As you say, Illumina isn't supposed to have issues with the lengths of mononucleotide repeats -- I guess I was confusing that with another sequencing technology like Ion Torrent or something.
Describe the bug
I was interested in using polyclone mode to determine the true number of alleles (and their frequency) in malaria samples.
However, it seems that octopus sometimes finds too many alleles because illumina has problems with mono-nucleotide repeats. For example, in samples that are not actually mixtures, I see calls like:
Here the reference has 14 A's beginning at 183312.
I am curious if it would be possible for octopus to implement a second pass that attempts to merge haplotypes that differ only because of the number of mononucleotide repeats.
So far I have limited the number of haplotypes to 2, on the theory that having more than 2 alleles is unlikely. This avoids inferring 3 or 4 haplotypes, but allows inferring 2 when the real number is 1.
I suppose it would also be possible to implement a post-processing past outside of octopus that attempts to model the illumina error process for mononucleotide repeats, and attempt to merge haplotypes.
Version
Command line to run octopus:
chr1.poly2.vcf.gz chr1.poly3.vcf.gz chr1.poly4.vcf.gz