Closed naumenko-sa closed 3 years ago
Hi @naumenko-sa ,
PEPPER-DeepVariant is designed for diploid genomes only or human genomes. The way it works: 1) PEPPER SNP finds SNP, it allows at most TWO mismatch bases to be picked up. And reports them as a set of candidate SNPs. 2) DeepVariant takes those SNPs and increases the precision to get a high-quality set of SNPs. It sees only bi-allelic variants. 3) WhatsHap takes the SNP set and phases the bam. 4) PEPPER HP finds candidates, again allowing at most two alleles per site expecting the sample to be diploid. 5) DeepVariant takes the phased BAM and the output from PEPPER HP to increase the precision of the variant call.
In this setting, unfortunately, it's not possible to have triploid or tetraploid variant calls. But what you can do is this: 1) Phase the BAM using WhatsHap polyploid command. 2) Split the BAM by the HP tag, add reads with HP-0 to all of the other ones as WhatsHap was unsure which bin they belong to. 3) Run PEPPER-DeepVariant on each HP bam separately and filter the VCF for 1/1 or 2/2 calls or homozygous-alt calls. 4) Merge all of the VCF together using bcftools concat and that should possibly give you a good answer.
In this scenario, please know that it's expected per haplotype to have at least 30-40x coverage. So a triploid genome, it would be ideal to have 90x-120x coverage. I can't say how well this will work, as we have not benchmarked it internally, but we usually apply the same logic for effectively haploid samples like CHM13 and see that it works.
Thanks for the explanation, @kishwarshafin !
Unfortunately, WhatsHap requires a multi-allelic VCF to do polyploid phasing. I was hoping to get that from PEPPER after I was not lucky with WhatsHap's internal caller. https://whatshap.readthedocs.io/en/latest/guide.html#polyploid-phasing
I see. Now that I think of this, I don't think it's that difficult to add tri and tetraploid SNP finding with PEPPER. Currently, I am looking for P(SNP_candidate| diploid_prediction) in PEPPER SNP, it has to simply change to P(SNP_candidate | base_prediction). However, my current engagements will be the limiting factor here. I can't say when we can get to this.
Thanks, please update this issue if you ever have a chance to push the multiallelic use case!
There are several potential applications of multiallelic variant calling:
@naumenko-sa ,
Thank you for pointing out the possible use-cases. I have created an internal issue with a mid-priority level. As this is a long-term goal, I'd close this issue. If we get to this, someone from our team will update this issue.
Hello!
Thanks for releasing and supporting pepper - such a useful tool!
I've run variant detection according to this manual for Plasmodium falciparum reference genome and ONT reads.
I know that there are 3 haplotypes in the sample, but the majority of variants from PEPPER were bi-allelic. For example, a clearly triallelic site with A/C/G supported by 32/148/82 reads, was called as bi-allelic: Pf3D7_08_v3 736271 . G C 38.4 PASS . GT:GQ:DP:AD:VAF:PL 0/1:34:245:72,143:0.583673:38,0,35
Tri-allelic calls were few, just 0.3% of PASS calls, and they were mostly low coverage variants like
Is there a flag to force pepper to call multiallelic variants?
Thanks! Sergey