lh3 / dipcall

Reference-based variant calling pipeline for a pair of phased haplotype assemblies
MIT License
98 stars 10 forks source link

Missed diploid regions #16

Open zeeev opened 1 month ago

zeeev commented 1 month ago

Hi @lh3,

I'm building an assembly-based small variant truth set for NA12878, leveraging assemblies from across the CEPH pedigree and validating inheritance patterns in eight third-generation family members. The method identifies diploid regions consistently across the family, and I'm using Dipcall to generate variants within these high-confidence regions. However, a few issues have come up:

Missed heterozygous regions: Dipcall seems to miss clearly heterozygous regions in some diploid areas (see screenshot). This pattern leads to a ~10-fold increase in false-negative SNV calls compared to HiFi read-based variant calling (see attached xlsx). Is there a way to enable forced calling on all covered positions?

Alignment parameters: Are there recommended alignment settings to prevent splitting reads across small clusters of variants? Currently, I’m using the default parameters (see screenshot for example).

Suggestions for optimal variant calls: Any additional recommendations for improving small variant calling quality with Dipcall would be greatly appreciated.

Thanks very much for your insights!

--Zev

benchmark-against-giab.xlsx

Screenshot 2024-10-24 at 9 04 34 PM Screenshot 2024-10-25 at 9 16 21 AM
lh3 commented 1 month ago

Are these alignments generated by dipcall?

zeeev commented 1 month ago

The IGV alignments are based on:

command:minimap2 --secondary=no --eqx -a -t 2 -x asm5 version:2.21-r1071

zeeev commented 4 weeks ago

I've swapped out the alignments for those produced by dipcall, same results.

You can see we are missing around 10-fold more SNVs when comparing to DeepVariant against GIAB.

Screenshot 2024-10-28 at 11 14 34 AM Screenshot 2024-10-28 at 11 15 59 AM
lh3 commented 4 weeks ago

Ok, first of all, dipcall is developed for creating truth variants along with confident regions. It tries to be conservative such that calls in confident regions are of high quality. It is not intended for making the best variant calls without confident regions – that is less useful in practice.

For dense SNPs, you can align with -x asm20, but you will get more false SNPs.

The gap in the confident regions in the last screenshot is probably caused by another shorter alignment filtered in the bam. You can check whether there is an alignment in the PAF file outputted by dipcall (don't do alignment with your parameters). If there is one, you can tune dipcall but again, you will get more false SNPs as a result. PS: also check if the region is covered in prefix.hap1.bed and prefix.hap2.bed.

zeeev commented 4 weeks ago

Thanks for the explanation, @lh3,

One thing to consider: we have diploid coverage across the 8 children (asm5) and both parents, so it might be appropriate to consider relaxing the diploid criteria you're using.

Your point about specificity and truth variants is well received; however, in practice, the caller is being used more generally (as a general variation caller), so a more flexible approach might better support the use case.