lh3 / dipcall

Reference-based variant calling pipeline for a pair of phased haplotype assemblies
MIT License
96 stars 10 forks source link

Interpretation of N's in the VCF #14

Open mwykes opened 1 year ago

mwykes commented 1 year ago

I'm seeing N's in my dipcall v0.3 vcf where there are not Ns in the reference.

The variants lie within the high confidence bedfile.

Any idea what could be going wrong?

tabix prefix.dip.vcf.gz chr1:62784-64408
chr1    62784   .       G       A       30      .       .       GT:AD   1|1:0,2
chr1    62915   .       G       A       30      .       .       GT:AD   1|1:0,2
chr1    63003   .       C       T       30      .       .       GT:AD   1|1:0,2
chr1    63015   .       A       AAAT    30      .       .       GT:AD   1|1:0,2
chr1    64401   .       N       A       30      .       .       GT:AD   1|1:0,2
chr1    64402   .       N       C       30      .       .       GT:AD   1|1:0,2
chr1    64403   .       N       C       30      .       .       GT:AD   1|1:0,2
chr1    64404   .       N       A       30      .       .       GT:AD   1|1:0,2
chr1    64405   .       N       A       30      .       .       GT:AD   1|1:0,2
chr1    64406   .       N       T       30      .       .       GT:AD   1|1:0,2
chr1    64407   .       N       G       30      .       .       GT:AD   1|1:0,2
chr1    64408   .       N       C       30      .       .       GT:AD   1|1:0,2

samtools faidx ref.fa.gz chr1:62784-64408
>chr1:62784-64408
GACTGTTACAGGTTCCAGCAGGATAACTGGGTGGAAATGAGTTTGGTTTCACTTAGTCTC
TCTAAAGAGAAAGCAAGTTGGTAGACTAATACCTAATAAAAGCAAAGCTGCCAACAATTG
AAATTGCCTGGGCTGCTCTGTGTGTCCCACATGCATGGGTGTGGGTGCCAGTGTGTGTGC
GTGTGTGCATGCATGTGCATGTGTGTTGGGATAGAGTGGCAAGAAAATGGGAAATAAGAA
TGTTCAGTCCATAGCCCTTCATTATAAAAAGGTGAGCTGTAATAAATACTAGTGCCACAT
TTAGCCAAAACTTTACTCCAGCCAAAGGTGATATTTTCATGATAACATCCTGTGATTGCT
TTGTTCTTCGTCTTTTATGTTCTTCCTAGATGGGCTCAGAACATACAAGAATTAAGTACA
CATCTTATTTTCCAGTGATAATGCTACCGGCAAATTCTGTTGTTTGTATAAACATCAGCC
ATGTTTATATAACTAAACTAGTGTTTTGTTTTGTCAATTCAGCAAGAAATTAGACCACAT
GGTGGCTTAATGCTGCATTGATTTGGCTATCAATTTGTTTTCACTTTTCTGCAAAATATT
TAATACATTATTAAATTGAATTATGCTGATGCCACAGTTGTTCTTATCTCAATTGTCTTA
AAATTCATTTAATTTTTTTTTCCTTTGGTTTCATTATTCAAATTTTAACTTCAGTTCTCA
ACATTTTATCTGATGGAAGAGATGGAGTCCATTACTAAGGACTCCATTGTGCTCCATCAT
GCCAGAGTTGTAAAATAGATCTTTTAAAGGAAATTTACTGTGATTTTTTTCTATTTAAGA
GCTTCCTCTCCAGTTGAGCATGTAAGAAAATTATACCAGGAGAATACAGTAAACTCTATG
AGGCAAGCTATAAACATGTAGCATTGTGATTAGGGCTGGTTCTCCTTCTAGAGACATGGT
AGGATTGCAATTTCATACCATCCTTGAAGTTAGAGAGAGCCACGTGACTCATTTAGCCAA
TGAACTGTGAGCAGAATGACATGTCACTTCCAGCAGAAGCTTTAAGAATCTGAGAGACAT
TCATACGTTTTCCATGTGCTGTAGCCTTATACCCAAAGCCTGGGTCCCAAGTGACCATGA
CAGGCAGAGCTCCCTGTTGAGCCACAGAGATTTAGAGAATGGCTGTTAACACAGCATAAT
CCAGCCCATCCTGACTAATCTGATATTAACATGTATAATAAAGAATTCTATCAATGCTGA
GGGAAGATGATTAGTTAAGGTCCTAGGTTGCAAGTCTCAAAACCTCTTCTAAGGATTGTA
GACAGGAAATTAAATGACTTCTAGTCCCTAGAGTTCCCAATCTCCTACCATCCCATCCTA
ATATGACAGAAGTAATTCCTGAGTTGCTTCTGAAACCAGAGCTTCCCTCAGAACCCTTAG
CCTGCCAGATGGCTTCTTGGAGAGCCCTCACTCACTTTTCTCCTTCTGCTATTGCTGCTC
ATTCATTCCAGCTTTTAAAAATTCATCTTTATCCAGGAACCTCGCTTCTAGAAAAGTCAT
ACAGGTGCTTCCAGGAGGCTACATGGGCACCCATATTTTTCTAGCCACTTTCATTAGACC
AATGC

printf "chr1\t62784\t64408\n" | bedtools intersect -a prefix.dip.bed -wa -b -
chr1    2810    216558

Looking in IGV, the transition to the dense block of variants with ref Ns is not due to the start of a new alignment - and the variants with N as reference are not supported by the haplotype specific assembly to ref alignments.

igv_screenshot_dipcall

mwykes commented 1 year ago

Update on this - I just used an alternative tool to decode variants from prefix.hap1.bam to vcf and the resulting vcf looks fine (no Ns) and consistent with the alignments (see IGV screenshot). image

medaka tools consensus2vcf hap1.fa.gz ref.fa.gz --bam prefix.hap1.bam --out_prefix medaka_consensus2vcf.hap1

So it suppose this might point to an issue with htsbox pileup?