google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

output left aligned variant representation for indels #487

Closed edg1983 closed 3 years ago

edg1983 commented 3 years ago

Hello,

I'd like to suggest left-aligning indels in the DeepVariant output VCF / gVCF to avoid the issue described below.

I'm running DeepVariant v1.1.0 on a set of samples sequenced with Illumina 2x150 paired-end reads. My workflow currently calls variants with DV and then merges the individual gVCFs using GLnexus, as described in your best practices for multi-sample VCFs.

Inspecting the resulting cohort VCF, I noticed that indels in repetitive / homopolymer regions are not normalized to the leftmost position, which creates odd situations downstream. Essentially, the multi-sample VCF can contain 2 different variants that, when left-aligned downstream (for example with bcftools norm), become the same locus, producing duplicated variants with different genotypes. I didn't notice this issue with recent versions of GATK, so I suppose they left-align indels in the output VCF. See the example below:

These are 2 indel variants in my multi-sample VCF:

chr3    105259621       chr3_105259621_T_TTA    T       TTA
chr3    105259623       chr3_105259623_A_ATA    A       ATA

As you can see in the screenshot, the actual locus is a repetitive region with TA repeats, so the exact location of a TA insertion within the stretch cannot be known.

When I apply bcftools norm, it moves the second one to the leftmost position, making it identical to the first one (which is the expected behavior). So in the end I have 2 duplicated variants in my VCF, each with different genotypes:

chr3    105259621       chr3_105259621_T_TTA       T       TTA
chr3    105259621       chr3_105259623_A_ATA       T       TTA
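To illustrate why both records collapse to the same representation, here is a minimal sketch of left-alignment for a single biallelic indel. It is not the code used by bcftools norm or DeepVariant. The function name `left_align` and the toy reference `TOY_REF` are hypothetical, chosen only to mimic a T followed by TA repeats (the real chr3 sequence is not reproduced here).

```python
# Hypothetical toy reference: "GCT" followed by three TA repeats, then "CG".
# 1-based positions: G1 C2 T3 T4 A5 T6 A7 T8 A9 C10 G11
TOY_REF = "GCTTATATACG"


def left_align(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant (pos is 1-based).

    A sketch of the usual normalization idea: drop a shared trailing base,
    and when that would empty an allele, pull in one reference base on the
    left instead. Repeat until the trailing bases differ.
    """
    while ref[-1] == alt[-1]:
        if len(ref) > 1 and len(alt) > 1:
            # Both alleles stay non-empty: just trim the shared last base.
            ref, alt = ref[:-1], alt[:-1]
        elif pos > 1:
            # Trimming would empty an allele: shift one base to the left.
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref[:-1], base + alt[:-1]
        else:
            # Already at the start of the sequence; cannot shift further.
            break
    # Trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Analogues of the two records above, placed on the toy reference.
first = left_align(TOY_REF, 3, "T", "TTA")
second = left_align(TOY_REF, 5, "A", "ATA")
print(first, second)
```

On this toy sequence, both calls return the same position and alleles, which is the same collapse seen in the two records above.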

This situation causes trouble for downstream analysis and segregation, even though most of these variants can probably be discarded since they are likely artifacts. The problem affects few single-allele variants (just 51 out of 24054518 in my dataset), but it affects a lot of the multi-allelic ones.

If indels were left-aligned before output, I think this would solve the issue, and many multi-allelic variants would likely become single-allele. Is there any plan for this in the future?
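One way to count such collisions is to group normalized records by (CHROM, POS, REF, ALT) and report any key that appears more than once. The sketch below assumes the records have already been normalized and loaded as tuples. The helper name `find_duplicates` is hypothetical, and the two input rows are the normalized records from the example above.

```python
from collections import defaultdict

# (CHROM, POS, ID, REF, ALT) after normalization, taken from the example above.
records = [
    ("chr3", 105259621, "chr3_105259621_T_TTA", "T", "TTA"),
    ("chr3", 105259621, "chr3_105259623_A_ATA", "T", "TTA"),
]


def find_duplicates(records):
    """Return {(chrom, pos, ref, alt): [ids]} for keys seen more than once."""
    by_key = defaultdict(list)
    for chrom, pos, var_id, ref, alt in records:
        by_key[(chrom, pos, ref, alt)].append(var_id)
    return {key: ids for key, ids in by_key.items() if len(ids) > 1}


duplicates = find_duplicates(records)
for key, ids in duplicates.items():
    print(key, ids)
```

For a full cohort VCF, the same grouping could be applied while streaming records from a VCF parser instead of using an in-memory list.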

Thanks!

pichuan commented 3 years ago

Hi @edg1983, thanks for bringing up this issue! We have already been looking into this and have made a few internal fixes (done by @akolesnikov) that will be out in the next release. I'm closing this issue for now. Feel free to comment or reopen if you have more questions or suggestions.