google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

Why no variants called in these regions? #233

Closed ydLiu-HIT closed 4 years ago

ydLiu-HIT commented 4 years ago

Hi, I was testing variant calling with DeepVariant (DV) using PacBio CCS long reads: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/PacBio_SequelII_CCS_11kb/HG002.SequelII.pbmm2.hs37d5.whatshap.haplotag.RTG.10x.trio.bam. I found an issue in the region 22:16,977,867-16,978,040 (hs37d5), shown in the following IGV screenshot (fig1).

In this region, DV calls SNPs at positions 16977891 (A->G) and 16977975 (A->G). Strangely, a set of variants appears together in the same subset of reads: 16977870 (A->T), 16977879 (7 bp insertion), 16977911 (G->A), 16977924 (T->C), 16977984 (A->G), and 16978027 (2 bp deletion). The read counts supporting the reference and variant alleles are 15:12. I also tested GATK4, and neither DV nor GATK4 reports these variants. Why doesn't DV call these positions as variants? And if they are not real variants, are there any tips for filtering such positions? Looking forward to your answer.

Best!

AndrewCarroll commented 4 years ago

Hi @ydLiu-HIT

Thank you for your question. The answer is complicated. It looks like the region containing these variants is a LINE element: a long sequence with multiple copies throughout the genome that have high sequence similarity to each other.

Because of this high sequence similarity, reads from one LINE element can map to other copies elsewhere in the genome, so these are generally very difficult regions to call correctly.

We've seen DeepVariant decline to call variants that are near other variants and in regions with two (or more) variant-rich haplotypes. We think one reason is that DeepVariant has learned that such regions often represent uncaptured segmental duplications and LINE elements, which are often labelled as non-variant in the more comprehensive Genome in a Bottle truth set.

Whether these positions represent true variants at this locus or sequence from a similar LINE element elsewhere is difficult to say. Since this is HG002, if the region is within the confident regions, you can check whether Genome in a Bottle indicates them to be true variants. However, Genome in a Bottle has made some more recent corrections to variants in and near LINE elements, so it may be better to check the updated (though still beta) truth set.

DeepVariant outputs every candidate it considers. So if you want to find positions that are handled this way, look for 0/0 or ./. calls with more than 35% ALT support that lie within 100 bp of two or more candidate variants. This may pull out many of these examples.
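As a rough illustration of that filter, here is a minimal Python sketch. It makes several assumptions that are not stated above: the VCF is single-sample and uncompressed, the FORMAT column includes `AD` (allele depths) from which ALT support is computed, and "candidate variants" is taken to mean any other record in the VCF on the same chromosome. The function names `parse_record` and `suspicious_calls` are hypothetical helpers, not part of DeepVariant.

```python
from collections import defaultdict


def parse_record(line):
    """Parse one single-sample VCF data line into a small dict.

    Assumes tab-separated columns with FORMAT in column 9 and the
    sample in column 10, and that FORMAT carries GT and AD.
    """
    fields = line.rstrip("\n").split("\t")
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    gt = sample.get("GT", "./.").replace("|", "/")
    # AD is "ref,alt1,alt2,..."; ignore it if missing or non-numeric.
    ad = [int(x) for x in sample.get("AD", "").split(",") if x.isdigit()]
    total = sum(ad)
    alt_frac = sum(ad[1:]) / total if total > 0 else 0.0
    return {"chrom": fields[0], "pos": int(fields[1]),
            "gt": gt, "alt_frac": alt_frac}


def suspicious_calls(vcf_lines, min_alt_frac=0.35, window=100,
                     min_neighbors=2):
    """Return (chrom, pos) for 0/0 or ./. records with high ALT support
    that sit within `window` bp of at least `min_neighbors` other records.
    """
    by_chrom = defaultdict(list)
    for line in vcf_lines:
        if line.strip() and not line.startswith("#"):
            rec = parse_record(line)
            by_chrom[rec["chrom"]].append(rec)

    hits = []
    for chrom, recs in by_chrom.items():
        for rec in recs:
            if rec["gt"] not in ("0/0", "./."):
                continue
            if rec["alt_frac"] <= min_alt_frac:
                continue
            # Quadratic scan per chromosome: fine for a small region,
            # too slow for a whole-genome VCF.
            neighbors = sum(
                1 for other in recs
                if other is not rec
                and abs(other["pos"] - rec["pos"]) <= window
            )
            if neighbors >= min_neighbors:
                hits.append((chrom, rec["pos"]))
    return hits
```

To use it, pass the lines of an open VCF file, for example `suspicious_calls(open("output.vcf"))`, where `output.vcf` is a placeholder path. The thresholds are exposed as parameters because 35% and 100 bp are starting points to tune, not fixed cutoffs.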

The other option to pull out examples like this would be to intersect with a LINE element annotation track from UCSC.
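A minimal Python sketch of that intersection follows. It assumes the LINE annotation has been exported from UCSC as a BED file (0-based, half-open intervals), and that chromosome names may differ between the two files: hs37d5 uses `22`, while UCSC tracks use `chr22`. The helpers `load_bed_intervals` and `in_line_element` are hypothetical names for illustration.

```python
from collections import defaultdict


def _norm(chrom):
    """Strip a leading 'chr' so '22' and 'chr22' compare equal."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def load_bed_intervals(bed_lines):
    """Read BED lines into {chrom: [(start, end), ...]}.

    BED coordinates are 0-based and half-open: [start, end).
    Header lines (track, browser, #) and blank lines are skipped.
    """
    intervals = defaultdict(list)
    for line in bed_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals[_norm(chrom)].append((int(start), int(end)))
    return intervals


def in_line_element(intervals, chrom, pos):
    """True if a 1-based VCF position falls inside any BED interval."""
    pos0 = pos - 1  # convert the 1-based VCF position to 0-based
    return any(start <= pos0 < end
               for start, end in intervals.get(_norm(chrom), []))
```

The linear scan is enough for checking a handful of positions. For a whole-genome comparison, a dedicated interval tool such as `bedtools intersect` is likely the better choice.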

Please let us know if anything in this answer is unclear. This is a rather complicated concept to explain.