google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

As bed lengthened, SNP performed better, but indel on the contrary #616

Closed zxy1555847 closed 1 year ago

zxy1555847 commented 1 year ago

Hello! I ran the raw data for NA12878, downloaded from NCBI SRA, and found that its capture kit is Agilent_V5. First, I aligned the reads with the OQFE protocol and used the output CRAM as the input to DeepVariant. I ran DeepVariant in WES mode three times: the first run had no --regions parameter, the second used the custom target regions in BED format with a 50 bp buffer added on each side, and the third used a 100 bp buffer. Next, I obtained the GIAB truth benchmark variant calls and their confident call regions and ran hap.py. The overall results are very good, but one detail does not make sense to me: as the BED regions lengthened, SNP performance got better and better, while INDEL performance got worse, since the number is decreasing. I would expect the number to increase as the BED regions get longer, just as it does for SNPs. This is shown in the attached figure. Can you give me a detailed explanation of this? Thank you very much! Finally, thank you for developing such a great tool!
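For readers who want to reproduce the 50 bp and 100 bp buffers described above, here is a minimal sketch of padding BED intervals. It is an illustration only, not the procedure used in this issue: the function name pad_bed_intervals, the optional chrom_lengths argument, and the example coordinates are all hypothetical. It assumes standard BED coordinates (0-based start, exclusive end) and merges intervals that overlap after padding.

```python
from collections import defaultdict


def pad_bed_intervals(intervals, pad, chrom_lengths=None):
    """Pad (chrom, start, end) BED intervals by `pad` bp on each side.

    Starts are clamped at 0. Ends are clamped only if `chrom_lengths`
    (a hypothetical dict of chromosome name -> length) is provided.
    Intervals that overlap after padding are merged.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        new_start = max(0, start - pad)
        new_end = end + pad
        if chrom_lengths and chrom in chrom_lengths:
            new_end = min(new_end, chrom_lengths[chrom])
        by_chrom[chrom].append((new_start, new_end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlapping or touching: extend the current interval.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Made-up example targets, not real capture coordinates.
targets = [("chr1", 100, 200), ("chr1", 260, 300), ("chr2", 20, 80)]
padded_50 = pad_bed_intervals(targets, 50)
padded_10 = pad_bed_intervals(targets, 10)
```

In practice an existing tool such as bedtools slop can do the same padding; the sketch only makes the operation explicit.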

MariaNattestad commented 1 year ago

You're very welcome :) We can speculate, but it might just be random things happening at the edges of the regions, given the low number of errors overall. If you look at how the number of errors differs between these runs, it's something like 26 vs 24 false positives for indels. I wouldn't draw any conclusions about trends from such small numbers. If you're curious, you could inspect them in IGV, though. I don't think we have any more insight into this from the DeepVariant side than you do.
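As a rough illustration of why 26 vs 24 is too small a difference to read as a trend, the sketch below runs an exact two-sided binomial test. It rests on a simplifying assumption that is not from this thread: the two counts are treated as independent, so that under "no real difference" each of the 50 errors is equally likely to fall in either run. The runs here share the same reads, so this is only a back-of-the-envelope check. The function name two_sided_binomial_p is hypothetical.

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Exact two-sided p-value for k successes in n trials with p = 0.5.

    Sums the probability of every outcome that is no more likely
    than the observed one.
    """
    probs = [comb(n, i) / 2**n for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so ties with the observed probability are included.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


# Indel false-positive counts quoted above.
fp_a, fp_b = 26, 24
p_value = two_sided_binomial_p(fp_a, fp_a + fp_b)
print(f"two-sided p-value: {p_value:.3f}")
```

A p-value far above conventional thresholds such as 0.05 means the counts are compatible with no difference between the runs.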

zxy1555847 commented 1 year ago

Thanks for your response! If it's a random event, I think I can accept it.

AndrewCarroll commented 1 year ago

Hi @zxy1555847

@MariaNattestad is correct that there are only a small number of errors, so it is hard to definitively tell you what is going on.

However, one thing I want to point out is that, in general for exome sequencing, we expect accuracy to start dropping outside of the capture ranges for all analysis methods, and to drop further the farther we go from the capture. We also expect Indels to be affected more than SNPs.

The reason for this is that sequence coverage begins to drop toward the boundaries of the capture (the size of this drop depends on the particular capture and the sequence context around it, but on average it will be the case). In general, lower coverage means lower accuracy, but we observe that coverage has a larger effect on Indels than on SNPs (this is detailed in our Extensive sequence dataset paper). The reasons for that are complex (though if you want me to elaborate further, I can try).

In short, Indel accuracy dropping outside of capture regions is expected to some extent, and this is a function of the underlying sequencing method as opposed to the analysis method.

zxy1555847 commented 1 year ago

Thank you very much for your kind and detailed reply. I now fully understand the nature of the problem.