google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

How to use DeepVariant for multisamples call snp? #680

Closed egnarora closed 11 months ago

egnarora commented 11 months ago

Hey there, I have called SNPs separately on individual BAM files using this tool under Docker and then merged the resulting VCF files with GLnexus. I would like to ask whether I can instead call SNPs jointly on this whole group of BAM files at the same time. Yours sincerely.

pgrosu commented 11 months ago

Hi @egnarora,

Your initial approach is the correct one. Since you have multiple samples, and DeepVariant only does single sample calling, it would not be advisable to merge your BAM files into a single sample BAM file. Regarding having a multi-sample BAM file, that would not work with DeepVariant as it only accepts one sample.

If in fact the whole group are the same sample, then yes you can merge, otherwise no.
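For readers following along, the per-sample-then-merge workflow described above can be sketched as a small command builder. This is a minimal sketch, not an official recipe: the file names, the output directory, the shard count, and the convention of taking the sample ID from the BAM file name are all hypothetical, and the `WGS` model type and `DeepVariantWGS` GLnexus config are assumptions that should be checked against your data type and installed versions. It only builds and prints the commands; it does not run them.

```python
import shlex
from pathlib import Path


def deepvariant_cmd(bam, ref, out_dir, model_type="WGS", num_shards=4):
    """Build a run_deepvariant command for ONE sample; return (command, gVCF path)."""
    # Hypothetical naming convention: sample ID is the BAM file name without ".bam".
    sample = Path(bam).stem
    gvcf = f"{out_dir}/{sample}.g.vcf.gz"
    cmd = [
        "run_deepvariant",
        f"--model_type={model_type}",
        f"--ref={ref}",
        f"--reads={bam}",
        f"--output_vcf={out_dir}/{sample}.vcf.gz",
        f"--output_gvcf={gvcf}",
        f"--num_shards={num_shards}",
    ]
    return cmd, gvcf


def glnexus_cmd(gvcfs, config="DeepVariantWGS"):
    """Build a glnexus_cli command that merges per-sample gVCFs (BCF is written to stdout)."""
    return ["glnexus_cli", "--config", config, *gvcfs]


# Hypothetical inputs: one BAM per sample, never a merged multi-sample BAM.
bams = ["sampleA.bam", "sampleB.bam"]
per_sample = [deepvariant_cmd(bam, "ref.fa", "out") for bam in bams]
gvcfs = [gvcf for _, gvcf in per_sample]
merge = glnexus_cmd(gvcfs)

for cmd, _ in per_sample:
    print(shlex.join(cmd))
print(shlex.join(merge) + " > cohort.bcf")
```

The point of the structure is that each DeepVariant invocation sees exactly one sample, and only the gVCF outputs are combined in the final step.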

Hope it helps, Paul

egnarora commented 11 months ago

Dear Paul, Thank you very much for your answer. I still have a small question: is there a big difference between the result of this merging approach and the result of GATK joint calling? Will it have any impact on my subsequent analysis? For example, will SNPs that are missing in some individuals be absent from the merged VCF? With best wishes, Cheng

pgrosu commented 11 months ago

Dear Cheng,

Yes, there can be some differences between DeepVariant with GLnexus (with optimization) and GATK joint genotyping. If you look at the following paper:

Accurate, scalable cohort variant calls using DeepVariant and GLnexus

By optimizing GLnexus parameters (such as minimum quality thresholds, among others listed in Supplementary Table 4), the authors obtained a slightly different number of SNPs than with GATK joint genotyping:

[image: Supplementary Figure 11 (A) from the paper]

This is from Supplementary Figure 11 (A) found under Supplementary data, listed as a link on this page.

So merging DeepVariant gVCF output files with GLnexus (as @AndrewCarroll mentioned in a previous post), following Best practices for multi-sample variant calling with DeepVariant, was found to be more accurate than using those gVCFs with GATK GenotypeGVCFs.

Regarding missing SNPs in individual samples, their genotype might get a no-call (./.), as noted in this line of the GLnexus code:

Reason for No Call in GT: . = n/a, M = Missing data, P = Partial data, I = gVCF input site is non-called, D = insufficient Depth of coverage, - = unrepresentable overlapping deletion, L = Lost/unrepresentable allele (other than deletion), U = multiple Unphased variants present, O = multiple Overlapping variants present, 1 = site is Monoallelic, no assertion about presence of REF or ALT allele

Though it could probably also get called as homozygous reference, if all the QC checks pass.
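To see why a particular sample ended up as a no-call in the merged VCF, the RNC codes quoted above can be decoded per sample. The following is a minimal sketch: the code-to-reason table is copied from the description above, while the helper names, the assumption that RNC carries one character per allele in GT, and the example record are hypothetical.

```python
# Code-to-reason table taken from the RNC description quoted above.
RNC_REASONS = {
    ".": "n/a",
    "M": "Missing data",
    "P": "Partial data",
    "I": "gVCF input site is non-called",
    "D": "insufficient Depth of coverage",
    "-": "unrepresentable overlapping deletion",
    "L": "Lost/unrepresentable allele (other than deletion)",
    "U": "multiple Unphased variants present",
    "O": "multiple Overlapping variants present",
    "1": "site is Monoallelic, no assertion about presence of REF or ALT allele",
}


def decode_rnc(rnc):
    """Translate an RNC value into one reason per character.

    Assumption: one code per allele in GT; commas are dropped in case the
    value is written as a comma-separated list.
    """
    codes = rnc.replace(",", "")
    return [RNC_REASONS.get(code, f"unknown code {code!r}") for code in codes]


def sample_rnc(format_field, sample_field):
    """Pull RNC out of one sample column, given the VCF FORMAT column."""
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))
    return decode_rnc(fields["RNC"]) if "RNC" in fields else []


# Hypothetical sample column from a merged VCF: a no-call caused by low depth.
reasons = sample_rnc("GT:DP:RNC", "./.:3:DD")
print(reasons)
```

Tallying these reasons across samples gives a quick picture of whether no-calls come mostly from low coverage or from non-called input sites.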

Regarding impact on downstream analysis, probably the best bet is to try both approaches (DV-GLN-OPT and GATK-Joint) in parallel, and then validate both results.
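As a first look before proper validation, the two call sets can be compared by site and allele. The sketch below is only an illustration under simplifying assumptions: it reads uncompressed VCF text, treats a variant as the tuple (chromosome, position, REF, ALT), does no normalization of indel representation, and ignores genotypes entirely. The function names and the toy records are hypothetical, and this does not replace validation against a truth set.

```python
def variant_keys(vcf_text):
    """Collect (chrom, pos, ref, alt) keys from uncompressed VCF text."""
    keys = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):  # split multi-allelic sites
            keys.add((chrom, int(pos), ref, allele))
    return keys


def compare_callsets(vcf_a, vcf_b):
    """Count variants shared by both call sets and unique to each."""
    a, b = variant_keys(vcf_a), variant_keys(vcf_b)
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}


def record(chrom, pos, ref, alt):
    """Build a minimal tab-separated VCF record (toy data only)."""
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", "PASS", "."])


# Toy call sets standing in for the DeepVariant+GLnexus and GATK joint outputs.
dv_glnexus = "\n".join(["##fileformat=VCFv4.2", record("chr1", 100, "A", "G"), record("chr1", 200, "C", "T,G")])
gatk_joint = "\n".join(["##fileformat=VCFv4.2", record("chr1", 100, "A", "G"), record("chr1", 300, "G", "A")])

summary = compare_callsets(dv_glnexus, gatk_joint)
print(summary)
```

Sites that appear in only one call set are the ones worth inspecting first when deciding which approach suits the downstream analysis.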

Hope it helps, Paul

AndrewCarroll commented 11 months ago

Hi @egnarora

The way you are running DeepVariant (run on individual samples, then genotype jointly with GLnexus) is correct and is what we recommend. Thank you @pgrosu, your answer is in agreement with that recommendation.

@egnarora some external groups have analyzed strategies that use more extensive joint-calling processes with DeepVariant (for example, discovering all variants in a cohort and then experimentally force-calling the candidate positions). Regeneron is one group that has conducted this analysis. Their conclusion was that the more extensive joint calling does not recover variant calls missed by the per-sample process, so the recommended GLnexus approach will not miss variants that another approach would capture.

Hopefully this answers your question.

Thank you, Andrew

pgrosu commented 11 months ago

Hi Andrew,

Thank you for the nice words, and it's great to hear of the independent empirical confirmation!

It was fun going through the paper again, as the ideas became even better reinforced the second time around :)

Thank you, Paul

egnarora commented 11 months ago

Dear Andrew and Paul, Thank you both very much for your kind answers and advice; they will be very helpful in my next endeavours. DeepVariant is a very efficient and useful piece of software, and thank you for all your hard work in making it available to us. I will follow your advice and read some of the other research papers.

Sincere thanks again. Cheng

pgrosu commented 11 months ago

Hi Cheng,

Thank you for the kind words, and I am happy to hear it was helpful for you! It was a fun collective team effort :)

Feel free to drop by anytime if you need help with anything related to DeepVariant in the future.

Best of luck in your forthcoming analysis! Paul