google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.2k stars 722 forks source link

Issues merging gVCFs with GLnexus #633

Closed Dani-kolbe closed 1 year ago

Dani-kolbe commented 1 year ago

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.5/docs/FAQ.md: Yep

Describe the issue: The issue is actually with GLnexus, however noone is replying there (people with same issues), and since it is recommended here in the deepvariant pipeline, I thought I would ask here.

I have produced a large set of gVCF files using deepvariant and merged these with GLnexus as recommended. However, there seems to be an issue in the final merged VCF file. Many genotypes are called as 0/0 when they have very low or zero DP: e.g. 1 1319056 1_1319056_A_G A G 51 . AF=0.32848;AQ=51 GT:DP:AD:GQ:PL:RNC 0/0:0:0,0:1:0,0,0:.. 1/1:2:0,2:3:29,6,0:.. 0/0:0:0,0:1:0,0,0:.. ./.:1:1,0:0:29,3,0:II ./.:1:1,0:0:29,3,0:II ./.:1:1,0:0:29,3,0:II ./.:1:1,0:0:29,3,0:II 0/1:2:0,2:1:12,2,0:.. 0/0:0:0,0:1:0,3,29:.. 0/0:0:0,0:1:0,0,0:.. 0/0:0:0,0:1:0,0,0:.. ./.:1:1,0:0:29,3,0:II ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. 0/0:0:0,0:1:0,0,0:.. ./.:1:1,0:0:29,3,0:II ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. 0/0:0:0,0:1:0,0,0:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. 0/0:0:0,0:1:0,0,0:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. 0/0:0:0,0:1:0,0,0:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. 1/1:5:0,5:9:40,12,0:.. 1/1:3:0,3:7:36,10,0:.. 0/0:0:0,0:1:0,3,29:.

This is messing with downstream analysis, and overall just looks like poor QC. Additionally, the annotation/filter field is missing. In the gVCFs there was still a "PASS" label. This is also required for downstream analysis. So I am wondering where I went wrong, or whether there is a more suitable software to merge gVCFs. Thank you!

Setup

Steps to reproduce:

This is the vcf header:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##GLnexusVersion=v1.4.1-0-g68e25e5
##GLnexusConfigName=DeepVariant
##GLnexusConfigCRC32C=2932316105
##GLnexusConfig={unifier_config: {drop_filtered: false, min_allele_copy_number: 1, min_AQ1: 10, min_AQ2: 10, min_GQ: 0, max_alleles_per_site: 32, monoallelic_sites_for_lost_alleles: true, preference: common}, genotyper_config: {revise_genotypes: true, min_assumed_allele_frequency: 9.99999975e-05, snv_prior_calibration: 0.600000024, indel_prior_calibration: 0.449999988, required_dp: 0, allow_partial_data: true, allele_dp_format: AD, ref_dp_format: MIN_DP, output_residuals: false, more_PL: true, squeeze: false, trim_uncalled_alleles: true, top_two_half_calls: false, output_format: BCF, liftover_fields: [{orig_names: [MIN_DP, DP], name: DP, description: "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read depth (reads with MQ=255 or with bad mates are filtered)\">", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}, {orig_names: [AD], name: AD, description: "##FORMAT=<ID=AD,Number=R,Type=Integer,Description=\"Allelic depths for the ref and alt alleles in the order listed\">", type: int, number: alleles, default_type: zero, count: 0, combi_method: min, ignore_non_variants: false}, {orig_names: [GQ], name: GQ, description: "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">", type: int, number: basic, default_type: missing, count: 1, combi_method: min, ignore_non_variants: true}, {orig_names: [PL], name: PL, description: "##FORMAT=<ID=PL,Number=G,Type=Integer,Description=\"Phred-scaled genotype Likelihoods\">", type: int, number: genotype, default_type: missing, count: 0, combi_method: missing, ignore_non_variants: true}]}}
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency estimate for each alternate allele">
##INFO=<ID=AQ,Number=A,Type=Integer,Description="Allele Quality score reflecting evidence for each alternate allele (Phred scale)">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##FILTER=<ID=MONOALLELIC,Description="Site represents one ALT allele in a region with multiple variants that could not be unified into non-overlapping multi-allelic sites">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=RNC,Number=2,Type=Character,Description="Reason for No Call in GT: . = n/a, M = Missing data, P = Partial data, I = gVCF input site is non-called, D = insufficient Depth of coverage, - = unrepresentable overlapping deletion, L = Lost/unrepresentable allele (other than deletion), U = multiple Unphased variants present, O = multiple Overlapping variants present, 1 = site is Monoallelic, no assertion about presence of REF or ALT allele">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype Likelihoods">
tedyun commented 1 year ago

Hello,

I think it'd be helpful if you can share individual gVCF calls for the problematic site before merging (hopefully with a fewer number of samples) so that we can try to reproduce it. As you said we'll have to ask GLnexus maintainers about this issue. I found this issue on the GLnexus repo https://github.com/dnanexus-rnd/GLnexus/issues/286 - I'd recommend adding a reproducible example there as well.

About the PASS filter, if your downstream application explicitly requires having that filter value, I'd recommend using tools like bcftools to add it. I think bcftools annotate --rename-annots would work based on this page, using FILTER/. PASS as the mapping: https://samtools.github.io/bcftools/bcftools.html#annotate

Thank you.

Best, Ted

Dani-kolbe commented 1 year ago

Hi Ted, thank you for your reply! This is what the site looks like in a couple of gvcf files directly from deepvariant:


1   1318679 .   A   <*> 0   .   END=1319106 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0
1   1319056 .   A   G,<*>   30.7    PASS    .   GT:GQ:DP:AD:VAF:PL  1/1:8:2:0,2,0:1,0:29,6,0,990,990,990
1   1318679 .   A   <*> 0   .   END=1319087 GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0

Thank you for your help! All the best, Daniel

AndrewCarroll commented 1 year ago

Hi @Dani-kolbe

Just to understand your desired behavior, do you want the final VCF calls for low evidence sites to be uniformly ./., or do you prefer all ./. listed as 0/0?

Dani-kolbe commented 1 year ago

Hi Andrew, thank you again for your reply. I think ./. is fine - this will be handled as missing data by something like PLINK? And DV will filter out site with overall low confidence, right? Best wishes, Daniel

pichuan commented 1 year ago

I'm a bit uncertain about the status of this thread. It also seems like there was another thread asking about filtering.

@Dani-kolbe Is there any questions we can still help with in this thread?

Dani-kolbe commented 1 year ago

Hi. Everything is fine thanks! closing thread, sorry