google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License
3.2k stars 722 forks source link

Filtering clear false positives / low-confidence sites #645

Closed Dani-kolbe closed 1 year ago

Dani-kolbe commented 1 year ago

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.5/docs/FAQ.md: Yes

Describe the issue: Downstream association analysis has yielded a high number of false positive findings, essentially a product of low quality data. It is crucial that these sites are filtered out.

Setup

Steps to reproduce:

DeepVariant: /opt/deepvariant/bin/run_deepvariant \ --model_type WES \ --ref ${ref} \ --reads ${cram_in} \ --regions ${regions} \ --output_gvcf ${sample}.g.vcf.gz \ --output_vcf ${sample}.vcf.gz \ --num_shards 8 \

GLnexus: glnexus_cli --config DeepVariantWES --bed ${regions} \ 2_gvcf/*.g.vcf.gz > 3_bcf/Exomes.bcf

Any additional context:

Hi there! Apologies for bringing up another similar issue, but I would like some help with the correct filtering of my merged vcf file. Essentially, I have identified a significant number of false positive sites in a downstream assoc. analysis, where MAF for these variants is widely different than the population average. This strongly suggests that these sites are of low quality and need to be filtered out. Here are some examples from the merged vcf file. For each variant I have only shown a handful of samples (total is over 5000):

False positive / bad site that needs filtering:

1 1722625 1_1722625_A_T A T 48 . AF=0.222894;AQ=48 GT:DP:AD:GQ:PL:RNC ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. 1/1:7:0,7:42:44,47,0:.. 0/0:0:0,0:1:0,0,0:.. 1/1:6:0,6:36:38,38,0:.. 0/1:12:3,9:0:19,2,0:.. ./.:3:3,0:0:20,0,50:II 1/1:2:0,2:23:29,25,0:.. 0/0:2:2,0:6:0,6,59:.. 1/1:2:0,2:22:31,24,0:.. 1/1:2:0,2:26:28,29,0:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. ./.:1:1,0:0:29,3,0:II 1/1:7:0,7:40:43,42,0:.. ./.:3:3,0:0:20,0,50:II 1/1:4:1,3:1:28,3,0:.. 0/0:0:0,0:1:0,0,0:.. 0/0:33:33,0:50:0,123,1229:.. 1/1:5:3,2:13:21,15,0:.. 0/0:0:0,0:1:0,0,0:.. 0/0:0:0,0:1:0,0,0:.. 0/0:0:0,0:1:0,0,0:.. ./.:3:3,0:0:20,0,50:II 0/0:4:4,0:12:0,12,119:.. 0/0:25:25,0:48:0,75,749:.. 1/1:8:1,7:17:39,19,0:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. 1/1:9:0,9:33:38,35,0:.. ./.:2:2,0:0:23,0,23:II 1/1:2:0,2:28:37,30,0:.. 0/0:20:20,0:50:0,99,989:.. ./.:2:2,0:0:23,0,23:II ./.:2:2,0:0:23,0,23:II ./.:1:1,0:0:29,3,0:II ./.:1:1,0:0:29,3,0:II 1/1:3:0,3:37:39,40,0:.. 1/1:5:0,5:31:34,33,0:.. ./.:5:5,0:0:14,0,104:II 0/0:26:26,0:18:0,18,659:.. 0/0:0:0,0:1:0,0,0:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. ./.:1:1,0:0:29,3,0:II 1/1:3:0,3:25:33,27,0:.. 0/0:20:20,0:50:0,204,2039:.. ./.:1:1,0:0:29,3,0:II ./.:2:2,0:0:23,0,23:II ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. 1/1:3:0,3:26:34,28,0:.. 0/0:0:0,0:1:0,0,0:.. 0/1:56:46,10:26:26,0,32:.. ./.:1:1,0:0:29,3,0:II ./.:2:2,0:0:23,0,23:II 0/0:16:16,0:48:0,144,1439:.. ./.:1:1,0:0:29,3,0:II ./.:2:2,0:0:23,0,23:II ./.:6:6,0:0:11,0,131:II 1/1:5:0,5:32:36,34,0:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,0,0:.. 1/1:8:1,7:21:36,23,0:.. 1/1:5:0,5:32:37,34,0:.. 1/1:9:0,9:36:39,38,0:.. ./.:1:1,0:0:29,3,0:II ./.:3:3,0:0:20,0,50:II 1/1:5:0,5:30:38,32,0:.. ./.:11:9,2:3:0,2,5:II 0/0:23:23,0:50:0,189,1889:.. ./.:3:3,0:0:20,0,50:II ./.:3:3,0:0:20,0,50:II 0/1:8:3,5:11:29,0,10:.. 0/0:33:33,0:48:0,117,1169:.. 1/1:9:0,9:37:39,41,0:.. 0/0:0:0,0:1:0,0,0:.. 1/1:2:0,2:29:32,31,0:.. ./.:1:1,0:0:29,3,0:II ./.:2:2,0:0:23,0,23:II 0/1:9:2,7:4:15,0,1:.. ./.:7:7,0:0:8,0,158:II 1/1:5:0,5:37:39,41,0:.. 0/1:32:22,10:15:25,0,14:.. 1/1:3:0,3:35:38,37,0:.. 0/0:0:0,0:1:0,0,0:..

There are many ./. and 0/0 with low coverage / confidence, where DV thinks this is a ref. The actual alt-AF of this site should be closer to 0.80.

Here is another:

1 19626624 1_19626624_G_T G T 52 . AF=0.021795;AQ=52 GT:DP:AD:GQ:PL:RNC 0/1:3:0,3:2:24,4,0:.. 0/1:3:0,3:2:23,4,0:.. 1/1:4:0,4:2:28,8,0:.. 1/1:5:0,5:5:34,11,0:.. 0/1:2:0,2:2:19,4,0:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,3,29:.. 1/1:4:0,4:2:30,8,0:.. 0/1:3:0,3:1:20,5,0:.. 0/1:2:0,2:3:22,3,0:.. 0/0:0:0,0:1:0,3,29:.. 0/1:3:0,3:0:28,6,0:.. 0/1:2:0,2:3:19,3,0:.. ./.:1:1,0:0:29,3,0:II 1/1:4:0,4:2:28,8,0:.. 0/1:2:0,2:0:17,6,0:.. 0/1:2:0,2:3:19,3,0:.. ./.:1:1,0:0:29,3,0:II 0/1:2:0,2:3:14,3,0:.. 0/1:3:0,3:0:27,6,0:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,3,29:.. ./.:4:4,0:0:17,0,77:II ./.:1:1,0:0:29,3,0:II ./.:1:1,0:0:29,3,0:II 1/1:4:0,4:2:30,8,0:.. 0/0:0:0,0:1:0,3,29:.. 0/0:0:0,0:1:0,3,29:.. 0/1:3:0,3:0:26,6,0:.. 0/1:7:4,3:27:27,0,45:.. ./.:1:1,0:0:29,3,0:II ./.:1:1,0:0:29,3,0:II 0/1:2:0,2:2:20,4,0:.. 0/1:2:0,2:3:19,3,0:.. 0/0:0:0,0:1:0,3,29:.. 1/1:4:0,4:3:30,9,0:.. 0/1:3:0,3:0:22,6,0:.. 1/1:4:0,4:1:32,7,0:.. 0/0:2:2,0:6:0,9,89:.. 0/1:2:0,2:2:21,4,0:.. 0/1:2:0,2:2:19,4,0:.. 0/1:2:0,2:2:20,4,0:.. 0/1:3:0,3:1:25,5,0:.. 0/1:3:0,3:2:27,4,0:.. 0/0:0:0,0:1:0,3,29:.. 0/1:2:0,2:3:19,3,0:.. ./.:1:1,0:0:29,3,0:II ./.:1:1,0:0:29,3,0:II 1/1:4:0,4:2:30,8,0:.. ./.:1:1,0:0:29,3,0:II 0/0:0:0,0:1:0,3,29:.. 1/1:8:0,8:12:41,18,0:.. 0/1:2:0,2:2:22,4,0:.. 1/1:6:0,6:6:30,12,0:.. 0/1:2:0,2:2:16,4,0:.. 0/1:2:0,2:3:15,3,0:.. ./.:1:1,0:0:29,3,0:II

The overall coverage of both these sites is really quite low... compared to a site such as this, which has good coverage and no ./. or gVCF generated 0/0:

1 6633042 1_6633042_C_T C T 57 . AF=0.025366;AQ=57 GT:DP:AD:GQ:PL:RNC 0/0:62:62,0:50:0,258,2579:.. 0/0:50:50,0:50:0,180,1799:.. 0/0:60:60,0:50:0,195,1949:.. 0/0:57:57,0:50:0,252,2519:.. 0/0:57:57,0:50:0,228,2279:.. 0/0:39:39,0:50:0,159,1589:.. 0/0:46:46,0:50:0,189,1889:.. 0/0:23:23,0:48:0,69,689:.. 0/0:25:25,0:50:0,75,749:.. 0/0:23:23,0:50:0,69,689:.. 0/0:40:40,0:50:0,189,1889:.. 0/0:63:63,0:50:0,171,1949:.. 0/0:56:56,0:50:0,213,2129:.. 0/0:53:53,0:50:0,159,1589:.. 0/0:60:60,0:50:0,213,2129:.. 0/0:24:24,0:50:0,72,719:.. 0/0:49:49,0:50:0,147,1469:.. 0/0:27:27,0:50:0,51,749:.. 0/0:40:40,0:50:0,156,1559:.. 0/0:19:19,0:50:0,57,569:.. 0/0:43:43,0:50:0,180,1799:.. 0/0:21:21,0:50:0,66,659:.. 0/0:69:69,0:50:0,207,2069:.. 0/0:16:16,0:48:0,48,479:.. 0/0:69:69,0:50:0,207,2069:.. 0/0:46:46,0:50:0,138,1379:.. 0/1:78:36,39:47:48,0,52:.. 0/0:49:49,0:50:0,159,1589:.. 0/0:25:25,0:50:0,81,809:.. 0/0:101:101,0:50:0,300,2999:.. 0/1:77:40,36:50:51,0,55:.. 0/0:65:65,0:50:0,240,2399:.. 0/0:22:22,0:48:0,66,659:.. 0/0:55:55,0:50:0,219,2189:.. 0/0:38:38,0:50:0,210,2099:.. 0/0:56:56,0:50:0,171,1709:.. 0/1:78:32,46:46:49,0,48:..

It would be good to know the most appropriate way to remove the bad variants. As I understand the GQ field is the best place to start. If I am correct, the GLnexus command which I used filters by GQ > 20 ( see: https://github.com/dnanexus-rnd/GLnexus/wiki/Configuration ).

Perhaps upping this GQ to > 30 or slightly higher would remove this sites? I am very appreciative of every input, thank you so much!

danielecook commented 1 year ago

@Dani-kolbe how you filter will be dependent on the type of analysis and how strict you want to be.

Can you clarify - are all of these variants listed considered false positives or just the first two (low coverage) sites?

GQ assesses genotype quality. It will be correlated with depth, but is also impacted by other factors, and as you see the GQ can still be high even for some variants with low depth. Can you apply filteres that target low depth or missingness?

For example:

Dani-kolbe commented 1 year ago

Hi Daniel! Thank you for your reply. I will have a look and get back here with a response. On a side not, is there:

danielecook commented 1 year ago

@Dani-kolbe this filtering process is beyond the scope of DeepVariant, I'm not sure whether GLnexus will be helpful here. Personally, I would use bcftools.

I believe something like this would help with filtering sites with high missing rates:

bcftools filter --include "F_PASS(GT!='mis') > 0.05"

See the bcftools manual and specifically the expressions section for more details.