Open lbergelson opened 5 years ago
I feel like this should happen all the time when there are phased calls. You're telling me that for
GT:AD:DP:GQ:PGT:PID:PL 0/1:44,6:50:99:0|1:14590_G_A:135,0,2509 0/1:28,8:36:99:0|1:14599_T_A:249,0,1923 0/1:21,6:27:99:0|1:14590_G_A:186,0,1528 0/1:42,9:51:99:0|1:14599_T_A:210,0,2353 0/0:11,0:11:0:.:.:0,0,163 0/0:4,0:4:12:.:.:0,12,119 0/0:15,0:15:39:.:.:0,39,585 ./.:0,0:0:.:.:.:0,0,0 0/0:11,0:11:30:.:.:0,30,382
the homRef calls won't have PLs?\
Ack, I wrote it wrong. What I meant was a single element in the PL field is missing, not a different unrelated field is missing. So a PL with a value like 0,.,382
first PL 0, second missing, third 382.
For PLs that doesn't make any sense to me, but for AD I can imagine a partially missing AD field making sense.
I've never made a missing AD count -- just zeros. Would a missing count for
I'm going to vote for 4. because I don't want anything to break and I think this isn't a widespread problem. Maybe when we go to 4.3 we can be more strict?
Hello @lbergelson and HTSJDK developers,
We ran into this issue today when we noticed a loss of AD information. We're using HTSJDK within our human rare disease variant interpretation pipeline. This is an example for a multi-nucleotide variant:
## inputs
# index
1 153233485 . TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC T,<NON_REF> 36.60 . BaseQRankSum=1.617;ClippingRankSum=0.000;DP=31;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=111600.00;ReadPosRankSum=-3.258 GT:AD:DP:GQ:PL:SB 0/1:20,9,0:29:44:44,0,752,105,769,873:11,9,3,6
--> index: `GT=0/1, AD=20,9,0`
1 153233488 rs150026164 C CGGCGGT,<NON_REF> 455.64 . BaseQRankSum=-1.731;ClippingRankSum=0.000;DB;DP=27;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=97200.00;ReadPosRankSum=3.949 GT:AD:DP:GQ:PL:SB 0/1:10,12,0:22:84:463,0,84,484,120,603:4,6,3,9
--> index: `GT=0/1, AD=10,12,0`
# mother
1 153233485 . T <NON_REF> . . END=153233485 GT:DP:GQ:MIN_DP:PL 0/0:51:39:51:0,39,1090
1 153233488 rs150026164 C CGGCGGT,<NON_REF> 881.06 . DB;DP=31;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=111600.00 GT:AD:DP:GQ:PL:SB 1/1:0,22,0:22:65:895,65,0,895,66,896:0,0,11,11
# father
1 153233485 . T <NON_REF> . . END=153233485 GT:DP:GQ:MIN_DP:PL 0/0:33:51:33:0,51,765
1 153233488 rs150026164 C CGGCGGT,<NON_REF> 44.64 . BaseQRankSum=-1.069;ClippingRankSum=0.000;DB;DP=15;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=54000.00;ReadPosRankSum=1.290 GT:AD:DP:GQ:PL:SB 0/1:4,7,0:11:52:52,0,120,64,139,202:2,2,3,4
## step 1: joint variant calling using GLnexus
1 153233485 1_153233488_C_CGGCGGT;1_153233485_TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC_T TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC TGGCGGCGGTGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC,T 895 . AF=0.666667,0.166667;AQ=895,44 GT:DP:AD:SB:GQ:PL:RNC 1/2:22:.,12,9:.,.,.,.:44:.:.. 1/1:22:0,22,0:.,.,.,.:65:895,65,0,.,.,.:.. 0/1:11:4,7,0:.,.,.,.:52:52,0,120,.,.,.:..
--> index: `GT=1/2, AD=.,12,9`
## step 2: variant normalization using BCFtools
1 153233485 1_153233488_C_CGGCGGT;1_153233485_TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC_T TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC T 895 . AF=0.166667;AQ=44 GT:DP:AD:SB:GQ:PL:RNC 0/1:22:.,9:.,.,.,.:44:.:.. 0/0:22:0,0:.,.,.,.:65:895,.,.:.. 0/0:11:4,0:.,.,.,.:52:52,.,.:..
--> index: GT=0/1,AD=.,9
1 153233488 1_153233488_C_CGGCGGT;1_153233485_TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC_T C CGGCGGT 895 . AF=0.666667;AQ=895 GT:DP:AD:SB:GQ:PL:RNC 1/0:22:.,12:.,.,.,.:44:.:.. 1/1:22:0,22:.,.,.,.:65:895,65,0:.. 0/1:11:4,7:.,.,.,.:52:52,0,120:..
--> index: GT=1/0,AD=.,12
## step 3: our own tool using HTSJDK
1 153233485 1_153233488_C_CGGCGGT;1_153233485_TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC_T TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC T 895 . AF=0.166667;AQ=44;CSQ=-|inframe_deletion|MODERATE|LOR|4014|Transcript|NM_000427.3|protein_coding|2/2||NM_000427.3:c.69_215del|NP_000418.2:p.Gly28_Gly76del|139-285/1245|69-215/939|23-72/312|GGGGGGSGGGGCGFFGGGGSGGGSSGSGCGYSGGGGYSGGGCGGGSSGGG/G|ggTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCCGGCGGCGGg/ggg||1||1||1|EntrezGene|||||||8|||||||||||||||||||||VUS|0.62841135||||||||AD||||||||||||||||||||||keep|phix&filter&is_cnv&artefact&hla&gnomad_af&gnomad_hn&spliceAI&capice&exit_keep GT:AD:DP:GQ:RNC:SB:VI:VIC:VID:VIG:VIM 0/1:.:22:44:..:.,.,.,.:AD:.:1:4014:1 0/0:0,0:22:65 0/0:4,0:11:52
--> index: GT=1/0,AD=.
Personally I would prefer option 3., in my opinion a general purpose tool should strife for correctness. For that same reason I'm not in favor of 2. and 4. since this is actually valid VCF that has it use cases. I agree that 1. is practical, it seems to have worked out well over the last couple of years :)
How about these options:
Thank you for your great tools, highly appreciated!
Greetings, @dennishendriksen
As part of reviewing https://github.com/samtools/htsjdk/pull/1372 I noticed that if we parse an AD or PL field in a vcf that has a missing value as one element (ex:
10,.,5
edit:this was previously incorrectly10:.:5
), it looks like we'll currently treat the whole field as missing.This is a bug and we'll silently lose data. We have a few options:
null
in it. This works for sites that are expecting aList
but will probably cause crashes in downstream tools anyway since no one is expecting null values. It's probably the most correct answer though. It will also cause problems for callers that are expecting an array since you can represent null values in an int[]..
Same as now but at least people will know it's happening.Any thoughts @yfarjoun @ldgauthier