samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
283 stars 242 forks source link

Partially missing AD and PL are treated as empty #1385

Open lbergelson opened 5 years ago

lbergelson commented 5 years ago

As part of reviewing https://github.com/samtools/htsjdk/pull/1372 I noticed that if we parse an AD or PL field in a vcf that has a missing value as one element (ex: 10,.,5 edit:this was previously incorrectly 10:.:5), it looks like we'll currently treat the whole field as missing.

This is a bug and we'll silently lose data. We have a few options:

  1. Do nothing, no one has noticed or cared yet so probably no one cares. This seems practical but also bad.
  2. Throw an exception: easy to do, might cause crashes if these things exist in the wild.
  3. Return a list with null in it. This works for sites that are expecting a List but will probably cause crashes in downstream tools anyway since no one is expecting null values. It's probably the most correct answer though. It will also cause problems for callers that are expecting an array since you can represent null values in an int[].
  4. Emit a warning and treat the field as . Same as now but at least people will know it's happening.

Any thoughts @yfarjoun @ldgauthier

ldgauthier commented 5 years ago

I feel like this should happen all the time when there are phased calls. You're telling me that for

GT:AD:DP:GQ:PGT:PID:PL  0/1:44,6:50:99:0|1:14590_G_A:135,0,2509 0/1:28,8:36:99:0|1:14599_T_A:249,0,1923 0/1:21,6:27:99:0|1:14590_G_A:186,0,1528 0/1:42,9:51:99:0|1:14599_T_A:210,0,2353 0/0:11,0:11:0:.:.:0,0,163   0/0:4,0:4:12:.:.:0,12,119   0/0:15,0:15:39:.:.:0,39,585 ./.:0,0:0:.:.:.:0,0,0   0/0:11,0:11:30:.:.:0,30,382

the homRef calls won't have PLs?\

lbergelson commented 5 years ago

Ack, I wrote it wrong. What I meant was a single element in the PL field is missing, not a different unrelated field is missing. So a PL with a value like 0,.,382 first PL 0, second missing, third 382.

lbergelson commented 5 years ago

For PLs that doesn't make any sense to me, but for AD I can imagine a partially missing AD field making sense.

ldgauthier commented 5 years ago

I've never made a missing AD count -- just zeros. Would a missing count for make more sense? Maybe, maybe not.

I'm going to vote for 4. because I don't want anything to break and I think this isn't a widespread problem. Maybe when we go to 4.3 we can be more strict?

dennishendriksen commented 1 year ago

Hello @lbergelson and HTSJDK developers,

We ran into this issue today when we noticed a loss of AD information. We're using HTSJDK within our human rare disease variant interpretation pipeline. This is an example for a multi-nucleotide variant:

## inputs

# index
1   153233485       .       TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC                                T,<NON_REF>      36.60   .       BaseQRankSum=1.617;ClippingRankSum=0.000;DP=31;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=111600.00;ReadPosRankSum=-3.258                                                           GT:AD:DP:GQ:PL:SB        0/1:20,9,0:29:44:44,0,752,105,769,873:11,9,3,6
--> index: `GT=0/1, AD=20,9,0`
1   153233488       rs150026164     C       CGGCGGT,<NON_REF>       455.64  .       BaseQRankSum=-1.731;ClippingRankSum=0.000;DB;DP=27;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=97200.00;ReadPosRankSum=3.949  GT:AD:DP:GQ:PL:SB       0/1:10,12,0:22:84:463,0,84,484,120,603:4,6,3,9
--> index: `GT=0/1, AD=10,12,0`

# mother
1   153233485       .       T       <NON_REF>       .       .       END=153233485   GT:DP:GQ:MIN_DP:PL                                                                                                          0/0:51:39:51:0,39,1090
1   153233488       rs150026164     C       CGGCGGT,<NON_REF>       881.06  .       DB;DP=31;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=111600.00                                                        GT:AD:DP:GQ:PL:SB        1/1:0,22,0:22:65:895,65,0,895,66,896:0,0,11,11

# father
1   153233485       .       T       <NON_REF>       .       .       END=153233485   GT:DP:GQ:MIN_DP:PL                                                                                                          0/0:33:51:33:0,51,765
1   153233488       rs150026164     C       CGGCGGT,<NON_REF>       44.64   .       BaseQRankSum=-1.069;ClippingRankSum=0.000;DB;DP=15;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQ=54000.00;ReadPosRankSum=1.290  GT:AD:DP:GQ:PL:SB       0/1:4,7,0:11:52:52,0,120,64,139,202:2,2,3,4

## step 1: joint variant calling using GLnexus
1       153233485       1_153233488_C_CGGCGGT;1_153233485_TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC_T        TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC    TGGCGGCGGTGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC,T    895     .       AF=0.666667,0.166667;AQ=895,44  GT:DP:AD:SB:GQ:PL:RNC   1/2:22:.,12,9:.,.,.,.:44:.:..   1/1:22:0,22,0:.,.,.,.:65:895,65,0,.,.,.:..      0/1:11:4,7,0:.,.,.,.:52:52,0,120,.,.,.:..
--> index: `GT=1/2, AD=.,12,9`

## step 2: variant normalization using BCFtools
1       153233485       1_153233488_C_CGGCGGT;1_153233485_TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC_T        TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC    T       895     .       AF=0.166667;AQ=44       GT:DP:AD:SB:GQ:PL:RNC   0/1:22:.,9:.,.,.,.:44:.:..      0/0:22:0,0:.,.,.,.:65:895,.,.:..        0/0:11:4,0:.,.,.,.:52:52,.,.:..
--> index: GT=0/1,AD=.,9
1       153233488       1_153233488_C_CGGCGGT;1_153233485_TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC_T        C       CGGCGGT 895     .       AF=0.666667;AQ=895      GT:DP:AD:SB:GQ:PL:RNC   1/0:22:.,12:.,.,.,.:44:.:..     1/1:22:0,22:.,.,.,.:65:895,65,0:..      0/1:11:4,7:.,.,.,.:52:52,0,120:..
--> index: GT=1/0,AD=.,12

## step 3: our own tool using HTSJDK
1       153233485       1_153233488_C_CGGCGGT;1_153233485_TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC_T        TGGCGGCGGTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCC    T       895     .       AF=0.166667;AQ=44;CSQ=-|inframe_deletion|MODERATE|LOR|4014|Transcript|NM_000427.3|protein_coding|2/2||NM_000427.3:c.69_215del|NP_000418.2:p.Gly28_Gly76del|139-285/1245|69-215/939|23-72/312|GGGGGGSGGGGCGFFGGGGSGGGSSGSGCGYSGGGGYSGGGCGGGSSGGG/G|ggTGGCGGTGGCGGCGGCAGCGGCGGTGGTGGCTGCGGCTTCTTCGGCGGCGGCGGCTCAGGGGGCGGTAGCAGCGGTTCTGGCTGCGGCTACTCCGGCGGCGGTGGCTACTCTGGCGGCGGCTGCGGCGGGGGCTCCTCCGGCGGCGGg/ggg||1||1||1|EntrezGene|||||||8|||||||||||||||||||||VUS|0.62841135||||||||AD||||||||||||||||||||||keep|phix&filter&is_cnv&artefact&hla&gnomad_af&gnomad_hn&spliceAI&capice&exit_keep   GT:AD:DP:GQ:RNC:SB:VI:VIC:VID:VIG:VIM   0/1:.:22:44:..:.,.,.,.:AD:.:1:4014:1    0/0:0,0:22:65   0/0:4,0:11:52
--> index: GT=1/0,AD=.

Personally I would prefer option 3., in my opinion a general purpose tool should strife for correctness. For that same reason I'm not in favor of 2. and 4. since this is actually valid VCF that has it use cases. I agree that 1. is practical, it seems to have worked out well over the last couple of years :)

How about these options:

  1. Replace '.' with 0, similar to the GATK death of the dot solution
  2. Implement 3. and add an option for backwards compatible behavior

Thank you for your great tools, highly appreciated!

Greetings, @dennishendriksen