Closed jeromekelleher closed 4 months ago
Basically I think we're returning FILL values when we should be returning missing. Consider INFO/NS in the example:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
19 111 . A C 9.6 . . GT:HQ 0|0:10,15 0|0:10,10 0/1:3,3
19 112 . A G 10 . . GT:HQ 0|0:10,10 0|0:10,10 0/1:3,3
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3:.,.
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4:.,.
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:.:56,60 0|0:48:4:51,51 0/0:61:2:.,.
20 1234567 microsat1 G GA,GAC 50 PASS NS=3;DP=9;AA=G;AN=6;AC=3,1 GT:GQ:DP 0/1:.:4 0/2:17:2 ./.:40:3
20 1235237 . T . . . . GT 0/0 0|0 ./.
X 10 rsTest AC A,ATG,C 10 PASS . GT 0 0/1 0|2
We're currently returning [-2, -2, 3, 3, 2, 3, 3, -2, -2]
when it should be [-1, -1, 3, 3, 2, 3, 3, -1, -1]
. If the NS info field is not present, then surely that's interpreted as missing data not as a missing dimension in the current data. Dimension padding is a special case, when data is present, but the current row has dimension smaller then the overall column.
That's my interpretation anyway - what do you think @tomwhite?
I'll fix up the tests if we agree there is a bug, but that's the essence of it.
Thanks for giving an example @jeromekelleher. I think your interpretation is correct. The NS values are simply missing, not the end of a vector that needs padding/filling.
@jeromekelleher your interpretation looks correct to me. We should also test for a case where -2
should appear in an INFO field (e.g., with length A
or R
):
##INFO=<ID=IDX,Number=R,Type=Integer,Description="Index of each allele">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
1 100 . A C,T . . IDX=0,1,2 GT:HQ 0|1 0|2
1 200 . A C . . IDX=0,1 GT:HQ 0|0 0|1
With max_alt_alleles=2
, the IDX field should correspond to the (variants * alleles) array:
[
[0, 1, 2],
[0, 1, -2],
]
I think this is ready for a look now. There is a little duplication in the tests I've added for the sample VCF and existing ones which have been fixed up, but I think that's OK.
The change is a pretty noisy one I'm afraid, as there's been a few downstream things broken as well (#1195, #1196). I've temporarily skipped those tests to make this more manageable.
Sigh - skipping those tests pushes the required coverage below 100% so the build still fails.
Any suggestions here? Temporarily push coverage down, and create an issue to track setting it back to 100?
Sigh - skipping those tests pushes the required coverage below 100% so the build still fails.
Any suggestions here? Temporarily push coverage down, and create an issue to track setting it back to 100?
I'm OK allowing coverage to dip as long as there is a path to get it back to 100%.
Thanks @tomwhite - I'm happy to go with your preferred option. If we can fix the other problems fairly easily in this PR then that would be simplest. I'm just worried about doing too many changes at once.
OK, this is ready for another look @tomwhite. I've temporarily worked around some fiddly corner cases - hopefully we'll still have 100% coverage.
OK! Looks like it's ready for a final look now @tomwhite.
@timothymillar, this is a reasonably big change so would be good to get your eyes on it as well if possible.
Through developing the alternative implementation of vcf-to-zarr conversion in #1185 I think there's some bugs in how we're currently handling missing data. Opening this PR for discussion purposes.
There's some related issues around handling mixed ploidy calls, and string missing values, but let's leave those alone for now.
I realise now that the extra fields I added in to the simple test here duplicates later tests - but they're handy for discussion here for now anyway.