sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0
217 stars 32 forks source link

Errors in missing vs pad values in VCF #1190

Closed jeromekelleher closed 4 months ago

jeromekelleher commented 5 months ago

Through developing the alternative implementation of vcf-to-zarr conversion in #1185 I think there's some bugs in how we're currently handling missing data. Opening this PR for discussion purposes.

There's some related issues around handling mixed ploidy calls, and string missing values, but let's leave those alone for now.

I realise now that the extra fields I added in to the simple test here duplicates later tests - but they're handy for discussion here for now anyway.

jeromekelleher commented 5 months ago

Basically I think we're returning FILL values when we should be returning missing. Consider INFO/NS in the example:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
19      111     .       A       C       9.6     .       .       GT:HQ   0|0:10,15       0|0:10,10       0/1:3,3
19      112     .       A       G       10      .       .       GT:HQ   0|0:10,10       0|0:10,10       0/1:3,3
20      14370   rs6054257       G       A       29      PASS    NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ     0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20      17330   .       T       A       3       q10     NS=3;DP=11;AF=0.017     GT:GQ:DP:HQ     0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3:.,.
20      1110696 rs6040355       A       G,T     67      PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB       GT:GQ:DP:HQ     1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4:.,.
20      1230237 .       T       .       47      PASS    NS=3;DP=13;AA=T GT:GQ:DP:HQ     0|0:54:.:56,60  0|0:48:4:51,51  0/0:61:2:.,.
20      1234567 microsat1       G       GA,GAC  50      PASS    NS=3;DP=9;AA=G;AN=6;AC=3,1      GT:GQ:DP        0/1:.:4 0/2:17:2        ./.:40:3
20      1235237 .       T       .       .       .       .       GT      0/0     0|0     ./.
X       10      rsTest  AC      A,ATG,C 10      PASS    .       GT      0       0/1     0|2

We're currently returning [-2, -2, 3, 3, 2, 3, 3, -2, -2] when it should be [-1, -1, 3, 3, 2, 3, 3, -1, -1]. If the NS info field is not present, then surely that's interpreted as missing data not as a missing dimension in the current data. Dimension padding is a special case, when data is present, but the current row has dimension smaller then the overall column.

That's my interpretation anyway - what do you think @tomwhite?

I'll fix up the tests if we agree there is a bug, but that's the essence of it.

tomwhite commented 5 months ago

Thanks for giving an example @jeromekelleher. I think your interpretation is correct. The NS values are simply missing, not the end of a vector that needs padding/filling.

timothymillar commented 4 months ago

@jeromekelleher your interpretation looks correct to me. We should also test for a case where -2 should appear in an INFO field (e.g., with length A or R):

##INFO=<ID=IDX,Number=R,Type=Integer,Description="Index of each allele">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO          FORMAT        SAMPLE1        SAMPLE2
1       100     .       A       C,T     .       .       IDX=0,1,2     GT:HQ         0|1            0|2 
1       200     .       A       C       .       .       IDX=0,1       GT:HQ         0|0            0|1

With max_alt_alleles=2, the IDX field should correspond to the (variants * alleles) array:

[
    [0, 1,  2],
    [0, 1, -2],
]
jeromekelleher commented 4 months ago

I think this is ready for a look now. There is a little duplication in the tests I've added for the sample VCF and existing ones which have been fixed up, but I think that's OK.

The change is a pretty noisy one I'm afraid, as there's been a few downstream things broken as well (#1195, #1196). I've temporarily skipped those tests to make this more manageable.

jeromekelleher commented 4 months ago

Sigh - skipping those tests pushes the required coverage below 100% so the build still fails.

Any suggestions here? Temporarily push coverage down, and create an issue to track setting it back to 100?

tomwhite commented 4 months ago

Sigh - skipping those tests pushes the required coverage below 100% so the build still fails.

Any suggestions here? Temporarily push coverage down, and create an issue to track setting it back to 100?

I'm OK allowing coverage to dip as long as there is a path to get it back to 100%.

jeromekelleher commented 4 months ago

Thanks @tomwhite - I'm happy to go with your preferred option. If we can fix the other problems fairly easily in this PR then that would be simplest. I'm just worried about doing too many changes at once.

jeromekelleher commented 4 months ago

OK, this is ready for another look @tomwhite. I've temporarily worked around some fiddly corner cases - hopefully we'll still have 100% coverage.

1197 is a tricky one, but I don't think it's worth getting hung up on.

jeromekelleher commented 4 months ago

OK! Looks like it's ready for a final look now @tomwhite.

@timothymillar, this is a reasonably big change so would be good to get your eyes on it as well if possible.