sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

vcf_writer emitting incorrect missing values for INFO fields #1196

Closed jeromekelleher closed 4 months ago

jeromekelleher commented 4 months ago

After correcting the missing/fill bug (#1192) in #1190, the vcf_writer round-trip tests fail. We get:

$ python3 -m pytest -vs sgkit/tests/io/vcf/test_vcf_roundtrip.py::test_vcf_to_zarr_to_vcf__real_files[sample.vcf.gz-None-True]
                ), f"INFO keys not equal for variants\n{v1}{v2}"
E               AssertionError: INFO keys not equal for variants
E                 19    111     .       A       C       9.6     .       .       GT:HQ   0|0:10,15       0|0:10,10       0/1:3,3
E                 19    111     .       A       C       9.6     .       NS=.;AN=.;AC=.,.;DP=.;AF=.,.;AA=.       GT:GQ:DP:HQ     0|0:.:.:10,15   0|0:.:.:10,10   0/1:.:.:3,3
E                 
E               assert dict_keys([]) == dict_keys(['N..., 'AF', 'AA'])
E                 Full diff:
E                 - dict_keys(['NS', 'AN', 'AC', 'DP', 'AF', 'AA'])
E                 + dict_keys([])

It looks like the code depends on seeing a FILL value, rather than MISSING, to decide when to skip a VCF INFO key?
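To illustrate the suspected behaviour, here is a minimal sketch, not sgkit's actual writer code. The sentinel values `INT_MISSING = -1` and `INT_FILL = -2` and the helper `format_info` are assumptions made up for this example. It shows how a writer that only skips keys whose values are all FILL would start emitting `KEY=.` once absent fields are stored as MISSING instead:

```python
# Hypothetical sentinels for this sketch (assumed values, not taken from the source).
INT_MISSING = -1  # value present in the record but unknown, written as "."
INT_FILL = -2     # padding for an absent field or unused array slot


def format_info(info, skip_sentinels):
    """Render an INFO column from a dict of key -> list of ints.

    A key is skipped entirely when all of its values are in skip_sentinels.
    """
    parts = []
    for key, values in info.items():
        if all(v in skip_sentinels for v in values):
            continue  # treat the key as absent from this record
        rendered = ",".join(
            "." if v == INT_MISSING else str(v)
            for v in values
            if v != INT_FILL  # drop padding slots
        )
        parts.append(f"{key}={rendered}")
    return ";".join(parts) if parts else "."


# A record whose INFO fields were absent in the input, but stored as MISSING.
record = {"NS": [INT_MISSING], "AC": [INT_MISSING, INT_MISSING]}

print(format_info(record, {INT_FILL}))               # NS=.;AC=.,.
print(format_info(record, {INT_FILL, INT_MISSING}))  # .
```

With only FILL treated as "skip", the output has the same shape as the failing round trip above. Whether the writer should also skip all-MISSING keys, or the reader should store FILL for absent keys, is a separate design question.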