sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

partition_into_regions: CSI indexed VCFs have incorrect sequence names #1201

Open jeromekelleher opened 7 months ago

jeromekelleher commented 7 months ago

The sequence names for CSI-indexed VCFs must be derived from the index itself, because sequence names in an indexed VCF refer to the sequences actually observed in the file, not those listed in the header. The correct logic (I hope) is here:

https://github.com/jeromekelleher/bio2zarr/blob/880c3afee4465b4b94b921c815d436f3e4a78a46/bio2zarr/vcf_utils.py#L400

Some tests that should be straightforward to port to sgkit are here: https://github.com/jeromekelleher/bio2zarr/blob/880c3afee4465b4b94b921c815d436f3e4a78a46/tests/test_vcf_utils.py#L21
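
To illustrate what "derived from the index itself" could look like, here is a minimal sketch. It is not the linked bio2zarr code. It assumes that the CSI `aux` block of a text VCF carries a tabix-style header: seven little-endian int32 fields (the last being the length of the names block) followed by NUL-terminated sequence names. The function name `parse_tabix_style_aux` and the synthetic `aux` bytes are hypothetical and used only for this example; reading the `aux` block out of a real `.csi` file is not shown.

```python
import struct

# Assumed layout of the tabix-style header stored in the CSI aux block:
# format, col_seq, col_beg, col_end, meta, skip, l_nm (seven little-endian int32)
TABIX_HEADER_FMT = "<7i"
TABIX_HEADER_SIZE = struct.calcsize(TABIX_HEADER_FMT)


def parse_tabix_style_aux(aux: bytes) -> list[str]:
    """Return the sequence names recorded in a tabix-style aux block.

    Hypothetical helper: the names come from the index, so they list only
    the sequences the indexer saw, not every contig in the VCF header.
    """
    if len(aux) < TABIX_HEADER_SIZE:
        raise ValueError("aux block too short to hold a tabix-style header")
    # Only the last header field (length of the names block) is needed here.
    *_, l_nm = struct.unpack_from(TABIX_HEADER_FMT, aux, 0)
    names_block = aux[TABIX_HEADER_SIZE:TABIX_HEADER_SIZE + l_nm]
    if len(names_block) != l_nm:
        raise ValueError("aux block is truncated")
    # Names are concatenated and NUL-terminated; drop the empty trailing piece.
    return [name.decode("ascii") for name in names_block.split(b"\x00") if name]


# Synthetic aux block for a file whose header might declare chr1, chr2 and
# chr3, but in which only chr1 and chr3 have records.
names = b"chr1\x00chr3\x00"
aux = struct.pack(TABIX_HEADER_FMT, 2, 1, 2, 0, ord("#"), 0, len(names)) + names
observed = parse_tabix_style_aux(aux)
```

In this synthetic case the header-declared `chr2` never appears in `observed`, which is the mismatch the issue describes: indexing the header's contig list by the index's sequence numbers would assign the wrong names.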