sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0
231 stars 32 forks source link

partition_into_regions returns empty region for all contigs in header #1200

Closed jeromekelleher closed 7 months ago

jeromekelleher commented 7 months ago

This line adds an empty region for every contig that's in the VCFs header in partition_into_regions. However, it's common for VCFs to have contigs declared in the header that are not in the file at all. For example, recent 1000 Genomes data declares all contigs in all VCF files (there are thousands). This generates a lot of noise in the returned region strings.

What was the rationale for returning the empty regions @tomwhite?

jeromekelleher commented 7 months ago

Disabling doesn't break any tests as it was pragma: no covered

jeromekelleher commented 7 months ago

Possible explanation for #1169 ?

tomwhite commented 7 months ago

What was the rationale for returning the empty regions @tomwhite?

Not sure. It seems it can be removed though.

jeromekelleher commented 7 months ago

Digging deeper, it's not quite as simple as this. Turn out that all three of CSI indexed BCF, CSI indexed VCF and tabix indexed VCF need to be treated slightly differently in these cases in which there are multiple contigs defined in the header :vomiting_face:

jeromekelleher commented 7 months ago

Closing in favour of #1202 (more precisely specified)