sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0

partition_into_regions: CSI indexed BCFs return empty regions #1202

Open jeromekelleher opened 4 months ago

jeromekelleher commented 4 months ago

Because BCF CSI indexes store information for every contig listed in the header, we need to filter out regions that have zero counts, as is done here: https://github.com/jeromekelleher/bio2zarr/blob/880c3afee4465b4b94b921c815d436f3e4a78a46/bio2zarr/vcf_utils.py#L510

Returning regions for empty contigs is reasonably harmless when there are only a few of them, but not when the header lists thousands of contigs, which is common.
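
A rough sketch of the kind of filtering meant above, not the actual sgkit or bio2zarr code. It assumes the contig names and per-contig record counts have already been read from the index into two parallel sequences; `drop_empty_contigs`, `contig_names` and `record_counts` are hypothetical names used only for illustration.

```python
from typing import List, Sequence, Tuple


def drop_empty_contigs(
    contig_names: Sequence[str], record_counts: Sequence[int]
) -> List[Tuple[str, int]]:
    """Return (name, count) pairs for contigs with at least one record.

    Hypothetical helper: the inputs are assumed to be parallel sequences
    already extracted from the index, one entry per contig in the header.
    """
    if len(contig_names) != len(record_counts):
        raise ValueError("contig_names and record_counts must have equal length")
    return [
        (name, count)
        for name, count in zip(contig_names, record_counts)
        if count > 0  # skip contigs that are in the header but have no records
    ]


# Made-up example: three contigs in the header, one without any records.
names = ["chr1", "chr2", "scaffold_0001"]
counts = [1200, 0, 35]
kept = drop_empty_contigs(names, counts)
print(kept)  # [('chr1', 1200), ('scaffold_0001', 35)]
```

Doing this before regions are generated means only contigs that contain records are considered, so a header with thousands of unused contigs would not produce thousands of empty regions.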