vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Autoindex should parse tabix-indexed monolithic VCFs in parallel #4277

Open jeizenga opened 2 months ago

jeizenga commented 2 months ago

We've had a few users complain that autoindex's VCF chunking step is excessively slow when the VCF is provided as a single file covering all chromosomes (e.g. https://github.com/vgteam/vg/issues/4274). The slowdown comes from a single-threaded linear scan over the VCF, which parcels it out into chunks that are only afterwards processed in parallel. If the VCF is tabix-indexed, it should be possible to chunk it in parallel across chromosomes, which would alleviate this issue.
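
To illustrate the idea, here is a rough sketch of per-chromosome chunking driven by the tabix index. It is not how autoindex is implemented: it is a Python stand-in that shells out to the `tabix` command line tool (`tabix -l` to list the indexed sequence names, `tabix -h` to extract the header plus one region). The function names `list_contigs`, `extract_contig`, and `chunk_by_contig`, the `threads` parameter, and the one-file-per-contig output layout are all hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def list_contigs(vcf_path):
    """Return the sequence names recorded in the tabix index (`tabix -l`)."""
    result = subprocess.run(
        ["tabix", "-l", str(vcf_path)],
        check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def extract_contig(vcf_path, contig, out_path):
    """Write the VCF header plus one contig's records (`tabix -h`)."""
    with open(out_path, "wb") as out:
        subprocess.run(
            ["tabix", "-h", str(vcf_path), contig],
            check=True, stdout=out,
        )


def chunk_by_contig(vcf_path, out_dir, threads=4,
                    list_fn=list_contigs, extract_fn=extract_contig):
    """Split an indexed VCF into one uncompressed VCF per contig.

    Each contig is an independent index lookup, so the extractions can run
    concurrently instead of waiting on one linear scan of the whole file.
    Assumes contig names are safe to use as file names.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    targets = {c: out_dir / f"{c}.vcf" for c in list_fn(vcf_path)}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(extract_fn, vcf_path, contig, path)
            for contig, path in targets.items()
        ]
        for future in futures:
            future.result()  # re-raise any failure from a worker
    return targets
```

Threads are enough here because the actual work happens in the `tabix` subprocesses. A real implementation inside vg would more likely query the index directly through htslib, and would need to handle contig names that are not valid file names or that are ambiguous as region strings.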