samtools / htslib

C library for high-throughput sequencing data formats

feature request: parallelize tabix #1735

Closed dbolser closed 3 months ago

dbolser commented 5 months ago

I'd like to index a bed file (tbi) using bcftools (because of the ability to use multiple threads).

Is this a feature that you could easily support?

Many thanks, Dan.

pd3 commented 5 months ago

Thank you for the inquiry, but no, it's not something bcftools aspires to do. The right tool to use for this is tabix, provided by htslib.

dbolser commented 5 months ago

Tabix is great, but it's not parallel.

dbolser commented 5 months ago

Thanks for transferring it here. I thought of bcftools because you already have parallel indexing of BCF files and parsers for BED, but I guess BCF is nothing like bgzip'ed BED.

jkbonfield commented 5 months ago

Is the issue here simply the multi-threaded bgzf decoding? I admit I assumed we'd got that enabled for everything, but apparently tabix doesn't do it. I'm not sure why: either we simply missed it, or perhaps it was tested and gave no major benefit (e.g. the primary CPU burden is elsewhere).
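
For context, here is a minimal sketch of why BGZF decoding can be threaded at all. A BGZF file is a series of independent gzip members, each recording its own compressed size (BSIZE) in a "BC" extra subfield, so block boundaries can be found without inflating anything and the blocks can be handed to worker threads. This is an illustration only, not htslib's implementation. All function names are hypothetical, and the sketch assumes "BC" is the only extra subfield, which is the layout it writes itself.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def bgzf_block(payload: bytes) -> bytes:
    """Wrap one chunk of data in a single BGZF block (gzip member + 'BC' subfield)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw deflate
    cdata = comp.compress(payload) + comp.flush()
    bsize = 18 + len(cdata) + 8 - 1  # total block size minus 1
    # Fixed 18-byte header: gzip magic, CM=8, FLG=4 (extra field), MTIME, XFL,
    # OS, XLEN=6, then subfield 'B','C', SLEN=2, BSIZE.
    header = struct.pack("<4BI2BH2BHH", 31, 139, 8, 4, 0, 0, 255, 6, 66, 67, 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + cdata + trailer


def bgzf_compress(data: bytes, block_size: int = 0xFF00) -> bytes:
    """Compress data as consecutive BGZF blocks of at most block_size input bytes."""
    return b"".join(
        bgzf_block(data[i:i + block_size]) for i in range(0, len(data), block_size)
    )


def split_blocks(buf: bytes):
    """Yield each block using only BSIZE from its header (assumes 'BC' comes first)."""
    pos = 0
    while pos < len(buf):
        bsize = struct.unpack_from("<H", buf, pos + 16)[0] + 1
        yield buf[pos:pos + bsize]
        pos += bsize


def inflate_block(block: bytes) -> bytes:
    """Inflate one block and check it against its CRC32 and ISIZE trailer."""
    data = zlib.decompress(block[18:-8], -15)
    crc, isize = struct.unpack("<II", block[-8:])
    if zlib.crc32(data) != crc or len(data) != isize:
        raise ValueError("corrupt BGZF block")
    return data


def decompress_parallel(buf: bytes, threads: int = 4) -> bytes:
    """Inflate blocks on a thread pool; pool.map keeps them in file order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return b"".join(pool.map(inflate_block, split_blocks(buf)))


if __name__ == "__main__":
    # Synthetic BED-like input, large enough to span several blocks.
    bed = b"".join(b"chr1\t%d\t%d\n" % (i, i + 100) for i in range(200000))
    packed = bgzf_compress(bed)
    print(len(list(split_blocks(packed))), "blocks")
    print("round trip ok:", decompress_parallel(packed) == bed)
```

Whether threading like this would actually speed up tabix indexing depends on where the time goes. If parsing lines or reading from disk dominates, parallel inflation alone may not help much.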

dbolser commented 5 months ago

I could imagine it's IO bound... perhaps...