samtools / htslib

C library for high-throughput sequencing data formats

Decompression threads for `bgzip`ped FASTA files #1638

Closed oleksii-nikolaienko closed 11 months ago

oleksii-nikolaienko commented 1 year ago

Hi, I am using the faidx.h functions to load a compressed human genome into memory. Currently, it does not seem possible to attach decompression threads to the file pointer inside the faidx_t structure, because the structure's definition is not exposed in the header file. For example, this code won't compile:

char fn[] = "/path/to/bgzipped.fasta.fa.gz";
int nthreads = 1;

faidx_t *faidx = fai_load(fn);
hts_tpool *tpool = hts_tpool_init(nthreads);
/* fails to compile: faidx_t is an incomplete type in faidx.h, so ->bgzf is not visible */
bgzf_thread_pool(faidx->bgzf, tpool, 0);

The unsavoury hack bgzf_thread_pool(*(BGZF **)faidx, tpool, 0); does compile, and it makes reading 30-40% faster with a single additional thread (please check this discussion).
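
To illustrate why the cast works and why it is fragile, here is a minimal sketch using hypothetical stand-in types (mock_bgzf, mock_faidx and the mock_* functions are made up for this example and are not htslib code). It assumes the file pointer is the first member of the private structure, which is presumably what lets the hack work at all; C guarantees that a pointer to a struct, suitably converted, points to its first member.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for BGZF; it only tracks a worker count. */
typedef struct mock_bgzf {
    int n_threads;
} mock_bgzf;

/* "Public header" view: callers see only an incomplete type. */
typedef struct mock_faidx mock_faidx;
mock_faidx *mock_fai_load(void);
int mock_fai_threads(const mock_faidx *fai);
void mock_fai_destroy(mock_faidx *fai);

/* Caller-side hack. Writing fai->bgzf here would not compile, because
 * struct mock_faidx is still incomplete at this point in the file. */
mock_bgzf *hack_get_bgzf(mock_faidx *fai)
{
    assert(fai != NULL);
    /* Reinterpret the handle as a pointer to its first member. */
    return *(mock_bgzf **)fai;
}

/* "Library source" view: the private definition. */
struct mock_faidx {
    mock_bgzf *bgzf;   /* the hack silently breaks if this stops being first */
    int n_seqs;
};

mock_faidx *mock_fai_load(void)
{
    mock_faidx *fai = calloc(1, sizeof(*fai));
    if (!fai) return NULL;
    fai->bgzf = calloc(1, sizeof(*fai->bgzf));
    if (!fai->bgzf) { free(fai); return NULL; }
    return fai;
}

int mock_fai_threads(const mock_faidx *fai)
{
    return fai->bgzf->n_threads;
}

void mock_fai_destroy(mock_faidx *fai)
{
    if (!fai) return;
    free(fai->bgzf);
    free(fai);
}
```

Nothing in the public declarations promises that layout, so a library update that reorders the members would turn the cast into undefined behaviour without any compiler warning.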

I don't know whether the gain is large enough, but would you consider opening up the faidx_t structure to allow quicker decompression?

daviesrob commented 1 year ago

It wouldn't be too hard to add an API to let you officially thread faidx. We'll look into it.
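
As a rough sketch of what such an API could look like, the example below keeps the index type opaque and adds one forwarding function. Every name in it (mock_tpool, mock_bgzf, mock_faidx, mock_fai_thread_pool and the other mock_* functions) is hypothetical and chosen for illustration; it is not the htslib implementation, only the general opaque-handle pattern.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the thread pool and the compressed file handle. */
typedef struct mock_tpool { int n_workers; } mock_tpool;
typedef struct mock_bgzf  { mock_tpool *pool; int qsize; } mock_bgzf;

/* "Public header" view: the index stays opaque, plus one new entry point. */
typedef struct mock_faidx mock_faidx;
mock_faidx *mock_fai_load(void);
void mock_fai_destroy(mock_faidx *fai);
int mock_fai_thread_pool(mock_faidx *fai, mock_tpool *pool, int qsize);
int mock_fai_workers(const mock_faidx *fai);

/* "Library source" view: member order can change freely, since no
 * caller depends on the layout any more. */
struct mock_faidx {
    int n_seqs;
    mock_bgzf *bgzf;
};

/* Attach a pool to the file handle; returns 0 on success, -1 on bad input. */
static int mock_bgzf_thread_pool(mock_bgzf *fp, mock_tpool *pool, int qsize)
{
    assert(qsize >= 0);  /* in this mock, 0 stands for "default queue size" */
    if (!fp || !pool) return -1;
    fp->pool = pool;
    fp->qsize = qsize;
    return 0;
}

/* The wrapper: forwards to the file-level call without exposing the struct. */
int mock_fai_thread_pool(mock_faidx *fai, mock_tpool *pool, int qsize)
{
    if (!fai) return -1;
    return mock_bgzf_thread_pool(fai->bgzf, pool, qsize);
}

/* Number of workers currently attached, or 0 if none. */
int mock_fai_workers(const mock_faidx *fai)
{
    return (fai && fai->bgzf && fai->bgzf->pool) ? fai->bgzf->pool->n_workers : 0;
}

mock_faidx *mock_fai_load(void)
{
    mock_faidx *fai = calloc(1, sizeof(*fai));
    if (!fai) return NULL;
    fai->bgzf = calloc(1, sizeof(*fai->bgzf));
    if (!fai->bgzf) { free(fai); return NULL; }
    return fai;
}

void mock_fai_destroy(mock_faidx *fai)
{
    if (!fai) return;
    free(fai->bgzf);
    free(fai);
}
```

With a wrapper of this shape, the caller keeps ownership of the pool and would pass the opaque handle instead of casting it.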

oleksii-nikolaienko commented 12 months ago

Thanks!