samtools / htslib

C library for high-throughput sequencing data formats

High memory usage with faidx API #1783

Closed tijyojwad closed 1 month ago

tijyojwad commented 1 month ago

Hello,

I am building a tool that needs random access to the sequences and qualities of various read IDs in a large FASTQ file (~350 GB); the index for that file is ~1 GB. The program is multithreaded, and each thread has roughly the following structure:

void thread_fn() {
    faidx_t* index = ... // create index for fastq file
    while (true) {
        // logic
        // use faidx index APIs fai_fetch and fai_fetchqual to get sequence and qualities from random read ids
        // more logic
        if (end_condition) {
            break;
        }
    }
    fai_destroy(index);
}

This thread function runs in 12-16 threads.

However, I'm noticing a huge memory footprint when I use faidx: usage climbs past 300 GB and then eats into swap space before the process is killed.

I am also freeing the seq and qual pointers returned by the APIs. What could be causing this? Am I using the faidx API in an anti-pattern? Is there any way to limit the faidx memory footprint?
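For context on what an indexed lookup involves: a FASTQ `.fai` line stores, per read, NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH and QUALOFFSET, so fetching one read only needs a seek and a read of that read's own bytes. The sketch below illustrates that idea with the C standard library only. It is not htslib's implementation, it assumes an uncompressed FASTQ whose sequence and quality each sit on a single line, and `fai_entry`, `parse_fai_line` and `fetch_span` are hypothetical names introduced for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical struct mirroring the six columns of a FASTQ .fai line. */
typedef struct {
    char name[256];   /* NAME: read id */
    long length;      /* LENGTH: number of bases */
    long offset;      /* OFFSET: byte offset of the first base */
    long linebases;   /* LINEBASES: bases per line */
    long linewidth;   /* LINEWIDTH: bytes per line, including newline */
    long qualoffset;  /* QUALOFFSET: byte offset of the first quality char */
} fai_entry;

/* Parse one tab-separated .fai line. Returns 0 on success, -1 otherwise. */
int parse_fai_line(const char *line, fai_entry *e) {
    int n = sscanf(line, "%255s %ld %ld %ld %ld %ld",
                   e->name, &e->length, &e->offset,
                   &e->linebases, &e->linewidth, &e->qualoffset);
    return n == 6 ? 0 : -1;
}

/* Read `length` bytes starting at byte `offset` into a new NUL-terminated
 * buffer. Assumes the span does not cross a line break (single-line record).
 * The caller owns the result and must free() it. Returns NULL on failure. */
char *fetch_span(FILE *fp, long offset, long length) {
    if (length < 0) return NULL;
    char *buf = malloc((size_t)length + 1);
    if (!buf) return NULL;
    if (fseek(fp, offset, SEEK_SET) != 0 ||
        fread(buf, 1, (size_t)length, fp) != (size_t)length) {
        free(buf);
        return NULL;
    }
    buf[length] = '\0';
    return buf;
}
```

Under these assumptions, the memory held per lookup is one sequence buffer and one quality buffer, each the size of a single read, and both are released as soon as the caller frees them.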

tijyojwad commented 1 month ago

My apologies, the issue turned out to be in another part of the code, which was causing a large number of reads to be loaded and making memory balloon. Nothing on the faidx side.
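One low-tech way to find which stage of a program is responsible for growth like this is to log the process's peak resident set size around each suspect section. A minimal sketch, assuming a POSIX system: `peak_rss` and `log_peak_rss` are hypothetical helper names, and `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Peak resident set size of this process so far, or -1 on error.
 * Units are platform dependent: kilobytes on Linux, bytes on macOS. */
long peak_rss(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    return ru.ru_maxrss;
}

/* Print the current peak RSS to stderr with a caller-supplied label,
 * for example before and after the stage that loads reads. */
void log_peak_rss(const char *label) {
    fprintf(stderr, "%s: peak RSS %ld\n", label, peak_rss());
}
```

Calling `log_peak_rss` before and after each stage of the thread function shows which stage the peak jumps in, without needing a profiler.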