ncbi / sra-tools


Lowering (or limiting) memory consumption for fastq-dump #424

Closed: luizirber closed this issue 3 years ago

luizirber commented 3 years ago

Hello,

I've been seeing large memory consumption (1467 MB for SRR6399464, for example) when downloading data using the following command: fastq-dump --disable-multithreading --fasta 0 --skip-technical --readids --read-filter pass --dumpbase --split-spot --clip -Z SRA_ID

Do you have any tips on how to lower it, or limit it for long-running processes?

For my use case, I don't need paired reads (I need the read content, but the reads don't need to be paired), and they don't need to be in any specific order. I'm also streaming the data to another process, and I would like to minimize temporary disk space usage (so I'm intentionally not using fasterq-dump).

Thanks, Luiz

durbrow commented 3 years ago

SRR6399464 is aligned-to-human data; at the very least, a dumper will need to pull in the human genome to reconstruct the reads. Whether to disk or to RAM, it must be put somewhere. Increasingly our tools put it in RAM and let the OS sort it out. FYI, SRR6399464 is a 10GB sra file with about 151 million reads (of 151 bases each).

Generally, we don't recommend the use of fastq-dump for aligned data. It is our oldest tool, by and large written before aligned data sets were being submitted to SRA.

luizirber commented 3 years ago

SRR6399464 is aligned-to-human data; at the very least, a dumper will need to pull in the human genome to reconstruct the reads. Whether to disk or to RAM, it must be put somewhere. Increasingly our tools put it in RAM and let the OS sort it out. FYI, SRR6399464 is a 10GB sra file with about 151 million reads (of 151 bases each).

I see, thanks! I've seen datasets taking 9GB+ of memory, but I don't have one running at the moment to report.

I've been using this query to select metagenomes from the SRA. Do you have any suggestions on how to figure out when a dataset needs to pull in a genome to recover reads from an alignment?

Generally, we don't recommend the use of fastq-dump for aligned data. It is our oldest tool, by and large written before aligned data sets were being submitted to SRA.

Hmm, but the alternatives (prefetch, fasterq-dump) require temporary disk space, which increases the management I have to do in pipelines to clean it up after use (I don't need the data after I've processed it once). The processing I do is lightweight in memory and CPU, but I can't run it on smaller machines because it either requires too much memory (fastq-dump) or more temporary storage (prefetch, fasterq-dump). Do you have any suggestions for downloading SRA datasets on resource-frugal instances?

Thanks!

durbrow commented 3 years ago

It depends on how much work you have to do and how much you want to put into setting it up. Generally the strategies all depend on intelligently batching things and amortizing the costs across batches of similar sets. For example, you would divide the aligned ones from the unaligned ones and start processing the unaligned ones using e.g. fastq-dump. Divide the aligned ones into sets depending on the reference sequences they were aligned to, prefetch the set of references, and sam-dump or fasterq-dump.
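Purely as an illustration, a minimal shell sketch of that batching idea, assuming accession lists have already been prepared; the file names, the ./process placeholder, and the use of align-info to detect aligned runs are assumptions for this sketch, not part of the recommendation above:

# Illustrative only: split accessions into aligned / unaligned lists.
# align-info reports the references an aligned run uses; treating empty
# output as "unaligned" is an assumption for this sketch.
while read acc; do
    if [ -n "$(align-info "$acc" 2>/dev/null)" ]; then
        echo "$acc" >> aligned.txt
    else
        echo "$acc" >> unaligned.txt
    fi
done < accessions.txt

# Unaligned runs: stream straight through fastq-dump, no reference needed.
while read acc; do
    fastq-dump --fasta 0 --skip-technical --split-spot --clip -Z "$acc" | ./process
done < unaligned.txt

# Aligned runs: prefetch the references shared by this batch once,
# then dump each run and clean up its temporary FASTQ afterwards.
prefetch $(cat references_for_this_batch.txt)
while read acc; do
    fasterq-dump "$acc" && ./process "$acc"*.fastq && rm -f "$acc"*.fastq
done < aligned.txt

The point of the grouping is only that the reference download cost is paid once per batch rather than once per run.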

And if you REALLY need to process a huge ton of SRA data (some people do!) and you want speed and frugality and are willing to get your hands dirty with some programming, we have API level ways to access the data faster and more precisely than the command line tools can provide.

luizirber commented 3 years ago

It depends on how much work you have to do and how much you want to put into setting it up. Generally the strategies all depend on intelligently batching things and amortizing the costs across batches of similar sets. For example, you would divide the aligned ones from the unaligned ones and start processing the unaligned ones using e.g. fastq-dump. Divide the aligned ones into sets depending on the reference sequences they were aligned to, prefetch the set of references, and sam-dump or fasterq-dump.

Good suggestions, thanks! I like the current simplicity of just receiving an SRA Run ID and processing it, but if that's the only option I can adapt and split the work into different queues and groups.

And if you REALLY need to process a huge ton of SRA data (some people do!) and you want speed and frugality and are willing to get your hands dirty with some programming, we have API level ways to access the data faster and more precisely than the command line tools can provide.

Hmm, I do want to process a ton =] I have 2M+ datasets already processed, but they are the low-hanging fruit (small datasets, easy to parallelize across hundreds of jobs with short runtimes). As soon as I get to larger datasets that require more memory and runtime, I run out of computational resources to pull the data (or, if moving to AWS, need to spend a lot of credits to spin up large instances to account for the memory consumption).

Do you have pointers to how to use the API to process data? Are the py-vdb examples viable, or are they too outdated? Should I go for the C++ API?

osris commented 3 years ago

Are you running pipelines in the cloud or on local hosts?

-Chris


luizirber commented 3 years ago

Are you running pipelines in the cloud or on local hosts?

Currently using HPCs (and any resource I can get my hands on =]). I tried moving to the cloud before, but resource consumption with fastq-dump was too high. Trying to solve this now, because at 2TB/day it is going to take too long to process all the data with my current resources.

durbrow commented 3 years ago

Do you have pointers to how to use the API to process data? Are the py-vdb examples viable, or are they too outdated? Should I go for the C++ API?

The python API is maintained, but I don't know about the examples. We use the python bindings for making tests, so we do keep it up-to-date and add stuff to it as needed.

We know someone who is using our APIs to process pretty much the entirety of SRA. https://github.com/ncbi/ngs/issues/27#issue-549840948 Although he is more interested in flat-out speed, he may have some useful experience to share with you.

luizirber commented 3 years ago

The python API is maintained, but I don't know about the examples. We use the python bindings for making tests, so we do keep it up-to-date and add stuff to it as needed.

Nice! I started a new recipe (complementing ncbi-vdb) to expose the Python API: https://github.com/bioconda/bioconda-recipes/pull/24649

We know someone who is using our APIs to process pretty much the entirety of SRA. ncbi/ngs#27 (comment) Although he is more interested in flat-out speed, he may have some useful experience to share with you.

Oh, cool! I will take a look.

durbrow commented 3 years ago

If you choose to go the Python route, you might want to get familiar with our vdb-dump tool, which lets you explore the contents of the database files. Of particular use are -E|--table_enum and -O|--column_enum; they can show you the tables, columns, and datatypes within any SRA data file. Generally speaking, for columns that aren't specific to a particular sequencing platform, their datatypes are constant across the entire SRA. Also try vdb-dump --info <SRR...>.

One caveat about vdb-dump: it is a development tool and a Swiss-army knife; it isn't fast and doesn't try to be.
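For concreteness, the invocations mentioned above look like this (the accession is only an example):

# General information about a run
vdb-dump --info SRR6399464

# Enumerate the tables in a run (short form of --table_enum)
vdb-dump -E SRR6399464

# Enumerate the columns and their datatypes (short form of --column_enum)
vdb-dump -O SRR6399464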

luizirber commented 3 years ago

If you choose to go the Python route, you might want to get familiar with our vdb-dump tool, which lets you explore the contents of the database files. Of particular use are -E|--table_enum and -O|--column_enum; they can show you the tables, columns, and datatypes within any SRA data file. Generally speaking, for columns that aren't specific to a particular sequencing platform, their datatypes are constant across the entire SRA. Also try vdb-dump --info <SRR...>.

One caveat about vdb-dump: it is a development tool and a Swiss-army knife; it isn't fast and doesn't try to be.

Those are great tips, thanks!

I used L10-fastq.py as a start for a script that reads and outputs sequences (it doesn't even check read names, because I don't need them) here, and I got the same results for fastq-dump --fasta 0 --skip-technical --readids --read-filter pass --dumpbase --split-spot --clip -Z ERR3256923 and python sra_fasta.py --split ERR3256923, but I also noticed that:

I've seen in some issue before (but don't remember which...) that --split-spot can end up using more memory because the paired reads are not necessarily near each other, and more data needs to be kept in memory. Is that true?

I also checked https://github.com/ncbi/ngs/issues/27#issuecomment-575356166 and it seems I can use multiprocessing in Python to pull data in parallel, but... is the Python API thread-safe for these use cases? If I have only one manager and start pulling data, do I risk corrupting it at the VDB level?

Thanks!

jgans commented 3 years ago

I would also recommend checking out ncbi/ncbi-vdb/issues/31, as that post sketches out how to read aligned reads from an SRA file with reduced memory consumption and increased speed.

Here is a C++ code snippet that illustrates the approach in more detail. Beware that this code does not extract partially aligned read pairs (i.e. when one read is aligned and the other read is unaligned) and only returns the reads in the order returned by the SRA toolkit API.

// NGS C++ API headers needed by this snippet
#include <ncbi-vdb/NGS.hpp>
#include <ngs/ReadCollection.hpp>
#include <ngs/Alignment.hpp>
#include <ngs/AlignmentIterator.hpp>
#include <ngs/Read.hpp>
#include <ngs/ReadIterator.hpp>

ngs::ReadCollection run( ncbi::NGS::openReadCollection("SRR6399464") );

// Step 1: Does the SRA record contain aligned reads?
const size_t num_primary_align = run.getAlignmentCount(ngs::Alignment::primaryAlignment);

if(num_primary_align > 0){

    // Yes, there are aligned reads
    ngs::AlignmentIterator align_iter = run.getAlignments(ngs::Alignment::primaryAlignment);

    while(align_iter.nextAlignment() ){

        // Do stuff with the sequence data in align_iter.getAlignedFragmentBases()
    }

    // Step 2: Read the unaligned sequences
    const size_t num_unaligned_read = run.getReadCount(ngs::Read::unaligned);

    if(num_unaligned_read > 0){

        // Need to use getReads() -- getReadRange() does not appear to work for unaligned reads
        ngs::ReadIterator read_iter = ngs::ReadIterator( run.getReads(ngs::Read::unaligned) );

        while( read_iter.nextRead() ){

            while( read_iter.nextFragment() ){

                // Do stuff with the sequence data in read_iter.getFragmentBases()
            }
        }
    }
}
else{ // We do *not* have aligned reads


    const size_t num_read = run.getReadCount(ngs::Read::all);

    ngs::ReadIterator read_iter =
        ngs::ReadIterator( run.getReadRange( 1, num_read, ngs::Read::all ) );

    while( read_iter.nextRead() ){

        while( read_iter.nextFragment() ){

            // Do stuff with the sequence data in read_iter.getFragmentBases()
        }
    }
}
klymenko commented 3 years ago

@luizirber, do you still need help?