ncbi/sra-tools

fasterq-dump overloads memory #903

Open mfansler opened 5 months ago

mfansler commented 5 months ago

I have installed sra-tools v3.0.10, distributed via Bioconda for the linux-64 platform. Running fasterq-dump occupies far more RAM than the flags imply (default is 100 MB per core) and far more than I have ever encountered before with identical commands. In previous versions I always used 8 cores + 1 GB/core, with -t pointing to local scratch disk and VDB configured with plenty of room for the ncbi/sra cache. E.g.,

fasterq-dump -e 8 -S --include-technical -o /fscratch/fanslerm/rc11_d8_1_2.fastq -t /fscratch/fanslerm SRR9117967

Using the above for any SRRs from PRJNA544617 ends with LSF killing my jobs for exceeding memory. I have retried with several larger core/memory configurations; all were eventually killed for overallocating memory. I am currently running again with 4 cores + 8 GB/core (32 GB total).
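For reference, the current submission looks roughly like this (a sketch; the bsub flags are approximate, and the rusage[mem=] units depend on the cluster's LSB_UNIT_FOR_LIMITS setting):

```bash
# Hypothetical bsub line for the 4 cores + 8 GB/core attempt
# (rusage[mem=] shown in MB; check your cluster's unit convention).
bsub -n 4 -R "rusage[mem=8192]" \
  fasterq-dump -e 4 -S --include-technical \
    -o /fscratch/fanslerm/rc11_d8_1_2.fastq -t /fscratch/fanslerm SRR9117967
```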

This makes me suspect something is off in this version.

Please let me know if I can provide any additional information.

mfansler commented 5 months ago

I also tried running in a local Docker container (mambaorg/micromamba:1.5.6) rather than on the HPC, with -e 2 and 16 GB total for the container. This run was also killed.
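The container was capped roughly like this (a sketch; only the image tag comes from above, the rest is approximate):

```bash
# Cap the container at 16 GB of RAM; mambaorg/micromamba:1.5.6 as above.
docker run --rm -it --memory=16g mambaorg/micromamba:1.5.6 bash
```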

OOAAHH commented 4 months ago

I ran into a similar problem. My command: sratoolkit.3.0.10-centos_linux64/bin/fasterq-dump.3.0.10 --split-3 ./ERR4027871.sra --include-technical -O ~/TOS/output -v -p

[screenshot: 2024-03-04 11:53:50]

My admin told me it was definitely an OOM issue. My HPC node keeps going down and the problem keeps recurring. I also suspect the newer version, but I am not sure how to find proper evidence to prove it.

mfansler commented 4 months ago

For completeness, I did eventually get it to complete with the 4-core + 8 GB/core configuration. I expect this will depend on the size of the data.

mfansler commented 4 months ago

@OOAAHH I was able to run your example without any issue. The SRA file is 14 GB, and unpacked it yields a 26 GB FASTQ file. Are you sure you are not running out of disk quota?

Some things I see:

- Your example does not provide a scratch space for the temporary files, so they will be written to a temporary folder in the current directory.
- Unless ~/TOS/output is symlinked elsewhere, it sits under your home directory (~/), which on typical HPC clusters is capped around 100 GB.
- Have you configured VDB so that the NCBI cache is not under your home directory (the default)?

Under worst-case assumptions, this single operation could occupy up to 75 GB of disk at peak.
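If disk is the culprit, keeping everything on scratch looks roughly like this (a sketch; the paths are placeholders, and the vdb-config key is the one documented for relocating the public cache):

```bash
# Placeholder scratch volume; substitute your own.
SCRATCH=/fscratch/$USER
mkdir -p "$SCRATCH/ncbi" "$SCRATCH/tmp" "$SCRATCH/out"

# Relocate the VDB/NCBI cache away from user home (default is ~/ncbi).
vdb-config -s /repository/user/main/public/root="$SCRATCH/ncbi"

# Explicit temp (-t) and output (-O) directories on scratch.
fasterq-dump -e 4 -t "$SCRATCH/tmp" -O "$SCRATCH/out" \
  --split-3 --include-technical ERR4027871
```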

It should further be noted that this particular dataset was uploaded as an aligned BAM. Dumping a FASTQ from a BAM-derived SRA file is mostly useless for scRNA-seq, because any cell barcodes and UMIs live only in the BAM tags and do not get properly dumped out. I don't know what you plan to do with the data, but for processing as scRNA-seq you are likely better off downloading the BAM (and .bai) directly from the ENA (see ERR4027871).
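For example (a sketch; the real file name and link are listed under "Submitted files" on the ENA run page, so treat the URL below as a placeholder):

```bash
# Placeholder URL: copy the actual submitted-file link from the
# ENA run page for ERR4027871 before running this.
wget "ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR402/ERR4027871/<submitted_file>.bam"
```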

OOAAHH commented 4 months ago

First of all, thank you for your prompt and detailed response. Your insights have been incredibly helpful and have highlighted several oversights in my approach.

mfansler commented 4 months ago

Glad to help. Fortunately, the .bai files shouldn't be essential - one can reindex with samtools index to generate new ones.
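For example (file name illustrative):

```bash
# Rebuild the index; writes my.bam.bai alongside the BAM.
samtools index my.bam
```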

OOAAHH commented 3 months ago

I hope this message finds you well. I wanted to take a moment to update you on the significant progress I've made, thanks in large part to your invaluable advice and guidance.

Following your suggestions, I revisited my BAM files and utilized samtools to reindex them and examine the metadata more closely. This process was incredibly enlightening; not only was I able to generate new .bai files successfully, but I also uncovered crucial information embedded within the BAM files. The metadata and initial read segments revealed essential details such as cell barcodes, UMIs, and sample identifiers - precisely the data I needed for my single-cell RNA sequencing analysis.

Discovering this information was particularly critical for me, given the challenging network environment I am operating in, which makes downloading genomic data quite difficult. Being able to extract and use data already in my possession has saved me a tremendous amount of time! My commands: samtools view -H my.bam

[screenshot: output of samtools view -H, 2024-03-06 16:02:19]

samtools view my.bam | head

[screenshot: output of samtools view | head, 2024-03-06 16:44:09]
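In case it helps anyone else, a quick way to confirm the barcode/UMI tags are present (a sketch; CB/UB are the 10x Genomics tag names, other platforms may use different tags):

```bash
# Pull the CB (cell barcode) and UB (UMI) tags from the first reads.
samtools view my.bam | head -n 5 | grep -oE '(CB|UB):Z:[^[:space:]]+'
```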