ncbi / sra-human-scrubber

An SRA tool that takes as input a local fastq file from a clinical infection sample, identifies and removes any significant human reads, and outputs an edited (cleaned) fastq file that can safely be used for SRA submission.

Memory usage #25

Closed: bede closed this issue 1 year ago

bede commented 1 year ago

Hi there,

Thanks for creating and sharing this tool. Does HRRT load the entire input into RAM? I've noticed a Python process created by HRRT using more than 60GB of RAM while processing a 20GB gzipped fastq piped in with zcat and using -x. This is making it difficult to evaluate HRRT with large FASTQs.

Thanks, Bede

multikengineer commented 1 year ago

Bede, the Python scripts simply read and write line by line, so what you are seeing is Python buffering, I would guess. The binary code (not Python) does load the entire human filter db into memory, but it is only 1GB.
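
As an illustration of the pattern described (hypothetical code, not the actual HRRT scripts), a line-by-line filter holds only the current line, plus whatever the interpreter and OS buffer for stdin/stdout:

import sys

# Hypothetical line-by-line filter: only the current line is held by the
# loop itself; any additional memory comes from interpreter/OS buffering.
def stream_lines(instream=sys.stdin, outstream=sys.stdout):
    for line in instream:
        outstream.write(line)
    outstream.flush()

if __name__ == "__main__":
    stream_lines()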

bede commented 1 year ago

Thank you. In that case, are you able to help me troubleshoot crashes like this?

I logged the RSS (in kilobytes) of HRRT's Python process once per second using ps. Here are the final ten seconds of the log before HRRT terminated. Notice the steady climb, followed by the rapid spike to 6.8GB during the three seconds prior to termination. This is with a 3GB fastq.gz and 8GB of available RAM (Ubuntu x86 VM).

python3 /root/sra-human-scr 4607108
python3 /root/sra-human-scr 4610804
python3 /root/sra-human-scr 4614500
python3 /root/sra-human-scr 4618196
python3 /root/sra-human-scr 4622420
python3 /root/sra-human-scr 4626644
python3 /root/sra-human-scr 4630868
python3 /root/sra-human-scr 5200316
python3 /root/sra-human-scr 6562312
python3 /root/sra-human-scr 6762052
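
For reference, a per-second RSS sampler along these lines might look like the sketch below (the log above was produced with ps; the /proc-based approach and its output format here are assumptions, Linux only):

import sys
import time

# Rough sketch of a per-second RSS sampler (Linux only). The pid is assumed
# to be looked up separately, e.g. with pgrep.
def sample_rss(pid, interval=1.0):
    while True:
        try:
            with open(f"/proc/{pid}/status") as fh:
                for line in fh:
                    if line.startswith("VmRSS:"):
                        ts = time.strftime("%Y-%m-%d %H:%M:%S")
                        print(ts, line.split()[1], "kB", flush=True)
                        break
        except FileNotFoundError:
            break  # process has exited
        time.sleep(interval)

if __name__ == "__main__":
    sample_rss(int(sys.argv[1]))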

The command used:

zcat 125m.ERR3242910.fastq.gz | sra-human-scrubber-2.1.0/scripts/scrub.sh -x -p 4 | gzip > 125m.ERR3242910.hrrt.fastq.gz

It terminates like this at the same moment the RSS spikes to 6.8GB:

2023-07-18 11:43:14     59% processed
2023-07-18 11:43:24     60% processed
2023-07-18 11:43:34     61% processed
2023-07-18 11:43:44     62% processed
2023-07-18 11:43:55     63% processed
2023-07-18 11:44:05     64% processed
2023-07-18 11:44:15     65% processed
sra-human-scrubber-2.1.0/scripts/scrub.sh: line 99: 161531 Broken pipe             "${ROOT}"/bin/aligns_to -db "${DB}" $(if [[ "$THREADS" =~ ^[0-9]+$ ]]; then printf "%s" "-num_threads $THREADS"; fi) "$TMP_F_DIR/temp.fasta"
     161532 Killed                  | "$ROOT/scripts/cut_spots_fastq.py" "$INFILE" "$REPLACEN"

For larger FASTQs I have needed to use a 128GB machine to accommodate HRRT's memory usage spiking above 60GB.

multikengineer commented 1 year ago

Bede, your example is a 3GB fastq.gz with 8GB of available RAM (Ubuntu x86 VM). That 3GB file is much bigger once decompressed, so I suspect that is your memory issue.
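
As a rough sanity check (gzipped FASTQ commonly compresses around 3-4x, so a 3GB .gz could easily be 10GB or more decompressed; that ratio is a typical figure, not a measurement of this file), the decompressed size can be measured without writing it out, using the filename from the command above:

import gzip

# Count decompressed bytes of a gzipped FASTQ without keeping it in memory.
total = 0
with gzip.open("125m.ERR3242910.fastq.gz", "rb") as fh:
    for chunk in iter(lambda: fh.read(1 << 20), b""):
        total += len(chunk)
print(f"decompressed size: {total / 1e9:.1f} GB")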

bede commented 1 year ago

OK, so the entire input is loaded into memory as well as being stored in /tmp. That answers my question. It would be great if HRRT didn't use so much memory, but I appreciate that reducing it could involve a lot of work.

multikengineer commented 1 year ago

Strictly speaking, the entire FASTA version of the input fastq file is stored in /tmp (and then removed). The input is read one line at a time from stdin, but Python buffering does occur, and I would guess that is what is filling your memory. The only thing the code explicitly loads into memory is the 1GB database; everything else is Python buffering, even though the lines are read one line at a time from stdin.
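
A sketch of the flow described above, with the FASTA copy kept on disk rather than in memory (hypothetical code, not the repository's scripts; the temp.fasta name echoes the error message earlier in the thread, and the alignment step is elided):

import os
import sys
import tempfile

# Hypothetical sketch: convert the FASTQ on stdin to a FASTA file in a temp
# directory, one line at a time, then clean the file up afterwards.
def with_temp_fasta(fastq_stream):
    with tempfile.TemporaryDirectory() as tmpdir:
        fasta_path = os.path.join(tmpdir, "temp.fasta")
        with open(fasta_path, "w") as out:
            for i, line in enumerate(fastq_stream):
                if i % 4 == 0:      # @header -> >header
                    out.write(">" + line[1:])
                elif i % 4 == 1:    # sequence line
                    out.write(line)
                # '+' separator and quality lines are dropped for FASTA
        # ... the aligns_to binary would be run against fasta_path here ...
    # leaving the 'with' block removes temp.fasta and the directory

if __name__ == "__main__":
    with_temp_fasta(sys.stdin)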