sunbeam-labs / sunbeam

A robust, extensible metagenomics pipeline
http://sunbeam.readthedocs.io

Sunbeam memory leak when running kraken2 #296

Closed yattaa closed 1 year ago

yattaa commented 3 years ago

Hello,

I'm running into an issue when running the kraken2_classify_report rule.

It appears that memory usage on the system (268GB total) keeps increasing as the pipeline moves on to different samples, until no memory is left, which is when the error seems to occur.

Running the PlusPF kraken database for classification with the latest sunbeam version (dev branch).

Worth noting: I'm running this on 12 samples. Each sample's file size is about 160MB.

Here's the error message:

------------------------------------------------------>
Finished job 5160.
4835 of 27378 steps (18%) done
[Mon May 17 12:41:59 2021]
Error in rule kraken2_classify_report:
jobid: 470
output: /P00CD/sunbeam_output/classify/kraken/raw/S00CD-1156_S196-raw.tsv, /P00CD/sunbeam_output/classify/kraken/S00CD-1156_S196-taxa.tsv
shell:

    kraken2 --gzip-compressed                 --db /databases/classify                 --report /P00CD/sunbeam_output/classify/kraken/S00CD-1156_S196-taxa.tsv                 --paired /P00CD/sunbeam_output/qc/decontam/S00CD-1156_S196_1.fastq.gz /P00CD/sunbeam_output/qc/decontam/S00CD-1156_S196_2.fastq.gz                 > /P00CD/sunbeam_output/classify/kraken/raw/S00CD-1156_S196-raw.tsv

    (exited with non-zero exit code)

Removing output files of failed job kraken2_classify_report since they might be corrupted:
/P00CD/sunbeam_output/classify/kraken/raw/S00CD-1156_S196-raw.tsv
[Mon May 17 12:42:01 2021]
Finished job 4444.
4836 of 27378 steps (18%) done
[Mon May 17 12:42:04 2021]
Finished job 2080.
4837 of 27378 steps (18%) done
Removing temporary output file /P00CD/sunbeam_output/qc/decontam/intermediates/S00CD-0453_S357_hostreads.ids.
[Mon May 17 12:42:04 2021]
Finished job 2081.
4838 of 27378 steps (18%) done
Removing temporary output file /P00CD/sunbeam_output/qc/decontam/intermediates/S00CD-1831_S7_hostreads.ids.
[Mon May 17 12:42:05 2021]
Finished job 4445.
4839 of 27378 steps (18%) done
[Mon May 17 12:42:05 2021]
Finished job 5376.
4840 of 27378 steps (18%) done
[Mon May 17 12:42:06 2021]
Finished job 2608.
4841 of 27378 steps (18%) done
Removing temporary output file /P00CD/sunbeam_output/qc/decontam/intermediates/S00CD-1059_S99_hostreads.ids.
[Mon May 17 12:42:06 2021]
Finished job 5377.
4842 of 27378 steps (18%) done
[Mon May 17 12:42:06 2021]
Finished job 5078.
4843 of 27378 steps (18%) done
Removing temporary output file /P00CD/sunbeam_output/qc/decontam/intermediates/S00CD-0560_S80_hostreads.ids.
[Mon May 17 12:42:06 2021]
Finished job 2609.
4844 of 27378 steps (18%) done
Removing temporary output file /P00CD/sunbeam_output/qc/decontam/intermediates/S00CD-1705_S265_hostreads.ids.
[Mon May 17 12:42:06 2021]
Finished job 5079.
4845 of 27378 steps (18%) done
[Mon May 17 12:42:20 2021]
Error in rule kraken2_classify_report:
jobid: 1568
output: /P00CD/sunbeam_output/classify/kraken/raw/S00CD-0032_S32-raw.tsv, /P00CD/sunbeam_output/classify/kraken/S00CD-0032_S32-taxa.tsv
shell:

    kraken2 --gzip-compressed                 --db /databases/classify                 --report /P00CD/sunbeam_output/classify/kraken/S00CD-0032_S32-taxa.tsv                 --paired /P00CD/sunbeam_output/qc/decontam/S00CD-0032_S32_1.fastq.gz /P00CD/sunbeam_output/qc/decontam/S00CD-0032_S32_2.fastq.gz                 > /P00CD/sunbeam_output/classify/kraken/raw/S00CD-0032_S32-raw.tsv

    (exited with non-zero exit code)

Removing output files of failed job kraken2_classify_report since they might be corrupted:
/P00CD/sunbeam_output/classify/kraken/raw/S00CD-0032_S32-raw.tsv
[Mon May 17 12:42:42 2021]
Error in rule kraken2_classify_report:
jobid: 1285
output: /P00CD/sunbeam_output/classify/kraken/raw/S00CD-0965_S5-raw.tsv, /P00CD/sunbeam_output/classify/kraken/S00CD-0965_S5-taxa.tsv
shell:

    kraken2 --gzip-compressed                 --db /databases/classify                 --report /P00CD/sunbeam_output/classify/kraken/S00CD-0965_S5-taxa.tsv                 --paired /P00CD/sunbeam_output/qc/decontam/S00CD-0965_S5_1.fastq.gz /P00CD/sunbeam_output/qc/decontam/S00CD-0965_S5_2.fastq.gz                 > /P00CD/sunbeam_output/classify/kraken/raw/S00CD-0965_S5-raw.tsv

    (exited with non-zero exit code)

Removing output files of failed job kraken2_classify_report since they might be corrupted:
/P00CD/sunbeam_output/classify/kraken/raw/S00CD-0965_S5-raw.tsv
[Mon May 17 12:43:07 2021]
Error in rule kraken2_classify_report:
jobid: 209
output: /P00CD/sunbeam_output/classify/kraken/raw/S00CD-0659_S179-raw.tsv, /P00CD/sunbeam_output/classify/kraken/S00CD-0659_S179-taxa.tsv
shell:

    kraken2 --gzip-compressed                 --db /databases/classify                 --report /P00CD/sunbeam_output/classify/kraken/S00CD-0659_S179-taxa.tsv                 --paired /P00CD/sunbeam_output/qc/decontam/S00CD-0659_S179_1.fastq.gz /P00CD/sunbeam_output/qc/decontam/S00CD-0659_S179_2.fastq.gz                 > /P00CD/sunbeam_output/classify/kraken/raw/S00CD-0659_S179-raw.tsv

    (exited with non-zero exit code)

Removing output files of failed job kraken2_classify_report since they might be corrupted:
/P00CD/sunbeam_output/classify/kraken/raw/S00CD-0659_S179-raw.tsv
[Mon May 17 12:44:20 2021]
Finished job 1316.
4846 of 27378 steps (18%) done
[Mon May 17 12:44:24 2021]
Finished job 1200.
4847 of 27378 steps (18%) done
[Mon May 17 12:44:30 2021]
Finished job 1752.
4848 of 27378 steps (18%) done
[Mon May 17 12:44:35 2021]
Finished job 399.
4849 of 27378 steps (18%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /P00CD/.snakemake/log/2021-05-17T111319.605115.snakemake.log
<------------------------------------------------------

Could anyone please share any idea as to what could be causing this?

Thank you! hatem
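
For anyone triaging a long Snakemake log like the one above, a small script can pull out just the failed rules and job IDs. This is a minimal sketch, assuming the log keeps the `Error in rule <name>:` / `jobid: <n>` wording seen in the paste; the function name `failed_jobs` and the abridged sample string are illustrative and not part of sunbeam or Snakemake.

```python
import re

# Matches "Error in rule <name>:" followed by "jobid: <n>". The \s* lets the
# pattern work whether the two fields are on separate lines or flattened
# onto a single line, as in the paste above.
ERROR_RE = re.compile(r"Error in rule (\w+):\s*jobid: (\d+)")


def failed_jobs(log_text):
    """Return a list of (rule_name, job_id) pairs, in order of appearance."""
    return [(rule, int(jobid)) for rule, jobid in ERROR_RE.findall(log_text)]


# Abridged excerpt based on the log in this issue (paths omitted).
sample_log = (
    "Finished job 5160. 4835 of 27378 steps (18%) done "
    "Error in rule kraken2_classify_report: jobid: 470 output: ... shell: "
    "Finished job 5079. 4845 of 27378 steps (18%) done\n"
    "Error in rule kraken2_classify_report:\n"
    "    jobid: 1568\n"
)

for rule, jobid in failed_jobs(sample_log):
    print(f"{rule}\tjobid {jobid}")
```

Run against the complete log file, the list makes it easier to see whether every failure comes from the same rule, as it does in the excerpt posted here.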

yattaa commented 3 years ago

Hi @ressy and @louiejtaylor - Any thoughts?

louiejtaylor commented 3 years ago

Hi hatem, thanks for the issue! This is not a problem I've seen before; the only kraken2 memory issues I've seen have been when users are using a database larger than the available system memory, which does not seem to be the case here. If my googling is correct, the PlusPF database has a ~50GB footprint, which should be easily handled for you.

Looking at your sunbeam output, your data are paired, correct? I ask because the only memory leak reports I found relating to kraken2 were these two issues, both of which involved unpaired data.

Can you check to see what kraken2 version you have in your environment?

    conda activate sunbeam
    kraken2 --version

Perhaps updating to a newer version (if one is available) will fix your issue--if not, we can try debugging further. I'm not super experienced with troubleshooting memory issues, though, so we may have to bring this over to the kraken2 folks at some point if it seems like a kraken-specific issue.

yattaa commented 3 years ago

Hi there, Louis! Thank you for your reply!

Yes, you are correct: the PlusPF database is about 50GB, and that is not an issue. And yes, the data is definitely paired.

I'm using kraken2 version 2.0.8-beta.

I have uploaded a screen recording ( ➡️ https://www.youtube.com/watch?v=laInPKbTjqo ⬅️ ) showing what's running in more detail and how the memory usage keeps inexplicably increasing until it reaches the limit, which is when the mentioned error message gets thrown.

Running kraken2 sequentially on these same samples is successful, however, and does not lead to the same behaviour seen when running kraken2 with sunbeam.

For that reason, I don't suspect it's a kraken2 issue. At this point, the issue appears to be related to the way sunbeam handles kraken2.

Any thoughts?

Thanks very much for your help again! hatem
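
A text record of memory usage over time can complement a screen recording when comparing the sequential run with the sunbeam run. The following is a minimal sketch, assuming a Linux host where `/proc/meminfo` exposes `MemTotal` and `MemAvailable`; the function names and the sampling interval are hypothetical and not part of sunbeam or kraken2.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def used_fraction(meminfo):
    """Fraction of RAM in use, derived from MemTotal and MemAvailable."""
    return 1.0 - meminfo["MemAvailable"] / meminfo["MemTotal"]


def log_memory(samples, interval_s=5.0, path="/proc/meminfo"):
    """Print a timestamped used-memory percentage `samples` times."""
    for _ in range(samples):
        with open(path) as handle:
            info = parse_meminfo(handle.read())
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp}\t{used_fraction(info) * 100:.1f}% used")
        time.sleep(interval_s)
```

Calling `log_memory(samples=120, interval_s=5.0)` in a second terminal while the pipeline runs would give about ten minutes of readings that can be pasted into the issue and lined up against the timestamps in the Snakemake log.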

ressy commented 3 years ago

Hi hatem,

I'm a little late to this but have a couple of thoughts after catching up:

When you say running kraken2 sequentially works but kraken2 with sunbeam does not, do you mean running kraken2 outside of sunbeam on a single sample versus kraken2 inside of sunbeam using parallelization? If so, maybe try running via sunbeam but with only one sample and as many cores as you're using in your non-sunbeam test. And what command exactly did you run when calling kraken2 directly? You can compare with what Sunbeam's rule does:

https://github.com/sunbeam-labs/sunbeam/blob/dev/rules/classify/kraken.smk#L22

And what kind of system are you running? I noticed your processes are running as root and with paths under the root filesystem so I was wondering if it's containerized/virtualized or something.
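
If parallelization turns out to be the cause, a rough capacity estimate may help decide how many classification jobs to allow at once. The sketch below assumes that each concurrently running kraken2 process holds its own in-memory copy of the database, which has not been confirmed in this thread. The 268 GB and ~50 GB figures come from the earlier comments; the function name and the overhead and reserve parameters are hypothetical.

```python
def max_concurrent_jobs(total_ram_gb, db_gb, per_job_overhead_gb=0.0, reserve_gb=0.0):
    """Estimate how many jobs fit in RAM if each one loads its own database copy.

    total_ram_gb:        physical memory on the machine
    db_gb:               in-memory footprint of the kraken2 database
    per_job_overhead_gb: extra memory per job beyond the database (assumed)
    reserve_gb:          memory held back for the OS and other rules (assumed)
    """
    per_job = db_gb + per_job_overhead_gb
    if per_job <= 0:
        raise ValueError("per-job memory must be positive")
    usable = total_ram_gb - reserve_gb
    return max(0, int(usable // per_job))


# Figures reported in this thread: 268 GB of RAM, ~50 GB database.
print(max_concurrent_jobs(268, 50))  # 5
```

Under that assumption, a sixth simultaneous kraken2 job would already exceed the available memory, which would match failures that appear only when many samples are processed in parallel. Capping the number of simultaneous jobs in the workflow run is one way to test this.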

Ulthran commented 1 year ago

Hi all, I'm closing this issue for inactivity. If the problem resurfaces with the latest version of sunbeam and the sbx_kraken extension, please file a new issue under the sbx_kraken issues page.