rhysnewell / aviary

A hybrid assembly and MAG recovery pipeline (and more!)

refine_metabat2 running out of memory #208

Open CJREID opened 3 months ago

CJREID commented 3 months ago

Hi Rhys,

I'm running aviary recover via Slurm, submitting the pipeline as a single job which then submits each rule as its own job. The rule refine_metabat2 is running out of memory and being killed by Slurm.

```
JobID           JobName  Partition      User  AllocCPUS   NNodes    Elapsed   TotalCPU      State  MaxVMSize     MaxRSS     ReqMem        NodeList 
------------ ---------- ---------- --------- ---------- -------- ---------- ---------- ---------- ---------- ---------- ---------- --------------- 
39665747     snakejob.+       defq    rei219         64        1   07:44:52 1-00:52:56 OUT_OF_ME+                             480G            c320 
39665747.ba+      batch                              64        1   07:44:52 1-00:52:56 OUT_OF_ME+ 113578094+ 500873864K                       c320 
39665747.ex+     extern                              64        1   07:44:52  00:00.001  COMPLETED    124080K      1092K                       c320 
```

The first time it failed, I edited the rule so it would take 480G from my config instead of the default 128G (`mem_mb = lambda wildcards, attempt: max(int(config["max_memory"])*1024, 128*1024*attempt)`), but it has since failed twice with the same error. Notably, for both the 128G and 480G runs the MaxRSS was only slightly above the requested memory. I don't have the 128G log file, though, as it was overwritten by the subsequent run.
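For reference, the edit amounts to something roughly like the sketch below (the surrounding rule body is illustrative rather than the exact Aviary source; `max_memory` is the config value set by `-m`, in GB):

```python
# Rough sketch of the edited resources block, not the exact Aviary rule body.
# config["max_memory"] is the value passed to aviary with -m (480 here).
rule refine_metabat2:
    # ...input, output, log and benchmark as in the pipeline...
    threads: 64
    resources:
        # Take the larger of the configured maximum (GB -> MB) and the default
        # 128 GB-per-attempt schedule, so the first attempt already requests
        # the full 480 GB rather than 128 GB.
        mem_mb = lambda wildcards, attempt: max(
            int(config["max_memory"]) * 1024,
            128 * 1024 * attempt,
        ),
        runtime = 4320,
```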

Looking at the refine_metabat2.log file, Rosella takes ~6 hours to fail on iteration 0, then runs through two more iterations (with CheckM) successfully in about 20 minutes. I've cut the CheckM output for brevity.

refine_metabat2.log

```
INFO: 08:56:20 - Refining iteration 0
INFO: 08:56:26 - Rosella refine iteration 0
[2024-06-17T22:56:26Z INFO rosella] rosella version 0.5.3
[2024-06-17T22:56:27Z INFO rosella::refine::refinery] Reading TNF table.
[2024-06-17T22:56:27Z INFO rosella::refine::refinery] Beginning refinement of 68 MAGs
[2024-06-17T23:10:59Z ERROR rosella::refine::refinery] Last message from flight: Refining bins...
[2024-06-17T23:10:59Z ERROR rosella::refine::refinery] Run in debug mode to get full output.
[2024-06-18T06:18:04Z ERROR rosella] Refine Failed with error: Flight failed with exit code: signal: 9 (SIGKILL)
INFO: 16:18:04 - CheckM iteration 0
Running CheckM on 119 bins
[2024-06-18 16:18:06] INFO: CheckM v1.1.3
[2024-06-18 16:18:06] INFO: checkm lineage_wf -t 64 --pplacer_threads 48 -x fna --tab_table -f data/metabat2_refined//rosella_refined_1/refined_bins//checkm.out data/metabat2_refined//rosella_refined_1/refined_bins/ data/metabat2_refined//rosella_refined_1/refined_bins//checkm
[2024-06-18 16:18:06] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2024-06-18 16:18:07] INFO: Identifying marker genes in 119 bins with 64 threads:
...
[2024-06-18 16:27:33] INFO: QA information written to: data/metabat2_refined//rosella_refined_1/refined_bins//checkm.out
[2024-06-18 16:27:33] INFO: { Current stage: 0:00:07.470 || Total: 0:09:27.247 }
INFO: 16:27:34 - Ending iteration: 0
INFO: 16:27:34 - Rosella refine iteration 1
[2024-06-18T06:27:34Z INFO rosella] rosella version 0.5.3
[2024-06-18T06:27:35Z INFO rosella::refine::refinery] Reading TNF table.
[2024-06-18T06:27:36Z INFO rosella::refine::refinery] Beginning refinement of 17 MAGs
INFO: 16:35:18 - CheckM iteration 1
Running CheckM on 5 bins
[2024-06-18 16:35:19] INFO: CheckM v1.1.3
[2024-06-18 16:35:19] INFO: checkm lineage_wf -t 64 --pplacer_threads 48 -x fna --tab_table -f data/metabat2_refined//rosella_refined_2/refined_bins//checkm.out data/metabat2_refined//rosella_refined_2/refined_bins/ data/metabat2_refined//rosella_refined_2/refined_bins//checkm
[2024-06-18 16:35:19] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2024-06-18 16:35:19] INFO: Identifying marker genes in 5 bins with 64 threads:
...
[2024-06-18 16:37:15] INFO: QA information written to: data/metabat2_refined//rosella_refined_2/refined_bins//checkm.out
[2024-06-18 16:37:15] INFO: { Current stage: 0:00:00.372 || Total: 0:01:55.817 }
INFO: 16:37:15 - Ending iteration: 1
INFO: 16:37:15 - Refining iteration 2
INFO: 16:37:15 - Rosella refine iteration 2
[2024-06-18T06:37:15Z INFO rosella] rosella version 0.5.3
[2024-06-18T06:37:16Z INFO rosella::refine::refinery] Reading TNF table.
[2024-06-18T06:37:17Z INFO rosella::refine::refinery] Beginning refinement of 2 MAGs
INFO: 16:41:02 - CheckM iteration 2
No bins to refine
Skipping refinement
Refinery finished.
```

Three seconds after the log suggests the rule has finished successfully, the .err file shows the job being killed by the out-of-memory handler.

refine_metabat2.err

```
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=491520, mem_mib=468750, disk_mb=4840, disk_mib=4616
Select jobs to execute...

[Tue Jun 18 08:56:16 2024]
rule refine_metabat2:
    input: data/metabat_bins_2/checkm.out, data/metabat_bins_2/done, data/coverm.cov, /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta
    output: data/metabat2_refined/done
    log: logs/refine_metabat2.log
    jobid: 0
    benchmark: benchmarks/refine_metabat2.benchmark.txt
    reason: Missing output files: data/metabat2_refined/done
    threads: 64
    resources: mem_mb=491520, mem_mib=468750, disk_mb=4840, disk_mib=4616, tmpdir=/tmp/rei219.39665747, runtime=4320

python -c "from __future__ import print_function; import sys, json; print(json.dumps([sys.version_info.major, sys.version_info.minor]))"
Activating conda environment: ../../../../../../.conda/f66cf05325a05bc6eeb3592400331d65_
python /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/scripts/tmpuokzy9yx.rosella_refine.py
Activating conda environment: ../../../../../../.conda/f66cf05325a05bc6eeb3592400331d65_
Not cleaning up /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/scripts/tmpuokzy9yx.rosella_refine.py
[Tue Jun 18 16:41:05 2024]
Finished job 0.
1 of 1 steps (100%) done
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=39665747.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
```

The data/metabat2_refined/final_bins folder has 115 refined bins from iteration 1 and 4 from iteration 2, so the refinement seems to have more or less worked. However, because the job is marked as failed after being killed (see the cluster error below - Snakemake removes some outputs of refine_metabat2), rerunning the pipeline starts the rule from scratch and it fails again after 6-7 hours.

cluster.err

```
Prior to running version 0.9.1, you will need to source: /apps/aviary-genome/0.9.1/aviary_env.sh
Note: This application uses snakemake and you will need to invoke a profile to run the pipelines on our batch system
See https://rhysnewell.github.io/aviary/examples and HPC submission
Loading aviary-genome/0.9.1
  Loading requirement: python/3.9.4 amazon-corretto/21.0.0.35.1 bbmap/39.01 parallel/20210322 miniconda3/4.9.2 pigz/2.7.0 fastani/1.33 fasttree/2.1.11 hmmer/3.4.0 gsl/2.6 capnproto-c++/0.9.1 mash/2.3 pplacer/1.1.alpha19 prodigal/2.6.3 gtdbtk/2.3.2
Warning: '/apps/aviary-genome/0.9.1/conda_pkgs' already in 'pkgs_dirs' list, moving to the top
Warning: '/apps/aviary-genome/0.9.1/conda_envs' already in 'envs_dirs' list, moving to the top
06/18/2024 08:56:09 AM INFO: Time - 08:56:09 18-06-2024
06/18/2024 08:56:09 AM INFO: Command - /apps/aviary-genome/0.9.1/bin/aviary recover --assembly /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta -1 /scratch3/rei219/projects/SBM/data/processed/fastp/SBM_001_005.filt.R1.fastq.gz -2 /scratch3/rei219/projects/SBM/data/processed/fastp/SBM_001_005.filt.R2.fastq.gz -l /scratch3/rei219/projects/SBM/data/raw/seq/DNA/pacbio/S_hifi.fastq.gz --gtdb-path /scratch3/projects/datasets_bioref/aviary/gtdb --eggnog-db-path /scratch3/projects/datasets_bioref/aviary/eggnog --checkm2-db-path /scratch3/projects/datasets_bioref/aviary/checkm2db --singlem-metapackage-path /scratch3/projects/datasets_bioref/aviary/singlem/S4.3.0.GTDB_r220.metapackage_20240523.smpkg.zb --tmpdir /scratch3/rei219/aviary_tmp --snakemake-profile /scratch3/rei219/.config/snakemake/slurm --conda-prefix /scratch3/rei219/.conda --semibin-model soil --skip-abundances --skip-binner vamb -z hifi -n 64 -t 64 -m 480 -o /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg
06/18/2024 08:56:09 AM INFO: Version - 0.9.1
06/18/2024 08:56:09 AM INFO: Configuration file written to /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/config.yaml
06/18/2024 08:56:09 AM INFO: Executing: snakemake --snakefile /apps/aviary-genome/0.9.1/lib/python3.9/site-packages/aviary/modules/Snakefile --directory /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg --cores 64 --rerun-incomplete --keep-going --rerun-triggers mtime --configfile /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/config.yaml --nolock --profile /scratch3/rei219/.config/snakemake/slurm --retries 0 --conda-frontend mamba --resources mem_mb=491520 --use-conda --conda-prefix /scratch3/rei219/.conda recover_mags
Using profile /scratch3/rei219/.config/snakemake/slurm for setting default command line arguments.
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cluster nodes: 100
Provided resources: mem_mb=491520
Job stats:
job                 count
----------------  -------
checkm2                 1
checkm_das_tool         1
das_tool                1
finalise_stats          1
gtdbtk                  1
recover_mags            1
refine_dastool          1
refine_metabat2         1
singlem_appraise        1
total                   9

Select jobs to execute...

[Tue Jun 18 08:56:13 2024]
rule refine_metabat2:
    input: data/metabat_bins_2/checkm.out, data/metabat_bins_2/done, data/coverm.cov, /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta
    output: data/metabat2_refined/done
    log: logs/refine_metabat2.log
    jobid: 5
    benchmark: benchmarks/refine_metabat2.benchmark.txt
    reason: Missing output files: data/metabat2_refined/done
    threads: 64
    resources: mem_mb=491520, mem_mib=468750, disk_mb=4840, disk_mib=4616, tmpdir=, runtime=4320

Submitted job 5 with external jobid '39665747'.
[Tue Jun 18 16:41:10 2024]
Error in rule refine_metabat2:
    jobid: 5
    input: data/metabat_bins_2/checkm.out, data/metabat_bins_2/done, data/coverm.cov, /scratch3/rei219/projects/SBM/outputs/metamdbg/S/S.fasta
    output: data/metabat2_refined/done
    log: logs/refine_metabat2.log (check log file(s) for error details)
    conda-env: /scratch3/rei219/.conda/f66cf05325a05bc6eeb3592400331d65_
    cluster_jobid: 39665747
Error executing rule refine_metabat2 on cluster (jobid: 5, external: 39665747, jobscript: /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/tmp.uv_g70nb/snakejob.refine_metabat2.5.sh). For error details see the cluster log and the log files of the involved rule(s).
Removing output files of failed job refine_metabat2 since they might be corrupted:
data/metabat2_refined/done
Exiting because a job execution failed. Look above for error message
chmod g+w /scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/.snakemake/log/2024-06-18T085610.315172.snakemake.log
Complete log: .snakemake/log/2024-06-18T085610.315172.snakemake.log
06/18/2024 04:41:21 PM CRITICAL: Command '['snakemake', '--snakefile', '/apps/aviary-genome/0.9.1/lib/python3.9/site-packages/aviary/modules/Snakefile', '--directory', '/scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg', '--cores', '64', '--rerun-incomplete', '--keep-going', '--rerun-triggers', 'mtime', '--configfile', '/scratch3/rei219/projects/SBM/outputs/aviary/recover/S_metamdbg/config.yaml', '--nolock', '--profile', '/scratch3/rei219/.config/snakemake/slurm', '--retries', '0', '--conda-frontend', 'mamba', '--resources', 'mem_mb=491520', '--use-conda', '--conda-prefix', '/scratch3/rei219/.conda', 'recover_mags']' returned non-zero exit status 1.
```

Is it likely that the first Rosella iteration failing and the excess memory requirement at the end are linked?

Thanks,

Cam

rhysnewell commented 3 months ago

Hey Cameron,

Thanks for using Aviary, and apologies you've come across this issue. It seems the refinement step crashing on the first iteration is due to out-of-memory issues, and that the failure then isn't properly surfaced until later in the pipeline, which is interesting. How big are the MAGs being produced by metabat2, and how many reads are you using for binning?

It might be that I underestimated how memory-intensive some refinement processes can get when given a lot of samples, but we shall see.

Cheers, Rhys

CJREID commented 3 months ago

Hey Rhys,

Thanks for getting back to me!

Metabat2 produced 397 bins, the largest being 387MB, with an average of about 5MB. I used one long-read dataset (PacBio HiFi, ~2M reads, ~16Gbp) and one short-read dataset (originally ~80M reads, ~25Gbp) for binning. I was hoping to include more short-read sets from additional timepoints at some point too. Is this too much data?

Do you think the 115 MAGs I've got out are the complete set, or does the failed iteration mean I've missed out on some? I ended up just creating the dummy 'done' file in the metabat2_refined directory so the rest of the pipeline would run from there, since I did have some MAGs.
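For completeness, the 'dummy done file' workaround was just recreating the sentinel the rule declares as its output (path taken from the rule definition in the logs above), roughly:

```python
# Recreate the sentinel output of rule refine_metabat2 so Snakemake treats the
# rule as complete and downstream rules run against the bins already present
# in data/metabat2_refined/final_bins. Run from the aviary output directory.
from pathlib import Path

Path("data/metabat2_refined/done").touch()
```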

On a side note, I also ran it with the node's maximum memory (512G, apparently reduced to ~502G in reality) and it still failed OOM, with a MaxRSS lower than what was requested. I know MaxRSS can be unreliable/inaccurate, but I still thought this was odd.

```
JobID           JobName  Partition      User  AllocCPUS   NNodes    Elapsed   TotalCPU      State  MaxVMSize     MaxRSS     ReqMem        NodeList 
------------ ---------- ---------- --------- ---------- -------- ---------- ---------- ---------- ---------- ---------- ---------- --------------- 
42212417     smk.refin+       defq    rei219         64        1   09:19:55 1-03:22:25 OUT_OF_ME+                          502116M            c184 
42212417.ba+      batch                              64        1   09:19:55 1-03:22:25 OUT_OF_ME+ 114018829+ 495716879K                       c184 
42212417.ex+     extern                              64        1   09:19:55  00:00.001  COMPLETED    124080K      1040K                       c184 
```

rhysnewell commented 3 months ago

Oh okay, that's interesting... It should definitely be handling this okay, so something must be going on. Is that 387MB bin an actual bin, or is it the unbinned contigs? And are there any other exceptionally large bins? 5MB is a very standard size for a MAG, but 387MB is huge and would definitely cause some issues, I think.

CJREID commented 3 months ago

Ahh, yes, there are quite a few very large bins - 33 of them larger than 10MB in the metabat_bins_2 folder. All the metabat1 variants are also producing large bins; not quite as big as the 387MB one, but many over 10MB too.

I just did a quick CheckM run on some of these large bins and, unsurprisingly, they're all highly contaminated (over 100% in some cases - not sure how that's possible), with genome sizes far larger than any known microbial genome.

Not really sure what the solution is here. I will delete the excessively large bins and give it another run to see if that at least alleviates the memory issue. Do you think an additional CheckM step before the first Rosella iteration could be a potential feature addition, to filter excessively large and contaminated bins prior to refinement?
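The interim clean-up I have in mind is just setting the oversized bins aside before rerunning, roughly along the lines below (the 100MB cutoff, the `.fa` extension and the destination folder are arbitrary choices for my setup, not anything aviary prescribes):

```python
# Move suspiciously large MetaBAT2 bins out of the bin folder before rerunning
# the refine step. Cutoff, file extension and destination are assumed/arbitrary.
from pathlib import Path

bin_dir = Path("data/metabat_bins_2")
quarantine = Path("data/metabat_bins_2_oversized")
quarantine.mkdir(exist_ok=True)

size_cutoff = 100 * 1024 * 1024  # 100MB, far larger than any plausible prokaryote genome

for fasta in sorted(bin_dir.glob("*.fa")):
    size_bytes = fasta.stat().st_size
    if size_bytes > size_cutoff:
        print(f"Setting aside {fasta.name} ({size_bytes / 1e6:.0f} MB)")
        fasta.rename(quarantine / fasta.name)
```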

I can't find anything online about Metabat producing excessively large bins, but I am concerned that this is a case of garbage in, garbage out, and an issue with my data is the ultimate culprit.

rhysnewell commented 3 months ago

Still, 10MB should be absolutely fine; I'm talking more about excessively large bins like the 387MB one. Is it just Metabat producing these bins? Where do the contigs in these bins end up when binned by the other binning programs? I would try just deleting those super large bins from the metabat output, plus deleting any of the rosella refine output, and then letting it try again. I know it's annoying, but that might be best here.

The refine step is there to try and break apart contaminated bins, so having a step that throws out things that look contaminated kind of defeats the purpose. It's still very odd that it would cause such extreme memory issues. If you have a way of providing me with the assembly, metabat2 bins and coverage files, I'd be happy to take a closer look and try to figure out what is happening.

CJREID commented 3 months ago

Only metabat2 seems to be making giant ones; another sample has a 66MB bin from metabat2. The metabat1 variants make some bigger ones (63-80MB), and those rules are notably not running out of memory. I had a quick look at the rosella bins as an example, and the contigs from the 387MB metabat2 bin are spread across many different rosella bins as well as the unbinned fraction, which is also obviously a concern.

I'll try what you've suggested and see what happens. I'll send you a data link via email too.

Cheers!

rhysnewell commented 3 months ago

You can also tell aviary to skip metabat2 altogether for that particular sample; it might be easiest to do that for now while I tinker.

CJREID commented 3 months ago

To update - no issues with refine_metabat2 after removing the two mega bins (~380 and ~120MB) from that sample and running refine again. Also no issues when running the recover workflow with just a single set of short reads for binning against the long-read assembly. Metabat2 did produce two large bins (96 and 43MB) on this run, but refine handled them fine. It seems that throwing the kitchen sink at metabat2 (i.e. three sets of short reads and one set of long reads) could be at least part of the reason for these massive contaminated bins. I may end up just running each read set alone and then dereplicating the final bins from the multiple runs to get around this.