metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

Define default resources #676

Closed · SilasK closed this issue 7 months ago

SilasK commented 1 year ago

If default resources are defined via the command line, the resource attribution in atlas fails. See #668.

SilasK commented 1 year ago

Good comment from @LLansing, moved from #668:

I'm not sure if this merits opening an issue, but I thought I'd mention a problem I'm encountering with default resources + cluster execution that sounds quite similar to the one encountered by @jotech.

I'm using the slurm cluster profile available from the cookiecutter command suggested for use in the ATLAS docs (in the cluster execution section).

It seems default resources are overriding the resource parameter values set in the ATLAS rule definitions, although I'm not defining default resources on the command line or setting them elsewhere.

For example, the initialize_qc rule sets mem=config["simplejob_mem"], which is 10 (GB) in the config.yaml. From what I can tell, this mem parameter is converted to mem_mb near the end of the ATLAS Snakefile IF mem_mb is not listed in the job resources. However, Snakemake submits the job with the following resources specified: resources: mem_mb=3992, disk_mb=3992, disk_mib=3808, tmpdir=, mem=10, java_mem=8, mem_mib=3992, time_min=300, runtime=300. Clearly the mem parameter isn't being converted to mem_mb; rather, default resource values are being applied from some source unknown to me. I'm currently looking for a way to prevent Snakemake's default resources from being applied, other than --default-resources mem_mb=None.
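
To illustrate what I mean, the conversion I'm describing behaves roughly like this (a plain-Python sketch of the behaviour, not the actual ATLAS Snakefile code; the function name is only for illustration):

```python
# Plain-Python sketch of the behaviour described above, not the actual ATLAS
# Snakefile code: the GB-valued "mem" resource is converted to "mem_mb" only
# when mem_mb is not already present in the job's resources.
def convert_mem_to_mem_mb(resources: dict) -> dict:
    if "mem" in resources and "mem_mb" not in resources:
        resources["mem_mb"] = resources["mem"] * 1000  # GB -> MB
    return resources

print(convert_mem_to_mem_mb({"mem": 10, "java_mem": 8}))
# -> {'mem': 10, 'java_mem': 8, 'mem_mb': 10000}
```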

EDIT: I've added --default-resources mem_mb=None disk_mb=None to my ATLAS call, which removed the limiting default values that were killing my jobs. However, with a bit of inspection-via-print() in the Snakefile section where mem is converted to mem_mb, r.resources["mem_mb"] exists, and therefore the if statement is not entered. From what I can tell, the value of r.resources["mem_mb"] is some DefaultResources function (it prints as <function DefaultResources.__init__.<locals>.fallback.<locals>.callable at 0x7f34aba8cd30>). In the Snakemake output at rule submission, it no longer lists a value for mem_mb (it's not setting it to a default value; that's good), and mem is set to the corresponding value in the config.yaml file. The mem value is not being converted to mem_mb as I believe is intended in the above-mentioned section of the Snakefile.

I've noticed in my investigation that many rules set the mem parameter, but some set mem_mb directly. I've also noticed that some rules set mem_mb to a config value * 1000, essentially converting the config GB value to MB (e.g. rule download_gunc), whereas other rules simply set mem_mb equal to the config GB value, without multiplying to convert GB to MB (e.g. rule run_gunc). Is there a pattern to these different methods of setting mem_mb? Why not set mem_mb from the config in the rule resources across the board?
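
To make the two patterns concrete, they look roughly like the following (hypothetical Snakemake fragments; the rule names, outputs and config key are placeholders, not the real ATLAS definitions):

```
# Hypothetical Snakemake fragments illustrating the two patterns; rule names,
# outputs and the config key are placeholders, not the real ATLAS rules.
config.setdefault("simplejob_mem", 10)  # placeholder so the fragment is self-contained (GB)

rule downloads_with_conversion:          # pattern of e.g. download_gunc
    output:
        "resources/db_a.done",
    resources:
        mem_mb=config["simplejob_mem"] * 1000,  # config value in GB, converted to MB
    shell:
        "touch {output}"

rule runs_without_conversion:            # pattern of e.g. run_gunc (the bug)
    output:
        "resources/db_b.done",
    resources:
        mem_mb=config["simplejob_mem"],  # still the GB number, so far too small
    shell:
        "touch {output}"
```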

SilasK commented 1 year ago

I admit my resource definitions are a bit of a mess. There was no standard definition for memory in Snakemake, or rather, it kept changing.

If I am not mistaken, it should be mem_mib everywhere. The config file defines memory in GB, so there is definitely an error in rule run_gunc.

LLansing commented 1 year ago

I checked all the rule .smk files, and the only rules in which mem_mb is set to a config value without converting GB to MB are the following rules in bin_quality.smk:

SilasK commented 1 year ago

Thank you very much. Do you want to make a PR or should I implement it?

LLansing commented 1 year ago

    Thank you very much. Do you want to make a PR or should I implement it?

I will submit a simple PR

LLansing commented 1 year ago

Commenting to add to my original comment:

After adding --default-resources mem_mb=None disk_mb=None to my ATLAS call, jobs weren't showing mem_mb or disk_mb being set in the job resources in the Snakemake output. Afterwards, some jobs succeeded, but others didn't, with nothing but a kill... message in the logfile as a clue. I figured that this was the job hitting a resource constraint, and thus a default resource value was still being applied rather than the value set via mem in the majority of the workflow's rules.

I added or not isinstance(r.resources["mem_mb"], int) to the if statement that leads to setting mem_mb in the Snakefile. I realize this isn't a robust solution, so I won't make a pull request for it, but it got jobs working for me, with mem_mb being set to the value intended by the rules and config.yaml.
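
In terms of the plain-Python sketch above, the patched check behaves roughly like this (again only an illustration of the idea, not the actual Snakefile code):

```python
# Plain-Python sketch of the workaround, not the actual ATLAS Snakefile code:
# also convert "mem" when mem_mb is present but is not a concrete integer
# (e.g. when it is a DefaultResources callable injected by --default-resources).
def convert_mem_to_mem_mb(resources: dict) -> dict:
    if "mem" in resources and (
        "mem_mb" not in resources or not isinstance(resources["mem_mb"], int)
    ):
        resources["mem_mb"] = resources["mem"] * 1000  # GB -> MB
    return resources

# With default resources active, mem_mb can be a callable placeholder rather than
# an int, so the original "mem_mb not in resources" test skipped the conversion.
print(convert_mem_to_mem_mb({"mem": 10, "mem_mb": lambda wildcards, attempt: 3992}))
# -> {'mem': 10, 'mem_mb': 10000}
```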

the-eon-flux commented 12 months ago

I am having the same problem with all the individual sbatch job run parameters, like the minimum memory per node (the disk_mb value?) and the Java -Xm{x,s} parameters in rule_decontam. Hence, I thought it was worth mentioning.

All my jobs are getting killed due to irregular memory allocations (the Java -Xms and disk_mb values are always set to 102 GB and 1000M, respectively). I attempted to include a new parameter 'java_mem: 2' in the config file (assuming the job needs around 2 GB). However, the rule consistently disregards this setting.

When I run atlas with the command 'atlas run qc ... --default-resources disk_mb=None disk_mib=None mem=None mem_mb=None java_mem=None', the mem param from the config file is used (see the log file).

But the java_mem param takes on a value of around 153 GB (ignoring the 'java_mem: 2' param from the config file), which is roughly 85% of the 'mem=180' value specified in the config file (0.85 × 180 GB ≈ 153 GB). However, the total memory allocated to the sbatch jobs by ATLAS is always 1000M; it is neither 180 GB nor 180 GB divided by the number of threads or jobs.

Essentially, my jobs are killed because only 1000M is allocated for the job, while the Java heap is given more space than the job's total RAM. I seem to be stuck at this point.

Thanks for this thread/feature request btw.

SilasK commented 11 months ago

Sorry @the-eon-flux, I missed your comment.

  • So the problem arises because your cluster profile defines --default-resources somewhere. If there is a way to deactivate this, that would solve it.
  • java_mem: 2 in the config file has no effect.
  • I think disk_mb is not the problem. If I understand correctly, --default-resources disk_mb=None disk_mib=None mem=None mem_mb=None java_mem=None makes atlas use the mem argument from the config file and 0.85*mem for java_mem, as expected. However, mem_mb, which is used by the cluster wrapper, is not set.

Solution

  • [ ] I should use mem_mb throughout the pipeline and set appropriate default-resources. This needs some testing.

Intermediate solutions:

  1. hack from @LLansing

    I added or not isinstance(r.resources["mem_mb"], int) to the if statement that leads to setting mem_mb in the Snakefile

  2. You can set resource arguments for specific rules

    --default-resources mem_mb=80000 --set-resources rule_decontam:mem_mb=180000 rule_decontam:java_mem=150

    assuming your config file sets the memory to 80 (GB).

the-eon-flux commented 11 months ago

Sorry @the-eon-flux, I missed your comment.

  • So the problem arises because your cluster profile defines --default-resources somewhere. If there is a way to deactivate this, that would solve it.
  • java_mem: 2 in the config file has no effect.
  • I think disk_mb is not the problem. If I understand correctly, --default-resources disk_mb=None disk_mib=None mem=None mem_mb=None java_mem=None makes atlas use the mem argument from the config file and 0.85*mem for java_mem, as expected. However, mem_mb, which is used by the cluster wrapper, is not set.

Solution

  • [ ] I should use mem_mb throughout the pipeline and set appropriate default-resources. This needs some testing.

Intermediate solutions:

  1. hack from @LLansing

    I added or not isinstance(r.resources["mem_mb"], int) to the if statement that leads to setting mem_mb in the Snakefile

  2. You can set resource arguments for specific rules

    --default-resources mem_mb=80000 --set-resources rule_decontam:mem_mb=180000 rule_decontam:java_mem=150

    assuming your config file sets the memory to 80 (GB).

@SilasK, thank you so much for your help. I tried solution number 2, and it has started sbatch jobs for the rule rule_decontam with the user-specified parameters. I had jobs crashing for this specific rule earlier, so I am running rule_decontam first. Afterwards, I will see if there are problems with other rules as well.

atlas run qc --profile cluster --jobs 12 --keep-going --until run_decontamination --latency-wait 30000 --default-resources mem_mb=100000 --set-resources rule_decontam:mem_mb=180000 rule_decontam:java_mem=150

This was my atlas command, which was wrapped in an sbatch command.

jotech commented 11 months ago

Fortunately, I was also successful after experimenting for some time. On our SLURM system, the following config seems to work now:

atlas run all --profile cluster --wrapper-prefix 'git+file:///path/to/snakemake-wrappers' --default-resources mem_mb=250000

I'm setting the mem_mb variable according to the large_mem value defined in config.yaml (converted to megabytes).

This thread was very helpful. Thank you very much!

jotech commented 11 months ago

I have a short follow-up question after reading about the rule-specific resource arguments.

For me, the rule eggNOG_annotation often fails because it runs out of time. I ended up editing the file workflow/rules/genecatalog.smk by adding time=config["runtime"]["long"] for this rule.
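
Roughly, my edit looks like the following (a sketch only: the output, threads, memory value and placeholder shell command are made up; only the added time line reflects the actual change):

```
# Hypothetical sketch of the edited rule in workflow/rules/genecatalog.smk;
# everything except the added "time" resource is a placeholder.
config.setdefault("mem", 180)                 # placeholder memory setting (GB)
config.setdefault("runtime", {"long": 48})    # placeholder runtime setting (hours)

rule eggNOG_annotation:
    output:
        "Genecatalog/annotations/placeholder_eggNOG.tsv",
    threads: 8
    resources:
        mem_mb=config["mem"] * 1000,          # placeholder for the existing memory resource
        time=config["runtime"]["long"],       # the added line: request the long time limit
    shell:
        "echo 'placeholder for the actual eggNOG-mapper call' > {output}"
```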

Could something like this work instead?

atlas run --set-resources rule_eggNOG_annotation:runtime=24

SilasK commented 11 months ago

Yes, it should, but the correct form is:

atlas run --set-resources eggNOG_annotation:runtime=24

There is also an option to use the virtual disk (shm) to accelerate eggNOG. And maybe you do not want eggNOG for the genecatalog at all; there are other annotations (KEGG, CAZy).

benyoung93 commented 9 months ago

Hi all :). This thread has been super helpful for me, as I have also been having problems with really strange mem_mb and disk_mb values being set.

@SilasK, has this been fixed at this point, or is it still a work in progress? I am using a Slurm system and am encountering the problems above (setting the resources manually in the actual atlas command has fixed this and I am progressing slowly, but it would be nice not to have to do this if at all possible).

I am especially having problems with the download_gunc rule, which sets disk_mb to a value of 1000. Even when specifying larger amounts of mem_mb and disk_mb, this rule keeps failing, saying I am out of space. I have checked all my space allocations on my node and there is ample room for this database. I have attached below the command that has got me through the first few steps of the pipeline :).

atlas run all \
--working-dir /rc_scratch/beyo2625/sctld_patho \
--config-file /rc_scratch/beyo2625/sctld_patho/config.yaml \
--jobs 20 \
--profile cluster \
--default-resources mem_mb=250000 \
--set-resources deduplicate_reads:mem_mb=80000 dram_download:disk_mb=100000 download_gunc:disk_mb=100000 download_gunc:mem_mb=100000

If you need any more information, please let me know and I can provide it.

SilasK commented 9 months ago

I agree the default mem should be set in the atlas command. The issue with disk_mb is new to me. Let's discuss it in #706.

github-actions[bot] commented 7 months ago

There has been no activity for some time. I hope your issue has been solved in the meantime. This issue will automatically close soon if no further activity occurs.

Thank you for your contributions.