Closed SilasK closed 7 months ago
Good comment from @LLansing moved from #668
I'm not sure if this merits opening an issue, but I thought I'd mention a problem I'm encountering with default resources + cluster execution that sounds quite similar to that encountered by @jotech
I'm using the slurm cluster profile available from the cookiecutter command suggested in the ATLAS docs (in the cluster execution section).
It seems default resources are overriding the resource parameter values set in the ATLAS rule definitions, although I'm not defining default resources on the command line or setting them elsewhere.
For example, the initialize_qc rule sets mem=config["simplejob_mem"], which is 10 (GB) in the config.yaml. From what I can tell, this mem parameter is converted to mem_mb near the end of the ATLAS Snakefile IF mem_mb is not listed in the job resources. However, Snakemake submits the job with the following resources specified: resources: mem_mb=3992, disk_mb=3992, disk_mib=3808, tmpdir=
EDIT:
I've added the --default-resources mem_mb=None disk_mb=None to my ATLAS call, which removed the limiting default values that were killing my jobs.
However, with a bit of inspection-via-print() in the Snakefile section where mem is converted to mem_mb, r.resources["mem_mb"] exists, and therefore the if statement is not entered. The value of r.resources["mem_mb"] is some DefaultResources function from what I can tell (it prints as <function DefaultResources.init.
I've noticed in my investigation that many rules set the mem parameter, but some set mem_mb directly. I've also noticed that some rules set mem_mb to a config value * 1000, essentially converting the config GB value to MB (e.g. rule download_gunc), whereas other rules simply set mem_mb equal to the config GB value, without multiplying to convert GB to MB (e.g. rule run_gunc). Is there a pattern to these different methods of setting mem_mb? Why not set the rule resources as mem_mb=config.
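From the behaviour described above, the conversion near the end of the ATLAS Snakefile presumably looks something like the sketch below. This is a hedged reconstruction, not the actual ATLAS code; the function name finalize_resources is made up. The mem_mb=3992 seen in the log is consistent, if I recall correctly, with Snakemake's built-in default resources of max(2*input.size_mb, 1000) being injected before the conversion runs.

```python
# Hypothetical sketch (NOT the actual ATLAS code) of the mem -> mem_mb
# conversion described above: rules set `mem` in GB, and mem_mb is only
# derived from it when no mem_mb entry exists yet.

def finalize_resources(resources):
    """Derive mem_mb (MB) from a GB-valued mem, unless mem_mb is already set."""
    if "mem_mb" not in resources and "mem" in resources:
        resources["mem_mb"] = resources["mem"] * 1000
    return resources

# A rule that only sets mem=10 (GB) ends up with mem_mb=10000:
print(finalize_resources({"mem": 10}))

# But if default resources already injected a mem_mb value, the rule's
# own `mem` is silently ignored -- the problem reported in this thread:
print(finalize_resources({"mem": 10, "mem_mb": 3992}))
```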
I admit my resource definition is a bit of a mess. There was no standard definition for memory in Snakemake, or rather, it was changing.
If I am not mistaken it should be mem_mib
everywhere. The config file defines memory in GB.
So there is definitely an error in run_gunc
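For reference, the difference between mem_mb (decimal megabytes) and mem_mib (binary mebibytes) matters when converting the GB values from the config file. A small illustration in plain Python (not ATLAS code; the helper names are made up):

```python
# Converting a config value given in GB into the two unit families
# Snakemake understands: mem_mb (megabytes) and mem_mib (mebibytes).

def gb_to_mb(gb):
    return gb * 1000                 # 1 GB = 1000 MB

def gb_to_mib(gb):
    return int(gb * 10**9 / 2**20)   # 1 GB is roughly 953.67 MiB

print(gb_to_mb(10))    # 10000
print(gb_to_mib(10))   # 9536
```

The mem_mb=3992 / disk_mib=3808 pair in the log earlier in the thread shows the same MB-to-MiB relationship (modulo rounding).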
I checked all the rule .smk files, and the only rules in which mem_mb is set to a config value without converting GB to MB are the following rules in bin_quality.smk:
run_checkm2
run_gunc
run_busco
(although this rule has been commented out)
Thank you very much. Do you want to make a PR or should I implement it?
I will submit a simple PR
Commenting to add to my original comment:
After adding --default-resources mem_mb=None disk_mb=None to my ATLAS call, jobs weren't showing mem_mb or disk_mb being set in the job resources in the snakemake output. Afterwards, some jobs succeeded, but others didn't, with nothing but a kill... message in the logfile as a clue. I figured that this was the job hitting a resource constraint, and thus the default resource value was still being applied, not the value set to mem in the majority of the workflow's rules.
I added or not isinstance(r.resources["mem_mb"], int) to the if statement that leads to setting mem_mb in the Snakefile. I realize this isn't a robust solution, so I won't make a pull request for it, but it got my jobs working, with mem_mb set to the value intended by the rules and config.yaml.
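Spelled out, the hack amounts to relaxing the guard so that a non-integer mem_mb (such as the DefaultResources callable observed earlier in the thread) is also overwritten. A sketch of the idea, not the exact ATLAS source:

```python
# Workaround sketch: also re-derive mem_mb when the existing value is
# not a plain integer (e.g. a callable injected by --default-resources).

def finalize_resources(resources):
    mem_mb = resources.get("mem_mb")
    if (mem_mb is None or not isinstance(mem_mb, int)) and "mem" in resources:
        resources["mem_mb"] = resources["mem"] * 1000
    return resources

# A callable default no longer shadows the rule's own `mem` setting:
default_callable = lambda wildcards, input, attempt: 1000
print(finalize_resources({"mem": 10, "mem_mb": default_callable}))
```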
I am having the same problem for all the individual sbatch job run parameters, like the minimum memory per node (the disk_mb value?) and the Java -Xm{x,s} parameters in rule_decontam. Hence, I thought it was worth mentioning.
All my jobs are getting killed due to irregular memory allocations (the Java -Xms and disk_mb values are always set to 102GB and 1000M respectively). I attempted to include a new parameter 'java_mem: 2' in the config file (assuming it will take around 2GB for the job). However, the rule consistently disregards this setting.
When I run atlas with the command 'atlas run qc ... --default-resources disk_mb=None disk_mib=None mem=None mem_mb=None java_mem=None', the --mem param from the config file is used (see the log file). But the java_mem param takes on a value of around 153GB (ignoring the 'java_mem: 2' param from the config file), which is roughly 85% of the 'mem=180' value specified in the config file. Yet the total memory allocated to the sbatch jobs by ATLAS is always 1000M; it's not 180GB, or 180GB divided by the number of threads or jobs.
Essentially, my jobs are killed because the program allocates 1000M for the job, but the Java heap size gets more space than the job's total RAM. I seem to be stuck at this point.
Thanks for this thread/feature request btw.
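The numbers in this report fit together if java_mem is derived as a fraction of mem while the cluster submission only sees the small fallback mem_mb. A rough reconstruction of that arithmetic (the 0.85 factor is inferred from the "roughly 85% of mem=180" observation above; this is an illustration, not the ATLAS implementation):

```python
# Illustration of the mismatch: the JVM heap is sized from `mem` (GB),
# but sbatch only receives the fallback mem_mb, so the heap can exceed
# the job's actual allocation and the job gets killed.

JAVA_MEM_FRACTION = 0.85   # inferred from the thread, not verified

mem_gb = 180                                    # mem: 180 in config.yaml
java_mem_gb = int(mem_gb * JAVA_MEM_FRACTION)   # heap size derived from mem
sbatch_mem_mb = 1000                            # fallback default seen in logs

print(java_mem_gb)                              # 153
print(java_mem_gb * 1000 > sbatch_mem_mb)       # True: heap > allocation
```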
Sorry @the-eon-flux I missed your comment.
- So the problem arises because your cluster profile defines --default-resources somewhere. If there is a way to deactivate this, that would solve it.
- java_mem: 2 in the config file has no effect.
- I think disk_mb is not the problem, if I understand correctly.
--default-resources disk_mb=None disk_mib=None mem=None mem_mb=None java_mem=None
makes atlas use the mem argument from the config file and 0.85*mem for java_mem, as expected. However, mem_mb, which is used by the cluster wrapper, is not set.
Solution:
- [ ] I should use mem_mb throughout the pipeline and set appropriate default-resources. This needs some testing.
Intermediate solutions:
- hack from @LLansing: add or not isinstance(r.resources["mem_mb"], int) to the if statement that leads to setting mem_mb in the Snakefile
- You can set resource arguments for specific rules:
--default-resources mem_mb=80000 --set-resources rule_decontam:mem_mb=180000 rule_decontam:java_mem=150
assuming you have 80 (GB) in your config file.
@SilasK, Thank you so much for your help. I tried solution number 2 and it has started sbatch jobs for rule_decontam with the user-specified parameters. I had jobs crashing for this specific rule earlier, so I am running rule_decontam first. Afterward, I will see if there are problems with other rules as well.
atlas run qc --profile cluster --jobs 12 --keep-going --until run_decontamination --latency-wait 30000 --default-resources mem_mb=100000 --set-resources rule_decontam:mem_mb=180000 rule_decontam:java_mem=150
This was my atlas cmd which was wrapped in an sbatch cmd.
Fortunately, I was also successful after experimenting for some time. On our SLURM system, the following config seems to work now:
atlas run all --profile cluster --wrapper-prefix 'git+file:///path/to/snakemake-wrappers' --default-resources mem_mb=250000
I'm setting the mem_mb variable according to the large_mem defined in config.yaml (in megabytes).
This thread was very helpful. Thank you very much!
I have a short follow-up question after reading about the rule-specific resource arguments.
For me, the rule eggNOG_annotation often fails because it runs out of time. I ended up editing the file workflow/rules/genecatalog.smk by adding time=config["runtime"]["long"] for this rule.
Could something like this work instead?
atlas run --set-resources rule_eggNOG_annotation:runtime=24
Yes, it should work, but without the rule_ prefix:
atlas run --set-resources eggNOG_annotation:runtime=24
There is also an option to use a virtual disk (shm) to accelerate eggNOG. And maybe you do not want eggNOG for the gene catalog at all; there are other annotations (KEGG, CAZy).
Hi all :). This thread has been super helpful for me, as I have also been having problems with really strange mem_mb and disk_mb values being set.
@SilasK, has this been fixed at this point, or is it still a work in progress? I am using a SLURM system and am encountering the problems above (setting the resources manually in the actual atlas command has fixed this and I am progressing slowly, but it would be nice to not have to do this if at all possible).
I am especially having problems with the download_gunc rule, which sets disk_mb to a value of 1000. Even when specifying larger amounts of mem_mb and disk_mb, this rule keeps failing, saying I am out of space. I have checked all my space allocations on my node and there is ample room for this database. I have attached my command below that has got me through the first few steps of the pipeline :).
atlas run all \
--working-dir /rc_scratch/beyo2625/sctld_patho \
--config-file /rc_scratch/beyo2625/sctld_patho/config.yaml \
--jobs 20 \
--profile cluster \
--default-resources mem_mb=250000 \
--set-resources deduplicate_reads:mem_mb=80000 dram_download:disk_mb=100000 download_gunc:disk_mb=100000 download_gunc:mem_mb=100000
If you need any more information, please let me know and I can submit it.
I agree the default mem should be in the atlas command. The issue with disk_mb is new to me. Let's discuss in #706
There has been no activity for some time. I hope your issue has been solved in the meantime. This issue will automatically close soon if no further activity occurs.
Thank you for your contributions.
If default resources are defined via the command line, the resource attribution in ATLAS fails. See #668