Closed — jotech closed this issue 1 year ago
Can you increase `mem` even more? `simplejob_mem` can be less.
If you set `mem` to 120, bbmap gets 85% of it = 102 GB.
If I set `mem: 500` in the config.yaml, the error remains. It doesn't seem to influence resource allocation at all:
```
sacct -j 1525720
JobID                JobName    User      Partition  NodeList        Elapsed    State      ExitCode MaxRSS     AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- --------------------------------
1525720              apply_qua+ x         base       n155            00:01:03   OUT_OF_ME+ 0:125               billing=10,cpu=8,mem=8885M,node+
1525720.batch        batch                           n155            00:01:03   OUT_OF_ME+ 0:125    8595640K   cpu=8,mem=8885M,node=1
1525720.extern       extern                          n155            00:01:03   OUT_OF_ME+ 0:125    28K        billing=10,cpu=8,mem=8885M,node+
```
OK, then there is a problem with the conversion when passing the memory to the cluster. What cluster wrapper do you use, and which resource keyword does it accept?
As there was no clear standard for naming resources, each rule in atlas should have three resource attributes:

- `mem` (in GB)
- `mem_mb`
- `mem_mib` (the new standard, as far as I can see)

The only thing that can go wrong is to interpret `mem` as MB.
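For illustration, a conversion consistent with that convention might look like the sketch below. This is not Atlas's actual code; the function and dictionary names are mine. It shows why interpreting a GB-valued `mem` as MB would under-allocate by a factor of 1000.

```python
def infer_mem_mb(resources):
    """Illustrative sketch: derive mem_mb/mem_mib from a GB-valued `mem`.

    Assumes `mem` is specified in GB. If a scheduler wrapper instead
    reads `mem` as MB, the job gets 1000x too little memory.
    """
    if "mem_mb" not in resources and "mem" in resources:
        resources["mem_mb"] = resources["mem"] * 1000                    # GB -> MB
        resources["mem_mib"] = int(resources["mem"] * 1000 / 1.048576)   # MB -> MiB
    return resources

print(infer_mem_mb({"mem": 10}))  # mem_mb=10000, mem_mib=9536
```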
Hi @SilasK, I tried to investigate this a bit further.
A working rule:

```
rule get_read_stats:
    input: S221200000963/sequence_quality_control/S221200000963_raw_R1.fastq.gz, S221200000963/sequence_quality_control/S221200000963_raw_R2.fastq.gz
    output: S221200000963/sequence_quality_control/read_stats/raw.zip, S221200000963/sequence_quality_control/read_stats/raw_read_counts.tsv
    log: S221200000963/logs/QC/read_stats/raw.log
    jobid: 11
    reason: Missing output files: S221200000963/sequence_quality_control/read_stats/raw.zip, S221200000963/sequence_quality_control/read_stats/raw_read_counts.tsv
    wildcards: sample=S221200000963, step=raw
    priority: 30
    threads: 4
    resources: mem_mb=9187, disk_mb=9187, disk_mib=8762, tmpdir=<TBD>, mem=10, java_mem=8, mem_mib=9187, time_min=300, runtime=300
```
```
sacct -j 1530622
JobID                JobName    User      Partition  NodeList        Elapsed    State      ExitCode MaxRSS     AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- --------------------------------
1530622              get_read_+ x         base       n107            00:02:30   COMPLETED  0:0                 billing=6,cpu=4,mem=9187M,node=1
1530622.batch        batch                           n107            00:02:30   COMPLETED  0:0      5216672K   cpu=4,mem=9187M,node=1
1530622.extern       extern                          n107            00:02:30   COMPLETED  0:0      132K       billing=6,cpu=4,mem=9187M,node=1
```
A rule that does not work (`deduplicate_reads` throws the same error as `apply_quality_filter`):
```
rule deduplicate_reads:
    input: S221200000963/sequence_quality_control/S221200000963_raw_R1.fastq.gz, S221200000963/sequence_quality_control/S221200000963_raw_R2.fastq.gz
    output: S221200000963/sequence_quality_control/S221200000963_deduplicated_R1.fastq.gz, S221200000963/sequence_quality_control/S221200000963_deduplicated_R2.fastq.gz
    log: S221200000963/logs/QC/deduplicate.log
    jobid: 6
    benchmark: logs/benchmarks/QC/deduplicate/S221200000963.txt
    reason: Missing output files: S221200000963/sequence_quality_control/S221200000963_deduplicated_R2.fastq.gz, S221200000963/sequence_quality_control/S221200000963_deduplicated_R1.fastq.gz
    wildcards: sample=S221200000963
    threads: 8
    resources: mem_mb=9187, disk_mb=9187, disk_mib=8762, tmpdir=<TBD>, mem=200, java_mem=170, mem_mib=9187, time_min=300, runtime=300
```
```
sacct -j 1530620
JobID                JobName    User      Partition  NodeList        Elapsed    State      ExitCode MaxRSS     AllocTRES
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- --------------------------------
1530620              deduplica+ x         base       n193            00:01:18   OUT_OF_ME+ 0:125               billing=10,cpu=8,mem=9187M,node+
1530620.batch        batch                           n193            00:01:18   OUT_OF_ME+ 0:125    9323056K   cpu=8,mem=9187M,node=1
1530620.extern       extern                          n193            00:01:18   OUT_OF_ME+ 0:125    80K        billing=10,cpu=8,mem=9187M,node+
```
The difference between the two rules is that `java_mem` is calculated once using `simplejob_mem` (working) and the other time using `mem` (not working), as defined in `rules/qc.smk`.

Somehow the memory allocation when submitting the job is incorrect, because the resources for the failing job are `resources: mem_mb=9187, disk_mb=9187, disk_mib=8762, tmpdir=<TBD>, mem=200, java_mem=170, mem_mib=9187, time_min=300, runtime=300` (rule `deduplicate_reads`, see above).
It seems the job asks for 9187M, but the Java memory for `clumpify` is much higher (`mem * JAVA_MEM_FRACTION` = 200 * 0.85 = 170 GB). As far as I can see, this does not match. What do you think?
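The mismatch can be checked with simple arithmetic, using the values from the job info above (the fraction 0.85 is taken from the thread; the variable names here are illustrative):

```python
JAVA_MEM_FRACTION = 0.85   # fraction of `mem` handed to the JVM, per the thread

mem_gb = 200                                    # rule resource `mem` for deduplicate_reads
java_mem_gb = round(mem_gb * JAVA_MEM_FRACTION) # what clumpify's JVM will request (-Xmx)
alloc_mb = 9187                                 # what Slurm actually allocated (mem=9187M)

# The JVM asks for ~170 GB while the job owns only ~9.2 GB,
# so the OUT_OF_MEMORY kill is expected.
print(f"JVM wants {java_mem_gb} GB, Slurm granted {alloc_mb / 1000:.1f} GB")
```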
Here is the code that defines all the different ways the resources are set: https://github.com/metagenome-atlas/atlas/blob/master/workflow/Snakefile#L405C25-L405C25
It gets screwed up if `mem_mb` is defined by default. I guess your cluster profile somehow defines `--default-resources`, which makes this not work. Which profile do you use? Do you use the Slurm submission directly integrated into Snakemake?

The best way to fix this is to deactivate the definition of default resources in your profile.
Alternatively, it might work to set:

```
atlas run ... --default-resources mem=200 mem_mb=200000 mem_mib=200000 java_mem=170
```
Short follow-up: using the resource parameters you provided seems to work!
I'm using the default cluster profile (atlas run all --profile cluster). As you suggested, there might be an issue with the resources variable on our cluster. I will reach out to our cluster support.
@SilasK I'm not sure if this merits opening an issue, but I thought I'd mention a problem I'm encountering with default resources + cluster execution that sounds quite similar to the one encountered by @jotech.
I'm using the slurm cluster profile available from the cookiecutter command suggested for use in the ATLAS docs (in the cluster execution section).
It seems default resources are overriding the resource parameter values set in the ATLAS rule definitions, although I'm not defining default resources on the command line or setting them elsewhere.
The `initialize_qc` rule sets `mem=config["simplejob_mem"]`, which is 10 (GB) in the config.yaml. From what I can tell, this `mem` parameter is converted to `mem_mb` near the end of the ATLAS Snakefile IF `mem_mb` is not listed in the job resources. However, snakemake submits the job with the following resources: `resources: mem_mb=3992, disk_mb=3992, disk_mib=3808, tmpdir=<TBD>, mem=10, java_mem=8, mem_mib=3992, time_min=300, runtime=300`. Clearly the `mem` parameter isn't being converted to `mem_mb`; rather, default resource values are being applied from some source unknown to me. I'm currently looking for a solution to prevent snakemake's default resources from being applied, other than `--default-resources mem_mb=None`.
EDIT:
I've added `--default-resources mem_mb=None disk_mb=None` to my ATLAS call, which removed the limiting default values that were killing my jobs.
However, with a bit of inspection-via-`print()` in the Snakefile section where `mem` is converted to `mem_mb`, I found that `r.resources["mem_mb"]` exists, and therefore the `if` statement is not entered. The value of `r.resources["mem_mb"]` is some DefaultResources function from what I can tell (it prints as `<function DefaultResources.__init__.<locals>.fallback.<locals>.callable at 0x7f34aba8cd30>`).
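One possible defensive check for this situation, sketched below, would treat a non-numeric `mem_mb` (such as a DefaultResources fallback callable) as "not set" before converting. This is my own illustration, not Atlas's or Snakemake's actual code:

```python
def mem_mb_is_set(resources):
    """Treat mem_mb as unset if it is missing or is a fallback callable
    rather than a concrete number (assumption: real values are int/float)."""
    value = resources.get("mem_mb")
    return isinstance(value, (int, float))

# Mimic the observed situation: mem_mb is a callable, mem is 10 GB.
resources = {"mem": 10, "mem_mb": lambda *args, **kwargs: 0}

# Convert the GB-valued `mem` only when mem_mb is not a real number:
if not mem_mb_is_set(resources):
    resources["mem_mb"] = resources["mem"] * 1000  # GB -> MB

print(resources["mem_mb"])  # 10000
```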
In the snakemake output at rule submission, it no longer lists a value for `mem_mb` (it's not setting it to a default value; that's good), and `mem` is set to the corresponding value in the config.yaml file. However, the `mem` value is not being converted to `mem_mb` as I believe is intended in the abovementioned section of the Snakefile.
I've noticed in my investigation that many rules set the `mem` parameter, but some set `mem_mb` directly. I've also noticed that some rules set `mem_mb` to a config value * 1000, essentially converting the config GB value to MB (e.g. rule `download_gunc`), whereas other rules simply set `mem_mb` equal to the config GB value, without multiplying to convert GB to MB (e.g. rule `run_gunc`). Is there a pattern to these different methods of setting `mem_mb`? Why not set `mem_mb=config.<appropriate mem>` in the rule resources across the board?
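The inconsistency described above could be avoided with a single helper used by every rule. This is only a sketch of the idea; the function name and config keys are invented for illustration:

```python
def mem_mb_from_config(config, key):
    """Convert a GB-valued config entry to MB in exactly one place.

    A rule that instead passes the GB value straight to mem_mb
    (as `run_gunc` reportedly does) would request 1000x too little
    memory from the scheduler.
    """
    return int(config[key]) * 1000

# Hypothetical config values, following the GB convention in config.yaml:
config = {"simplejob_mem": 10, "mem": 200}
print(mem_mb_from_config(config, "simplejob_mem"))  # 10000
print(mem_mb_from_config(config, "mem"))            # 200000
```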
Thank you for your comment. Let's discuss in #676.
apply_quality_filter

Here is the relevant log output:

job info

Atlas version: 2.16.2 (dev)

Additional context

Changing the config.yaml (mem=120, simplejob_mem=500) didn't help. How can I specifically increase the memory needed for apply_quality_filter jobs?