metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

Error in rule apply_quality_filter #668

Closed · jotech closed 1 year ago

jotech commented 1 year ago

Error in rule apply_quality_filter:
    jobid: 44
    input: S221200000966/sequence_quality_control/S221200000966_deduplicated_R1.fastq.gz, S221200000966/sequence_quality_control/S221200000966_deduplicated_R2.fastq.gz, [...]dat/db/atlas/adapters.fa
    output: S221200000966/sequence_quality_control/S221200000966_filtered_R1.fastq.gz, S221200000966/sequence_quality_control/S221200000966_filtered_R2.fastq.gz, S221200000966/sequence_quality_control/S221200000966_filtered_se.fastq.gz, S221200000966/logs/S221200000966_quality_filtering_stats.txt
    log: S221200000966/logs/QC/quality_filter.log (check log file(s) for error details)
    conda-env: [...]dat/db/atlas/conda_envs/baf452aa6f8992d0d74d69497829a56b_
    shell:
         bbduk.sh in1=S221200000966/sequence_quality_control/S221200000966_deduplicated_R1.fastq.gz in2=S221200000966/sequence_quality_control/S221200000966_deduplicated_R2.fastq.gz  ref=[...]dat/db/atlas/adapters.fa  interleaved=f  out=S221200000966/sequence_quality_control/S221200000966_filtered_R1.fastq.gz out2=S221200000966/sequence_quality_control/S221200000966_filtered_R2.fastq.gz outs=S221200000966/sequence_quality_control/S221200000966_filtered_se.fastq.gz  stats=S221200000966/logs/S221200000966_quality_filtering_stats.txt  overwrite=true  qout=33  trd=t  hdist=1  k=27  ktrim=r  mink=8  trimq=10  qtrim=rl  threads=8  minlength=51  maxns=-1  minbasefrequency=0.05  ecco=t  prealloc=t  pigz=t unpigz=t  -Xmx102G  2> S221200000966/logs/QC/quality_filter.log
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    cluster_jobid: 1525371

Here is the relevant log output:


java -ea -Xmx102G -Xms102G -cp [...]dat/db/atlas/conda_envs/baf452aa6f8992d0d74d69497829a56b_/opt/bbmap-39.01-1/current/ jgi.BBDuk in1=S221200000966/sequence_quality_control/S221200000966_deduplicated_R1.fastq.gz in2=S221200000966/sequence_quality_control/S221200000966_deduplicated_R2.fastq.gz ref=[...]dat/db/atlas/adapters.fa interleaved=f out=S221200000966/sequence_quality_control/S221200000966_filtered_R1.fastq.gz out2=S221200000966/sequence_quality_control/S221200000966_filtered_R2.fastq.gz outs=S221200000966/sequence_quality_control/S221200000966_filtered_se.fastq.gz stats=S221200000966/logs/S221200000966_quality_filtering_stats.txt overwrite=true qout=33 trd=t hdist=1 k=27 ktrim=r mink=8 trimq=10 qtrim=rl threads=8 minlength=51 maxns=-1 minbasefrequency=0.05 ecco=t prealloc=t pigz=t unpigz=t -Xmx102G                                                                                                                                                                                                                                                 
Executing jgi.BBDuk [in1=S221200000966/sequence_quality_control/S221200000966_deduplicated_R1.fastq.gz, in2=S221200000966/sequence_quality_control/S221200000966_deduplicated_R2.fastq.gz, ref=[...]dat/db/atlas/adapters.fa, interleaved=f, out=S221200000966/sequence_quality_control/S221200000966_filtered_R1.fastq.gz, out2=S221200000966/sequence_quality_control/S221200000966_filtered_R2.fastq.gz, outs=S221200000966/sequence_quality_control/S221200000966_filtered_se.fastq.gz, stats=S221200000966/logs/S221200000966_quality_filtering_stats.txt, overwrite=true, qout=33, trd=t, hdist=1, k=27, ktrim=r, mink=8, trimq=10, qtrim=rl, threads=8, minlength=51, maxns=-1, minbasefrequency=0.05, ecco=t, prealloc=t, pigz=t, unpigz=t, -Xmx102G]
Version 39.01

Set INTERLEAVED to false
Set threads to 8
Initial size set to 1014718593
maskMiddle was disabled because useShortKmers=true
[...]dat/db/atlas/conda_envs/baf452aa6f8992d0d74d69497829a56b_/bin/bbduk.sh: line 370: 2563011 Killed                  java -ea -Xmx102G -Xms102G -cp [...]dat/db/atlas/conda_envs/baf452aa6f8992d0d74d69497829a56b_/opt/bbmap-39.01-1/current/ jgi.BBDuk in1=S221200000966/sequence_quality_control/S221200000966_deduplicated_R1.fastq.gz in2=S221200000966/sequence_quality_control/S221200000966_deduplicated_R2.fastq.gz ref=[...]dat/db/atlas/adapters.fa interleaved=f out=S221200000966/sequence_quality_control/S221200000966_filtered_R1.fastq.gz out2=S221200000966/sequence_quality_control/S221200000966_filtered_R2.fastq.gz outs=S221200000966/sequence_quality_control/S221200000966_filtered_se.fastq.gz stats=S221200000966/logs/S221200000966_quality_filtering_stats.txt overwrite=true qout=33 trd=t hdist=1 k=27 ktrim=r mink=8 trimq=10 qtrim=rl threads=8 minlength=51 maxns=-1 minbasefrequency=0.05 ecco=t prealloc=t pigz=t unpigz=t -Xmx102G

job info

sacct -j 1525371
               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     MaxRSS                        AllocTRES 
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- -------------------------------- 
             1525371 apply_qua+  x       base            n165   00:01:04 OUT_OF_ME+    0:125            billing=12,cpu=8,mem=18246M,nod+ 
       1525371.batch      batch                                 n165   00:01:04 OUT_OF_ME+    0:125  18556844K          cpu=8,mem=18246M,node=1 
      1525371.extern     extern                                 n165   00:01:04 OUT_OF_ME+    0:125        16K billing=12,cpu=8,mem=18246M,nod+ 

Atlas version: 2.16.2 (dev)

Additional context

How can I specifically increase the memory needed for apply_quality_filter jobs?

SilasK commented 1 year ago

Can you increase mem even more? simplejob_mem can be less.

If you set mem to 120, bbmap gets 85% of it = 102.
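
For reference, a minimal sketch of that relationship; the 0.85 fraction (JAVA_MEM_FRACTION) is inferred from the 200 * 0.85 = 170 arithmetic later in this thread, so treat the constant and the helper name as assumptions:

# Sketch only: how the Java heap is derived from the configured memory (GB).
JAVA_MEM_FRACTION = 0.85  # assumed fraction; the real constant lives in the atlas source

def java_mem_gb(mem_gb: int) -> int:
    """Heap passed to the JVM via -Xmx, in GB."""
    return int(mem_gb * JAVA_MEM_FRACTION)

print(java_mem_gb(120))  # 102 -> matches the -Xmx102G in the failing command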

jotech commented 1 year ago

If I set mem: 500 in the config.yaml, the error remains. It doesn't seem to influence the resource allocation at all:

sacct -j 1525720
               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     MaxRSS                        AllocTRES 
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- -------------------------------- 
             1525720 apply_qua+         x       base            n155   00:01:03 OUT_OF_ME+    0:125            billing=10,cpu=8,mem=8885M,node+ 
       1525720.batch      batch                                 n155   00:01:03 OUT_OF_ME+    0:125   8595640K           cpu=8,mem=8885M,node=1 
      1525720.extern     extern                                 n155   00:01:03 OUT_OF_ME+    0:125        28K billing=10,cpu=8,mem=8885M,node+ 

SilasK commented 1 year ago

OK, then there is a problem with the conversion when passing the memory to the cluster. Which cluster wrapper do you use, and which resource keyword does it accept?

As there was no clear standard for naming resources, each rule in atlas should have three resource attributes: mem (in GB), mem_mb (in MB), and java_mem (the JVM heap, in GB).

The only thing that can go wrong is interpreting mem as MB.
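
To make that pitfall concrete, a hedged illustration (not actual atlas or wrapper code): mem is meant in GB, so a wrapper that forwards it to Slurm as a megabyte value under-allocates by a factor of 1000:

# Illustration: mem and java_mem are in GB, mem_mb is in MB.
resources = {"mem": 200, "java_mem": 170, "mem_mb": 200_000}

# Correct: request mem_mb megabytes from the scheduler.
ok = f"--mem={resources['mem_mb']}M"   # --mem=200000M

# Wrong: interpreting mem as MB requests 200 MB for a 170 GB Java heap.
bad = f"--mem={resources['mem']}M"     # --mem=200M -> OOM kill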

jotech commented 1 year ago

hi @SilasK I tried to investigate this a bit further.

a working rule

rule get_read_stats:                                                                                                                    
    input: S221200000963/sequence_quality_control/S221200000963_raw_R1.fastq.gz, S221200000963/sequence_quality_control/S221200000963_raw_R2.fastq.gz
    output: S221200000963/sequence_quality_control/read_stats/raw.zip, S221200000963/sequence_quality_control/read_stats/raw_read_counts.tsv    
    log: S221200000963/logs/QC/read_stats/raw.log
    jobid: 11
    reason: Missing output files: S221200000963/sequence_quality_control/read_stats/raw.zip, S221200000963/sequence_quality_control/read_stats/raw_read_counts.tsv
    wildcards: sample=S221200000963, step=raw
    priority: 30
    threads: 4
    resources: mem_mb=9187, disk_mb=9187, disk_mib=8762, tmpdir=<TBD>, mem=10, java_mem=8, mem_mib=9187, time_min=300, runtime=300
sacct -j 1530622
               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     MaxRSS                        AllocTRES 
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- -------------------------------- 
             1530622 get_read_+  x              base            n107   00:02:30  COMPLETED      0:0            billing=6,cpu=4,mem=9187M,node=1 
       1530622.batch      batch                                 n107   00:02:30  COMPLETED      0:0   5216672K           cpu=4,mem=9187M,node=1 
      1530622.extern     extern                                 n107   00:02:30  COMPLETED      0:0       132K billing=6,cpu=4,mem=9187M,node=1 

a rule that does not work (deduplicate_reads throws the same error as apply_quality_filter)

rule deduplicate_reads:
    input: S221200000963/sequence_quality_control/S221200000963_raw_R1.fastq.gz, S221200000963/sequence_quality_control/S221200000963_raw_R2.fastq.gz
    output: S221200000963/sequence_quality_control/S221200000963_deduplicated_R1.fastq.gz, S221200000963/sequence_quality_control/S221200000963_deduplicated_R2.fastq.gz
    log: S221200000963/logs/QC/deduplicate.log
    jobid: 6
    benchmark: logs/benchmarks/QC/deduplicate/S221200000963.txt
    reason: Missing output files: S221200000963/sequence_quality_control/S221200000963_deduplicated_R2.fastq.gz, S221200000963/sequence_quality_control/S221200000963_deduplicated_R1.fastq.gz
    wildcards: sample=S221200000963
    threads: 8
    resources: mem_mb=9187, disk_mb=9187, disk_mib=8762, tmpdir=<TBD>, mem=200, java_mem=170, mem_mib=9187, time_min=300, runtime=300
sacct -j 1530620
               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     MaxRSS                        AllocTRES 
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- -------------------------------- 
             1530620 deduplica+  x              base            n193   00:01:18 OUT_OF_ME+    0:125            billing=10,cpu=8,mem=9187M,node+ 
       1530620.batch      batch                                 n193   00:01:18 OUT_OF_ME+    0:125   9323056K           cpu=8,mem=9187M,node=1 
      1530620.extern     extern                                 n193   00:01:18 OUT_OF_ME+    0:125        80K billing=10,cpu=8,mem=9187M,node+ 

The difference between the two rules is that java_mem is calculated once from simplejob_mem (working) and once from mem (not working), as defined in rules/qc.smk.

Somehow the memory allocation at job submission is incorrect: the failing job's resources are mem=200, java_mem=170, mem_mb=9187 (rule deduplicate_reads, see above). The job asks for 9187M, but the Java memory for clumpify is much higher (mem * JAVA_MEM_FRACTION = 200 * 0.85 = 170 GB).

As far as I can see, this does not match. What do you think?
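
A quick sanity check with the numbers above makes the mismatch explicit (illustrative only):

# Numbers from rule deduplicate_reads above.
mem_mb = 9187    # what the scheduler actually allocated (MB)
java_mem = 170   # heap requested for clumpify (GB)

assert java_mem * 1024 > mem_mb  # 174080 MB heap vs 9187 MB allocation -> OOM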

SilasK commented 1 year ago

Here is the code that reconciles all the different ways the resources can be defined: https://github.com/metagenome-atlas/atlas/blob/master/workflow/Snakefile#L405C25-L405C25

It gets screwed up if mem_mb is defined by default.
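
Based on the behaviour described later in this thread, that section presumably does something like the following (a simplified, self-contained sketch, not the verbatim Snakefile code):

# Sketch of the mem -> mem_mb reconciliation: convert the GB-based "mem"
# only when no mem_mb is already defined.
def reconcile(resources: dict) -> dict:
    if "mem" in resources and "mem_mb" not in resources:
        resources["mem_mb"] = resources["mem"] * 1000  # GB -> MB
    return resources

print(reconcile({"mem": 200}))                  # {'mem': 200, 'mem_mb': 200000}
print(reconcile({"mem": 200, "mem_mb": 9187}))  # default mem_mb wins; no conversion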

I guess your cluster profile somehow defines --default-resources, which makes this not work.

Which profile do you use? Do you use the slurm submission directly integrated into snakemake?

The best way to fix this is to deactivate the definition of default resources in your profile.

Alternatively, it might work to set:

atlas run ... --default-resources mem=200 mem_mb=200000 mem_mib=200000 java_mem=170

jotech commented 1 year ago

A short follow-up here: using the resource parameters you provided seems to work!

I'm using the default cluster profile (atlas run all --profile cluster). As you suggested, there might be an issue with the resources variable on our cluster. I will reach out to our cluster support.

LLansing commented 1 year ago

@SilasK I'm not sure if this merits opening a separate issue, but I thought I'd mention a problem I'm encountering with default resources + cluster execution that sounds quite similar to the one encountered by @jotech.

I'm using the slurm cluster profile available from the cookiecutter command suggested for use in the ATLAS docs (in the cluster execution section).

It seems default resources are overriding the resource parameter values set in the ATLAS rule definitions, although I'm not defining default resources on the command line or setting them elsewhere.

I'm currently looking for a solution that prevents snakemake's default resources from being applied, other than passing --default-resources mem_mb=None.


EDIT: I've added --default-resources mem_mb=None disk_mb=None to my ATLAS call, which removed the limiting default values that were killing my jobs. However, with a bit of inspection-via-print() in the Snakefile section where mem is converted to mem_mb, r.resources["mem_mb"] exists, and therefore the if statement is not entered. The value of r.resources["mem_mb"] is some DefaultResources function from what I can tell (it prints as <function DefaultResources.__init__.<locals>.fallback.<locals>.callable at 0x7f34aba8cd30>). In the snakemake output at rule submission, mem_mb no longer lists a value (it is not being set to a default; that's good), and mem is set to the corresponding value in the config.yaml file. But mem is not being converted to mem_mb, as I believe is intended in the above-mentioned section of the Snakefile.
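
A hedged sketch of why such a guard can be defeated, assuming the conversion tests for key membership as described above:

# --default-resources can register mem_mb as a *callable* fallback (cf. the
# <function DefaultResources...callable> printed above), so a key-membership
# test succeeds even though no concrete number exists yet.
resources = {"mem": 200}

def mem_mb_fallback(wildcards=None):
    return None  # stand-in for snakemake's deferred default; evaluated later

resources["mem_mb"] = mem_mb_fallback  # injected by the default resources

if "mem_mb" not in resources:          # False: the key exists (as a callable)
    resources["mem_mb"] = resources["mem"] * 1000  # never reached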

In my investigation I've noticed that many rules set the mem parameter, but some set mem_mb directly. I've also noticed that some rules set mem_mb to a config value * 1000, essentially converting the config GB value to MB (e.g. rule download_gunc), whereas other rules simply set mem_mb equal to the config GB value without multiplying (e.g. rule run_gunc). Is there a pattern to these different methods of setting mem_mb? Why not set mem_mb=config.<appropriate mem> in the rule resources across the board?
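
For illustration, two hypothetical rule stubs contrasting the patterns (the rule names are placeholders and the config keys are assumptions; the real definitions live in the atlas workflow):

# Pattern A, like download_gunc: config GB value converted to MB.
rule pattern_a:
    resources:
        mem_mb=config["simplejob_mem"] * 1000,

# Pattern B, like run_gunc: config GB value passed through unconverted.
rule pattern_b:
    resources:
        mem_mb=config["mem"],  # a 200 here means 200 MB, though 200 GB was meant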

SilasK commented 1 year ago

Thank you for your comment. Let's discuss in #676.