Hardcoded 8GB limit in the FilterConsensusReads processes

lbeltrame commented 4 months ago

Description of the bug

There is a hardcoded limit of 8GB in the FilterConsensusReads processes defined in modules.config. Aside from being on the low side for very deeply sequenced datasets (think ctDNA), it relies on retries to increase memory requirements should the task fail. However, with many schedulers like SLURM, if the task runs out of memory (a likely possibility with 10,000X data), it is killed, causing an error and the premature end of the pipeline (so it won't ever be retried).

I would think this should belong to the user's definitions and should not be hardcoded.

Command used and terminal output

No response

Relevant files

No response

System information

Nextflow: 23.04.2
Hardware: HPC
Executor: slurm
Container Engine: apptainer
OS: Debian
nf-core/fastquorum version: 1.0.0

SPPearce commented 4 months ago

Do you get an empty exitcode from slurm?

lbeltrame commented 4 months ago

This is because the process is killed, and since it is piped into samtools, samtools errors out because the file is incomplete.

EDIT: Pasted the wrong log.

  fgbio \                                                                                                                         
      -Xmx8g \                                                                                                                    
      --tmp-dir=. \                                                                                                               
      --compression=0 \                                                                                                           
      FilterConsensusReads \                                                                                                      
      --input 5571767_P4.mapped.bam \                                                                                             
      --output /dev/stdout \                                                                                                      
      --ref hg38.fa \                                                                                                             
      --min-reads 1 \                                                                                                             
      --min-base-quality 20 \                                                                                                     
      --max-base-error-rate 0.1 \                                                                                                 
       \                                                                                                                          
      | samtools sort \                                                                                                           
      --threads 4 \                                                                                                               
      -o 5571767_P4.cons.filtered.bam##idx##5571767_P4.cons.filtered.bam.bai \                                                    
      --write-index \                                                                                                             
      ;                                                                                                                           

  cat <<-END_VERSIONS > versions.yml                                                                                              
  "NFCORE_FASTQUORUM:FASTQUORUM:FILTERCONSENSUSREADS":                                                                            
      fgbio: $( echo $(fgbio --version 2>&1 | tr -d '[:cntrl:]' ) | sed -e 's/^.*Version: //;s/\[.*$//')                          
  END_VERSIONS                                                                                                                    

Command exit status:                                                                                                              
  1                                                                                                                               

Command output:                                                                                
  (empty)                                                                                                       

Command error:                                                                                 
  [2024/06/04 10:38:20 | FgBioMain | Info] Executing FilterConsensusReads from fgbio version 2.0.2 as lbeltrame@node7 on JRE 17.0.3-internal+0-adhoc..src with snappy, IntelInflater, and IntelDeflater
  [2024/06/04 10:39:07 | FilterConsensusReads | Info] Filtering reads.
  [2024/06/04 10:39:08 | SamWriter | Info] Seen many non-increasing record positions. Printing Read-names as well.
  [W::bgzf_read_block] EOF marker is absent. The input may be truncated
  samtools sort: truncated file. Aborting

It doesn't tell you why, but looking at .command.log:

[2024/06/04 12:33:57 | FgBioMain | Info] Executing FilterConsensusReads from fgbio version 2.0.2 as lbeltrame@node4 on JRE 17.0.3-internal+0-adhoc..src with snappy, IntelInflater, and IntelDeflater                                                              
[2024/06/04 12:34:55 | FilterConsensusReads | Info] Filtering reads.                                                             
[2024/06/04 12:34:56 | SamWriter | Info] Seen many non-increasing record positions. Printing Read-names as well.                 
[W::bgzf_read_block] EOF marker is absent. The input may be truncated                                                            
samtools sort: truncated file. Aborting
slurmstepd-node4: error: Detected 1 oom-kill event(s) in step 647900.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

lbeltrame commented 4 months ago

Notice that there is a similar problem with FastQC. I had to make an override to raise the memory limit to 12G to make sure it finished.
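
For reference, an override of this kind might look like the sketch below. It is not taken from the pipeline: the process name `FASTQC` and the file name `custom.config` are assumptions for illustration, and only the 12 GB figure comes from the comment above.

```groovy
// custom.config (hypothetical file name), passed to Nextflow with `-c custom.config`
process {
    // Assumption: the FastQC process is named FASTQC in this pipeline.
    withName: 'FASTQC' {
        // Fixed request of 12 GB, the value reported above as sufficient.
        memory = 12.GB
    }
}
```

Settings supplied with `-c` take precedence over the pipeline's own config files, so the pipeline itself does not need to be edited.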

nh13 commented 2 months ago

I'd be glad to be pointed to where you think the config should live. Is the user config you're talking about in conf/base.config? Given that the FilterConsensusReads config is module-level, why isn't it appropriate to have it in conf/modules.config?

Of course, users are always free to override any of the configs in this pipeline, which is by design. I think it's reasonable that folks would need to customize the resources based on their input data and compute environment. What am I missing?

SPPearce commented 2 months ago

Hi Nils, I think the issue is that the memory is fixed at 8GB and doesn't increase when the module fails, because it doesn't have the user retry part.
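
As a rough illustration of a memory request that grows on retry, a user config could contain something like the following. This is a sketch under assumptions, not the change made in the pipeline: the selector uses the process name seen in the log above (`FILTERCONSENSUSREADS`), while the starting value, retry count, and the choice to retry on every failure are illustrative.

```groovy
// Hypothetical user config, passed to Nextflow with `-c <file>`
process {
    withName: 'FILTERCONSENSUSREADS' {
        // Scale the request with each attempt: 8 GB, then 16 GB, then 24 GB.
        memory        = { 8.GB * task.attempt }
        // Assumption: retry on any failure. In the log above the OOM-killed
        // task surfaces as exit status 1 (samtools aborts on the truncated
        // stream), so a rule matching only kill-signal codes such as 137
        // would not catch it.
        errorStrategy = 'retry'
        // Give up after two retries (three attempts in total).
        maxRetries    = 2
    }
}
```

Retrying on every failure is blunt, since it also re-runs tasks that fail for reasons unrelated to memory; narrowing `errorStrategy` to specific exit codes is an option when the scheduler reports them reliably.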

SPPearce commented 2 months ago

Fixed in #60
