nrlab-CRUK / INVAR2

Restructured version of INVAR

Update 1_parse.nf #4

Closed bjpop closed 2 years ago

bjpop commented 2 years ago

Add memory requirement to avoid failing in SLURM environment when memory use exceeds some default value.

Without this setting the pipeline fails on our SLURM cluster with an error:

```
/opt/conda/envs/invar2/bin/picard: line 66: 134483 Killed /opt/conda/envs/invar2/bin/java -Xms512m -Xmx2g -jar /opt/conda/envs/invar2/share/picard-2.26.10-0/picard.jar CreateSequenceDictionary "--REFERENCE" "human_g1k_v37_decoy.fasta" "--OUTPUT" "human_g1k_v37_decoy.dict"
```

As you can see, the process was killed because it exceeded its memory request.
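In Nextflow, this kind of per-process memory request is declared with the `memory` directive, and the JVM heap can be capped just below it so SLURM doesn't kill the task. The sketch below is illustrative, not the exact change in this PR: the process name, the 4 GB figure, and the assumption that the bioconda `picard` wrapper forwards a leading `-Xmx` option to the JVM are all assumptions.

```groovy
// Hypothetical sketch of a Nextflow process with an explicit memory request.
process createSequenceDictionary {
    memory '4 GB'   // what SLURM will allocate for this task

    input:
    path reference

    output:
    path "${reference.baseName}.dict"

    script:
    // task.memory reflects the directive above; give the JVM 128 MB
    // less than the allocation to leave room for non-heap overhead.
    def jvmMem = task.memory.toMega() - 128
    """
    picard -Xmx${jvmMem}m CreateSequenceDictionary \\
        --REFERENCE ${reference} \\
        --OUTPUT ${reference.baseName}.dict
    """
}
```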

rich7409 commented 2 years ago

Thanks for the input.

First, the memory requirement: the CreateSequenceDictionary task didn't limit the memory given to the JVM, so I've added parameters to set a maximum. I can't imagine this task needs more than the default 1 GB, but without a limit the garbage collector might never kick in, and the process will claim more and more memory until the cluster system kills it. The change makes sure the JVM won't use more than the allocated memory (it actually assigns the JVM 128 MB less than is given to the task, to allow for overhead). I've also added text to Running.md describing how to define memory requirements on a per-project basis in the project's nextflow.config, so if CreateSequenceDictionary still blows up, that describes how to increase the allocation.

Second, your change to samtools_faidx is specific to Ensembl chromosome naming (ours was specific to UCSC). To support either UCSC or Ensembl naming without having to change the workflow, I've added a parameter CHROMOSOME_ID_PREFIX to nextflow.config. It is just a string, but it should really be either "chr" for UCSC references or the empty string ('') for Ensembl ones. The default is "chr", but again it can be set differently in a project's nextflow.config.

Rich.
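Putting the two per-project overrides together, a project's nextflow.config might look like the following sketch. The process selector name is an assumption; only the CHROMOSOME_ID_PREFIX parameter name comes from the comment above.

```groovy
// Hypothetical project-level nextflow.config overrides.

params {
    // Ensembl-style reference: no "chr" prefix on chromosome names.
    // Use 'chr' (the default) for UCSC-style references.
    CHROMOSOME_ID_PREFIX = ''
}

process {
    // Raise the allocation if CreateSequenceDictionary is still killed;
    // the JVM heap is derived from this value, minus overhead.
    withName: createSequenceDictionary {
        memory = '8 GB'
    }
}
```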