nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Loading module failure with SLURM executor #175

Closed Hammarn closed 8 years ago

Hammarn commented 8 years ago

There seems to be an issue when trying to load the bioinfo-tools module on our SLURM HPC. It worked just fine with earlier versions of Nextflow but appears to be broken in both 0.19.XX and 0.20.0-SNAPSHOT.

cat .command.env:

nxf_module_load(){
  local mod=$1
  local ver=${2:-}
  local new_module="$mod"; [[ $ver ]] && new_module+="/$ver"

  if [[ ! $(module list 2>&1 | grep -o "$new_module") ]]; then
    old_module=$(module list 2>&1 | grep -Eo "$mod\/[^\( \n]+" || true)
    if [[ $ver && $old_module ]]; then
      module switch $old_module $new_module
    else
      module load $new_module
    fi
  fi
}

nxf_module_load bioinfo-tools

nxf_module_load picard 2.0.1

cat .command.log:

Lmod has detected the following error: These module(s) exist but cannot be
loaded as requested: "picard/2.0.1"

   Try: "module spider picard/2.0.1" to see how to load the module(s).

Output from bash -x .command.env:

enviroment_snippet.txt

pditommaso commented 8 years ago

Something strange is happening here. Since #161, a module is loaded only if it has not already been loaded. This is checked using the module list command.

If you check the output you attached, you will see that the module bioinfo-tools appears to be already loaded (actually many times), so the script skips the module load bioinfo-tools.

The module load picard/2.0.1 command, on the other hand, is executed as expected. However, for some reason the picard load then fails.

Frankly, I have no idea why this is happening, and I have no way to replicate this issue. Could it be a problem with the picard module definition?

Have you tried executing the command module spider picard/2.0.1, as suggested?

Also, could you ask your sysadmins about this problem?
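The load-only-if-missing behaviour described above can be tried without an HPC environment. The following is a minimal sketch, not Nextflow's actual code: the `module` function is a hypothetical stand-in for the real command, `load_if_missing` is an illustrative simplification of `nxf_module_load` with the `module switch` branch left out, and the `LOADED` and `LOAD_CALLS` variables exist only for this example.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real `module` command, for illustration only.
LOADED="uppmax bioinfo-tools"   # modules assumed to be loaded already
LOAD_CALLS=""                   # records every `module load` that actually runs

module() {
  case "$1" in
    # Like the real command, the listing goes to stderr.
    list) { echo "Currently Loaded Modules:"; printf '  %s\n' $LOADED; } >&2 ;;
    load) LOADED="$LOADED $2"; LOAD_CALLS="$LOAD_CALLS $2" ;;
  esac
}

# Simplified form of the check in nxf_module_load (assumption: no switch branch).
load_if_missing() {
  new_module="$1"
  [ -n "${2:-}" ] && new_module="$new_module/$2"
  # Load only when the name does not appear anywhere in the listing.
  if [ -z "$(module list 2>&1 | grep -o "$new_module")" ]; then
    module load "$new_module"
  fi
}

load_if_missing bioinfo-tools   # already listed, so no load is issued
load_if_missing picard 2.0.1    # not listed, so `module load picard/2.0.1` runs
load_if_missing picard 2.0.1    # listed now, so the second call is skipped

echo "module load calls:$LOAD_CALLS"
```

With this stub, only the first picard call reaches `module load`. The sketch says nothing about why a real `module load` might fail afterwards.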

Hammarn commented 8 years ago

It's not a problem with the picard module. I get a similar error for all my processes. I included the bash -x output from the fastqc .command.env but it looks very similar to me. fastqc_command.env.txt

Everything seems to work fine with Nextflow 0.18.3, and also when loading the modules and running the programs manually. So it seems to me that the error is most likely Nextflow related. I can try to contact our sysadmins about it.

pditommaso commented 8 years ago

Could you please add the module list command to the .command.env generated by Nextflow and execute it with bash -x? It should look like the following code:

nxf_module_load(){
  local mod=$1
  local ver=${2:-}
  local new_module="$mod"; [[ $ver ]] && new_module+="/$ver"

  if [[ ! $(module list 2>&1 | grep -o "$new_module") ]]; then
    old_module=$(module list 2>&1 | grep -Eo "$mod\/[^\( \n]+" || true)
    if [[ $ver && $old_module ]]; then
      module switch $old_module $new_module
    else
      module load $new_module
    fi
  fi
}

module list
nxf_module_load bioinfo-tools
nxf_module_load picard 2.0.1

Hammarn commented 8 years ago

I saw that I named the wrong program above: that file and this one are both from TrimGalore. I added module list as requested: trimGalore.command.env.txt

pditommaso commented 8 years ago

If you look at the produced log, you can see that it does the following.

Your initial module environment looks like:

Currently Loaded Modules:
  1) uppmax          3) java/sun_jdk1.8.0_40   5) FastQC/0.11.5
  2) bioinfo-tools   4) picard/2.0.1

Then Nextflow tries to load the following modules:

nxf_module_load bioinfo-tools
nxf_module_load FastQC
nxf_module_load cutadapt
nxf_module_load TrimGalore

However, bioinfo-tools and FastQC are correctly skipped (i.e. the module load command is not executed for them) because they are already in the environment.

Instead the following two load commands are executed as expected:

module load cutadapt/
module load TrimGalore/

Thus, I don't understand why you are experiencing this problem.

Do @fredericlemoine @dctrud @EricDeveaud have any suggestions?
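The skip-or-load decision walked through above can be reproduced from the listing text alone. This is a rough sketch that assumes the check is the plain `grep -o` substring match shown earlier in the thread. The `listing` and `to_load` variables are illustrative names, and the listing is copied from the log rather than read from a live `module list`.

```shell
#!/usr/bin/env bash
# Listing copied from the log; on a real system it would come from `module list 2>&1`.
listing='Currently Loaded Modules:
  1) uppmax          3) java/sun_jdk1.8.0_40   5) FastQC/0.11.5
  2) bioinfo-tools   4) picard/2.0.1'

to_load=""
for mod in bioinfo-tools FastQC cutadapt TrimGalore; do
  # grep -o matches substrings, so the bare name FastQC also matches FastQC/0.11.5.
  if [ -n "$(printf '%s\n' "$listing" | grep -o "$mod")" ]; then
    echo "$mod: found in listing, load skipped"
  else
    echo "$mod: not found, module load would run"
    to_load="$to_load $mod"
  fi
done

echo "would load:$to_load"
```

Because the match is on substrings, an unversioned request is satisfied by any loaded version of that module. That is why FastQC is skipped here while cutadapt and TrimGalore are not.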

Hammarn commented 8 years ago

I unloaded all the modules that I had loaded while checking manually that the programs and modules actually work. This is the background environment that the pipeline will run in: without_loaded_modules.txt

pditommaso commented 8 years ago

There's something strange happening when grep is executed. I don't understand why it prints this long output:

bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
bioinfo-tools/
:
bioinfo-tools/

Can you run these commands and paste the output here?

module list 
module list 2>&1 | grep -o "bioinfo-tools"

Hammarn commented 8 years ago

$ module list

Currently Loaded Modules:
  1) uppmax   2) java/sun_jdk1.8.0_40

$ module list 2>&1 | grep -o "bioinfo-tools"
$ module load bioinfo-tools 
$ module list

Currently Loaded Modules:
  1) uppmax   2) java/sun_jdk1.8.0_40   3) bioinfo-tools
$ module list 2>&1 | grep -o "bioinfo-tools"
bioinfo-tools

pditommaso commented 8 years ago

Are you sure you executed these commands on the same node where the previous log files were produced? This output is OK. But why does the previous log contain this output?

++ module list
++ grep -o bioinfo-tools
+ [[ ! -n bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools
bioinfo-tools ]]
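
The shape of this trace is consistent with how `grep -o` works: it prints one line per match, so a listing that mentions the same name many times expands into many lines inside the test. The sketch below uses a made-up listing (an assumption for illustration, not the reporter's actual environment) to show the effect. It does not explain why the real listing contained repeated entries.

```shell
#!/usr/bin/env bash
# Made-up listing in which the same module name appears three times (illustration only).
listing='  1) bioinfo-tools   2) bioinfo-tools   3) bioinfo-tools'

# grep -o emits each match on its own line.
matches=$(printf '%s\n' "$listing" | grep -o "bioinfo-tools")
count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

printf '%s\n' "$matches"
echo "matches: $count"

# Any non-empty result counts as "already loaded", however many lines it has.
if [ -n "$matches" ]; then
  echo "bioinfo-tools treated as loaded, module load skipped"
fi
```

The number of matching lines does not change the outcome of the check: one match or thirty, the module is treated as loaded and the load is skipped.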

Hammarn commented 8 years ago

So, I made a new directory and a new environment, started a new analysis from scratch, and was able to run the entire pipeline to completion with Nextflow 0.20.0-SNAPSHOT. Whatever caused this issue does seem to have been related to my environment setup somehow. Sorry for wasting your time, and thanks for the help. I'll let you know if I can figure out what caused the issue. You can probably close this issue now.

pditommaso commented 8 years ago

Good. Thanks for the feedback.