snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

`SLURM_*` environment variables can interfere with execution of MPI workloads #22

Closed: ocaisa closed this issue 5 months ago

ocaisa commented 5 months ago

I have seen that launching a workflow from within the context of a SLURM job (and therefore with plenty of `SLURM_*` environment variables set) can interfere with the correct execution of an MPI application. In my case, I only observed this when executing across multiple nodes from within JupyterHub (where the Jupyter environments run on a compute node). I found that unsetting all the SLURM variables before running snakemake allows the workflow to complete correctly. To do this, you can use:

```bash
# Unset all environment variables that start with SLURM_
for var in $(printenv | grep '^SLURM_' | awk -F= '{print $1}'); do unset "$var"; done
```

before running your snakemake commands.
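If this has to happen routinely (for example every time the workflow is started from a JupyterHub session), the cleanup and the launch can be combined in a small wrapper. This is only a sketch: the script name and the `snakemake` arguments are placeholders for whatever executor and profile options you normally use.

```bash
#!/usr/bin/env bash
# run_workflow.sh (hypothetical name): strip inherited SLURM_* variables,
# then launch Snakemake with the SLURM executor plugin.
set -euo pipefail

# Unset every SLURM_* variable inherited from the surrounding job
# (e.g. the JupyterHub compute-node session) so it cannot leak into MPI job steps.
for var in $(printenv | grep '^SLURM_' | awk -F= '{print $1}'); do
    unset "$var"
done

# Placeholder invocation: adjust --jobs and any profile options to your setup.
snakemake --executor slurm --jobs 10 "$@"
```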

I'm not sure if this is something that could/should be fixed here or whether it should be resolved within the cluster environment, but I put it here in case someone else comes across the issue.

cmeesters commented 5 months ago

I'm not sure whether this is a Snakemake issue. Consider that the default SLURM behaviour is to inherit the environment of your local setup (e.g. on login or head nodes) for your jobs. This will likely be the case for your Jupyter node, too.

In other words: It's a quirk of your environment.
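To make the inheritance described above visible, you can list what the surrounding job context already defines. By default `sbatch` submits with `--export=ALL`, so these values are carried into any job submitted from that shell, while `--export=NONE` suppresses that propagation. This is a generic SLURM illustration, not something the plugin does for you.

```bash
# Inside the Jupyter/compute-node session: show the SLURM_* variables already set
# by the surrounding job. With sbatch's default --export=ALL these would be
# propagated into any job submitted from here; --export=NONE avoids that.
printenv | grep '^SLURM_' | sort
```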

cmeesters commented 5 months ago

PS: Can you start working without already being in a job context, e.g. by starting your notebook on a login node?

ocaisa commented 5 months ago

Yes, I'm not expecting a fix here (so I will close this). I was mainly reporting it in case someone else comes across the same issue. I can indeed start the workflow from a login node, but that would be outside the JupyterLab context.