snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

Running snakemake in a SLURM job context #113

Closed: visze closed this issue 2 weeks ago

visze commented 1 month ago

In our HPC environment I am not allowed to run anything on the login nodes (you get kicked out, banned...). So I always have to run snakemake with sbatch or in an interactive session via srun.

Therefore I always get the warning

You are running snakemake in a SLURM job context. This is not recommended, as it may lead to unexpected behavior. Please run Snakemake directly on the login node.

And the warning is correct, because in particular the number of threads is not set correctly.

Any idea how to run snakemake within a SLURM job properly? SLURM interplayed perfectly with snakemake 7 using --cluster, --cluster-cancel and so on. But with snakemake 8 I am forced to use this plugin, which does not work in our environment...

cmeesters commented 1 month ago

The thing is: if you run sbatch/srun, you give a parameterization (e.g. memory, CPUs) and an environment, and this gets inherited. Hence the warning.

We specifically designed the plugin to trigger few status checks and to run only selected rules (localrules) on the login node, e.g. for plotting or downloads, which do not perform heavy computations. Otherwise, Snakemake is practically dormant. I am happy to exchange with your admins (no strings attached!). However, I am currently leaving for a holiday.
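
For illustration, marking such rules as local in the Snakefile (the rule name here is just an example) keeps them with the main Snakemake process instead of submitting them:

# lightweight rules run alongside the main Snakemake process, not as SLURM jobs
localrules: plot_summary

rule plot_summary:
        input:
                "results/summary.tsv"
        output:
                "results/summary.pdf"
        shell:
                "echo plotting {input} > {output}"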

Frankly, "I am not allowed to run anything on the login nodes (kicked out, banned...)" sounds really harsh. Producing a plot with a few seconds of CPU time will hardly get noticed. Such rules are meant to prevent impairing the work of others, not to hinder science.

visze commented 1 month ago

Well, that's how it is, and I totally understand the policy. Just imagine 200 users run small local rules on the login node...

Also, the complete file system is not visible from the login nodes. So the only way I can run Snakemake is after hopping to a compute node via an interactive session.

visze commented 1 month ago

Maybe it is possible to overwrite the environment variables using the resources for the rule?

P.S.: Happy holidays. Nothing urgent here.

cmeesters commented 1 month ago

Just imagine 200 users run small local rules on the login node...

I do. And? ;-) If they don't "really" start to calculate, the system will cope. That's what Linux is designed for.

Also not the complete file system is visible.

That is most inconvenient for any kind of work. I am sure that there are alternative solutions.

Maybe it is possible to overwrite environment vars using the resources for the rule?

I will look into it - and keep the issue open.

brisk022 commented 1 month ago

The main process may also need to create containers and/or conda environments. Depending on the complexity, that may require a lot of CPU time and a large chunk of memory. In addition, the main process needs to build a DAG and potentially rebuild it later, a task that can be quite resource-intensive too. While the environments can be created beforehand with --conda-create-envs-only (if the users remember), there is no easy fix for DAG building. I managed to create a pipeline requiring more than 4 GB of memory in the past. That was years ago, so perhaps the algorithms have become more efficient since then.
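
For example, assuming the Snakemake 8 CLI flags, the environments could be pre-built once before the actual run:

# pre-create all conda environments without executing any rules
snakemake --software-deployment-method conda --conda-create-envs-only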

People tend to forget or to confuse the environments. We had to introduce restrictions (1 GB of memory, 10 minutes of CPU time) after the login nodes went down in flames a couple of times. So we do not kick/ban users as in the OP's case, but users can still kick themselves. With those restrictions, some pipelines fail to run on the login nodes, so we have always recommended running them as jobs. It does seem wasteful because, as you said, the main process is dormant most of the time. But the only other alternative I see is to try it on the login node and submit it as a job if it exceeds the limits.

I do not quite understand the parametrisation and inheritance comment. The environment is inherited either way and the resource requirements are typically overridden. When does it become a problem?

cmeesters commented 2 weeks ago

The executor has to export the environment; how else can Snakemake find itself or other software defined in the environment? However, the SLURM variables get exported, too. Please test PR #137. Does it allow correct submission? It removes SLURM* variables within the Snakemake process. Further suggestions on how to solve this issue are welcome.
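
For illustration, a minimal sketch of what removing the job context inside the Snakemake process could look like (the variable prefix is an assumption; the actual implementation in the PR may differ):

import os

def delete_slurm_environment():
    # drop the SLURM_* variables inherited from the surrounding job, so that
    # submitted rules are parameterized by the plugin, not the outer allocation
    for var in list(os.environ):
        if var.startswith("SLURM_"):
            del os.environ[var]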

tbigot commented 2 weeks ago

I had the same issue as described: the scripts that launch my pipeline are designed to run in a SLURM job, and the threads were always set to 1. I've just tested this PR, and I can confirm it solved the problem.

cmeesters commented 2 weeks ago

I will run a few more tests tomorrow (there is no chance to test in-job submission in our CI pipeline) and will then ready it for review.

visze commented 2 weeks ago

I tried the PR on a fresh snakemake installation:

mamba create -n snakemake_slurm_test snakemake pip
pip install git+https://github.com/snakemake/snakemake-executor-plugin-slurm.git@feat/in_job_stability

Then I ran snakemake using the slurm plugin within an interactive session (srun -i bash) with one CPU:

    JOBID PARTITION NAME                           ST         TIME NODES CPUS MIN_MEMORY NODELIST(REASON)
   8969734 medium    bash                            R     18:34:51     1    1        30G      hpc-cpu-109

I get the following error when just running snakemake (version 8.18.1). Before installing the plugin, running snakemake worked fine!

Traceback (most recent call last):
  File "/data/cephfs-1/work/groups/kircher/users/schubacm_c/miniforge3/envs/snakemake_slurm/bin/snakemake", line 6, in <module>
    from snakemake.cli import main
  File "/data/cephfs-1/work/groups/kircher/users/schubacm_c/miniforge3/envs/snakemake_slurm/lib/python3.12/site-packages/snakemake/cli.py", line 22, in <module>
    from snakemake.api import (
  File "/data/cephfs-1/work/groups/kircher/users/schubacm_c/miniforge3/envs/snakemake_slurm/lib/python3.12/site-packages/snakemake/api.py", line 50, in <module>
    from snakemake.workflow import Workflow
  File "/data/cephfs-1/work/groups/kircher/users/schubacm_c/miniforge3/envs/snakemake_slurm/lib/python3.12/site-packages/snakemake/workflow.py", line 65, in <module>
    from snakemake.scheduler import JobScheduler
  File "/data/cephfs-1/work/groups/kircher/users/schubacm_c/miniforge3/envs/snakemake_slurm/lib/python3.12/site-packages/snakemake/scheduler.py", line 27, in <module>
    registry = ExecutorPluginRegistry()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/cephfs-1/work/groups/kircher/users/schubacm_c/miniforge3/envs/snakemake_slurm/lib/python3.12/site-packages/snakemake_interface_common/plugin_registry/__init__.py", line 31, in __init__
    self.collect_plugins()
  File "/data/cephfs-1/work/groups/kircher/users/schubacm_c/miniforge3/envs/snakemake_slurm/lib/python3.12/site-packages/snakemake_interface_common/plugin_registry/__init__.py", line 77, in collect_plugins
    module = importlib.import_module(moduleinfo.name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/cephfs-1/work/groups/kircher/users/schubacm_c/miniforge3/envs/snakemake_slurm/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/cephfs-1/work/groups/kircher/users/schubacm_c/miniforge3/envs/snakemake_slurm/lib/python3.12/site-packages/snakemake_executor_plugin_slurm/__init__.py", line 29, in <module>
    from utils import delete_slurm_environment
ModuleNotFoundError: No module named 'utils'

CarstenBaker commented 2 weeks ago

I changed this line in __init__.py: from utils import delete_slurm_environment to from .utils import delete_slurm_environment

so that the relative utils module is used.
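
i.e. in snakemake_executor_plugin_slurm/__init__.py:

# relative import, so Python resolves the plugin's own utils module
from .utils import delete_slurm_environment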

cmeesters commented 2 weeks ago

Ah, sorry guys, I am a little bit stressed and forgot to commit my fixes. The CI is now all green, and I hope I can test it myself within the hour.

CarstenBaker commented 2 weeks ago

Not a problem, and honestly, don't stress it. Let us know if you want us to test anything or take some of the strain.

Just to add a +1 on running from compute nodes rather than the head/login node: we have a similar setup. We also run nextflow at times and found that it caused chaos on the head node.

Being able to run a minimal sbatch script (1 CPU / 100 MB) to launch snakemake would help keep things simpler and cleaner for us when helping users. There is less chance of them accidentally running applications on the head node or building apptainer images there, and it keeps things in line with our nextflow setup for running workflows. Although it is a slight waste of resources, having a job running with minimal specs to control snakemake still seems preferable to having users attached to the login node while the workflow runs (either via tmux or sometimes by leaving devices connected), and it also keeps the number of processes and attached users on the login node lower.

If users know what they are doing, then running from the head node is fine, but for testing and for new users, running on compute nodes (whether via sbatch or srun) seems a lot safer and means fewer headaches for us.
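
For illustration, a minimal launcher along those lines (partition, time limit and profile path are assumptions) could be:

#!/bin/bash
#SBATCH --job-name=smk-controller
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --partition=defq
#SBATCH --time=2-00:00:00
# the controller job only runs the Snakemake scheduler; the slurm executor
# plugin submits the actual rules as separate jobs with their own resources
snakemake --profile slurm_profile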

cmeesters commented 2 weeks ago

Interesting. I can only imagine that snakemake wants to execute everything locally (hence on the login node) when things aren't configured correctly.

Otherwise, only localrules run locally, and those are things like downloads or plotting: boring stuff, without CPU load or I/O contention hassles.

It gives me an idea for long-term development: implementing a kind of safeguard.

CarstenBaker commented 2 weeks ago

That's it exactly: if configured correctly it's not an issue, but sadly things tend to go astray. The new update works a lot better than before in srun and sbatch; it's allocating and reporting correctly everywhere apart from in the SLURM logs (I think this is the same as the dry-run issue with threads).

At the moment I am setting both threads and cpus_per_task in the profile (different values in this example for testing reasons). Is there a way of combining them so this only needs to be set once in the profile/config?

If I only set threads, it doesn't allocate enough resources in SLURM; if I just set cpus_per_task, it doesn't fill in threads. Not sure if there is a method I am missing?

Snakefile

rule all:
        input:
                "output/123.csv",
                "output/987.csv"

rule createtxt:
        output:
                "output/{num}.txt"
        shell: '''
                echo "Threads = {threads}"
                echo {wildcards.num}  > {output}
        '''

rule createcsv:
        input:
                "output/{num}.txt"
        output:
                "output/{num}.csv"
        shell: '''
                echo "Threads = {threads}"
                echo {wildcards.num}  > {output}
        '''

slurm_profile/config.v8+.yaml

jobs: 100
default-resources:
    slurm_partition: defq
    slurm_extra: "'-o smk.%j.out -e smk.%j.err'"
printshellcmds: True
set-threads:
        createtxt: 4
set-resources:
        createtxt:
            mem_mb_per_cpu: 500
            cpus_per_task: 6

running with: snakemake --profile slurm_profile

visze commented 2 weeks ago

I was now able to test the PR.

I started an interactive session using srun -i bash and allocated only 1 CPU:

     JOBID PARTITION NAME                           ST         TIME NODES CPUS MIN_MEMORY NODELIST(REASON)
   8969734 medium    bash                            R   1-21:31:07     1    1        30G      hpc-cpu-109 

Then I ran snakemake with the slurm executor plugin; my rule gets 30 threads (set in the profile):

set-threads:
  assignment_mapping_bwa: 30

Interestingly, the job gets just 2 CPUs instead of 30! That was the same behaviour before, so nothing has improved on my side.

 JOBID PARTITION NAME                           ST         TIME NODES CPUS MIN_MEMORY NODELIST(REASON) USER                                  COMMENT
8986188 medium    8302f233-44d1-4ce9-84a6-ff80ef  R         0:00     1    2      9537M      hpc-cpu-198 schubacm_c rule_assignment_mapping_bwa_wildcards_as

I will now try running snakemake from an sbatch script to see if something changes, but I am not so optimistic.

EDIT: That does not work either, i.e. when the snakemake job itself is submitted from the login node.

CarstenBaker commented 2 weeks ago

If you try setting resources for the rule as well (keeping the threads), does it then use 30 CPUs? This is what I am currently doing; I have to set both resources and threads in the profile/config file:

set-resources:
        assignment_mapping_bwa:
            cpus_per_task: 30

Or change the rule to use resources instead of threads and just set resources in config.

resources:
        cpus_per_task=30
shell:
        'echo cpus = {resources.cpus_per_task}'

Not sure of the best way to tie them together without breaking local workflows or requiring a rewrite.

visze commented 2 weeks ago

I will try it and let you know. But in general this will be a drawback when threads are defined dynamically. Is there an option to set cpus_per_task from threads?

cmeesters commented 2 weeks ago

We import a function from the jobstep-Executor:

def get_cpus_per_task(job: JobExecutorInterface):
    cpus_per_task = job.threads
    if job.resources.get("cpus_per_task"):
        if not isinstance(cpus_per_task, int):
            raise WorkflowError(
                f"cpus_per_task must be an integer, but is {cpus_per_task}"
            )
        cpus_per_task = job.resources.cpus_per_task
    # ensure that at least 1 cpu is requested
    # because 0 is not allowed by slurm
    return max(1, cpus_per_task)

Read: if you define threads, or take a third-party workflow where threads is defined (or preferably read from the configuration), cpus_per_task is translated from threads accordingly. If you choose to just set cpus_per_task from your parametrization file, this is also fine. You can use cpus_per_task to supersede threads, because threads is often hard-coded and intended for local execution, whereas on a cluster you might achieve better scalability with a different setting.
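
In profile terms (using the rule name from the example above), either of these is intended to end up requesting 30 CPUs for the rule's jobs:

# Option A: define threads only; the plugin translates threads into cpus_per_task
set-threads:
  assignment_mapping_bwa: 30

# Option B: define cpus_per_task explicitly; it supersedes threads for the SLURM request
set-resources:
  assignment_mapping_bwa:
    cpus_per_task: 30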

CarstenBaker commented 2 weeks ago

I think this is the part that's currently not working. If you don't specify cpus_per_task >= threads, it limits you to the number of CPUs allocated to the launching sbatch/srun command.

It works fine from the head node (correct number of threads/CPUs), but if you use an sbatch script with:

#SBATCH --cpus-per-task=4
snakemake --profile slurm_profile

then the maximum number of threads it allows in any snakemake job will be 4, even though the jobs can launch on different nodes and the thread count is set higher in the profile.

set-threads:
        createtxt: 8

We will probably just stick with duplicating threads and cpus_per_task in the config file for the moment, rather than rewriting the workflow to use cpus_per_task as a resource, because otherwise it would break local (non-SLURM) runs.

visze commented 2 weeks ago

I totally agree with you @CarstenBaker. It works with cpus_per_task, but not when you only define threads.

job.threads seems to be the thread count inherited from the main SLURM job (where snakemake is running), not the one defined for the rule.

CarstenBaker commented 2 weeks ago

If you manually set SLURM_CPUS_PER_TASK in the srun session to be higher and then launch snakemake, it will pick up the threads correctly (obviously still assuming you set it higher than the threads requested).

export SLURM_CPUS_PER_TASK=10

But I wouldn't recommend it, as it seems a bit of a hack; this was more just for a test.

cmeesters commented 2 weeks ago

What do you mean by "currently"? Even using the new PR?

CarstenBaker commented 2 weeks ago

Sadly yes, I installed the new PR fresh this afternoon after your message (all my examples are with the new PR). I tried it in conda and also with a standard venv (below):

# using python/3.12.1
python -m venv snakemake_8_slurmfix
source snakemake_8_slurmfix/bin/activate
pip install --upgrade pip
pip install snakemake
pip install git+https://github.com/snakemake/snakemake-executor-plugin-slurm.git@feat/in_job_stability

It still seems to be using the SLURM variables? Is there another method you used to install? Although, looking in the venv, the updated files are there, so I assume they are installed correctly.

CarstenBaker commented 2 weeks ago

I think I understand what delete_slurm_environment() is trying to do; it's basically just removing the SLURM environment variables.

I think the unset call is failing for me; if I try it manually in Python 3.12.1, it doesn't remove the variable from os.environ.

For example, if I srun with 6 CPUS_PER_TASK and try unsetenv I get the following, whereas if I use pop instead it removes it (not sure if unsetenv has changed; my Python skills are greatly lacking!):

>>> import os
>>> print(os.environ.get('SLURM_CPUS_PER_TASK'))
6
>>> os.unsetenv('SLURM_CPUS_PER_TASK')
>>> print(os.environ.get('SLURM_CPUS_PER_TASK'))
6
>>> print(os.environ.pop('SLURM_CPUS_PER_TASK'))
6
>>> print(os.environ.get('SLURM_CPUS_PER_TASK'))
None
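
(As far as I can tell from the documented behaviour, os.unsetenv() does not update the os.environ mapping, whereas removing the key from os.environ updates both the mapping and the process environment:)

# removing the key via os.environ also calls unsetenv() under the hood
os.environ.pop('SLURM_CPUS_PER_TASK', None)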

However, even if I unset SLURM_CPUS_PER_TASK manually and remove any reference to the 6 CPUs from the environment, the snakemake jobs still won't go above the number of CPUs defined for the launching sbatch/srun allocation (not sure where else it picks up this number from?).