snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster

sacctmgr error when running in job context #164

Open jprnz opened 2 weeks ago

jprnz commented 2 weeks ago

Two things prevent the use of this plugin on our cluster:

  1. The slurm.cfg file used for the cluster is in a non-standard location
  2. Our admins prefer that we not run anything on the login node

To make many of the common SLURM tools work, users of our cluster need to have SLURM_CONFIG set in their environment. Since the plugin wipes all environment variables prefixed with SLURM_ whenever it sees SLURM_JOB_ID, sacctmgr and sinfo exit with an error:

WorkflowError:
Unable to test the validity of the given or guessed SLURM account 'xyz' with sacctmgr: sacctmgr: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sacctmgr: error: fetch_config: DNS SRV lookup failed
sacctmgr: error: _establish_config_source: failed to fetch config
sacctmgr: fatal: Could not establish a configuration source

This seems like an unintended consequence and could easily be fixed by not removing SLURM_CONFIG. The issue can be avoided by running:

unset SLURM_JOB_ID
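
For illustration, here is a minimal sketch of what "not removing SLURM_CONFIG" could look like: strip the SLURM_* job-context variables before calling the query tools, but keep the variables that only point at the cluster configuration. The kept names and the sacctmgr invocation are assumptions for this sketch, not the plugin's actual code.

```python
import os
import subprocess

# Variables that only locate the cluster configuration and should be safe to keep.
# SLURM_CONFIG is the name used on our cluster; SLURM_CONF is the name from the
# SchedMD documentation. Treat this set as an assumption.
KEEP = {"SLURM_CONF", "SLURM_CONFIG"}


def slurm_env_without_job_context() -> dict:
    """Copy of the environment with SLURM_* job variables removed,
    except for the configuration-location variables in KEEP."""
    return {
        k: v
        for k, v in os.environ.items()
        if not k.startswith("SLURM_") or k in KEEP
    }


# Example: query accounts without inheriting the surrounding job's context.
result = subprocess.run(
    ["sacctmgr", "-nP", "show", "assoc", "format=account"],
    env=slurm_env_without_job_context(),
    capture_output=True,
    text=True,
)
print(result.stdout)
```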

Personally, I think it would be nice to be able to set slurm_account / slurm_partition via environment variables (as srun / sbatch do); to me, this seems like a sensible way to determine a default value. A rough sketch of what I mean follows.
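
A minimal sketch of that idea, assuming the variable names that srun (SLURM_ACCOUNT, SLURM_PARTITION) and sbatch (SBATCH_ACCOUNT, SBATCH_PARTITION) honour; the precedence shown is my own suggestion, not existing plugin behaviour:

```python
import os


def default_slurm_account() -> str | None:
    # Fall back to the env vars srun/sbatch themselves read when no
    # account is given explicitly (assumed precedence).
    return os.environ.get("SLURM_ACCOUNT") or os.environ.get("SBATCH_ACCOUNT")


def default_slurm_partition() -> str | None:
    return os.environ.get("SLURM_PARTITION") or os.environ.get("SBATCH_PARTITION")
```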

Thanks for your work and continuing to help the community!

$ snakemake --version
8.25.1

$ mamba list | grep "snakemake-executor-plugin-slurm"
snakemake-executor-plugin-slurm 0.11.1             pyhdfd78af_0    bioconda
snakemake-executor-plugin-slurm-jobstep 0.2.1              pyhdfd78af_0    bioconda

$ sinfo --version
slurm 23.02.7
cmeesters commented 2 weeks ago

Thanks for reporting: I will look into it. Thankfully, you provided the solution along with the report. Meanwhile, I think that unsetting the SLURM variables will not help at all: according to SchedMD's documentation, the built-in variables are always exported. The solution must be to set the job parameters explicitly, always. Could you test code over a few iterations? I'm afraid I might only get to it on Thursday or Friday, though.
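
A minimal sketch of what "setting job parameters explicitly, always" could look like: account and partition go on the sbatch command line instead of being inherited from SLURM_* variables. The flags are standard sbatch options; how the plugin actually assembles its call may differ.

```python
def build_sbatch_call(account: str, partition: str, jobscript: str) -> list[str]:
    # Pass account and partition explicitly so the submission does not
    # depend on environment variables inherited from a surrounding job.
    return [
        "sbatch",
        f"--account={account}",
        f"--partition={partition}",
        jobscript,
    ]


# e.g. build_sbatch_call("xyz", "somepartition", "jobscript.sh")
```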

NB:

Our admins prefer ...

Yes, I have gotten this nonsense a lot. As if it hurt anyone when someone produces a plot within a few seconds on a login node. (Or runs a workflow manager, which consumes about as much CPU power over the course of a workflow.)