snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

Job status query fails if slurm accounting storage is disabled #38

Open prs513rosewood opened 4 months ago

prs513rosewood commented 4 months ago

I get an error when running a job on a Slurm instance whose accounting storage is disabled (i.e. the sacct command just replies "Slurm accounting storage is disabled"). Here's the stack trace:

The job status query failed with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --starttime 2024-02-21T16:00 --endtime now --name ddcf5013-8004-418b-8832-9a563aaf5280
Error message: Slurm accounting storage is disabled

argument of type 'NoneType' is not iterable
Traceback (most recent call last):
  File "/home/rosewood/venvs/aging_stearic/lib/python3.11/site-packages/snakemake_interface_executor_plugins/executors/remote.py", line 190, in _wait_thread
    asyncio.run(self._wait_for_jobs())
  File "/home/rosewood/stage/spack-0.21.1/opt/spack/linux-centos7-k10/gcc-8.3.1/python-3.11.6-aunbzdzafawzwjwh4wtq45ftn2zjmnzw/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/rosewood/stage/spack-0.21.1/opt/spack/linux-centos7-k10/gcc-8.3.1/python-3.11.6-aunbzdzafawzwjwh4wtq45ftn2zjmnzw/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rosewood/stage/spack-0.21.1/opt/spack/linux-centos7-k10/gcc-8.3.1/python-3.11.6-aunbzdzafawzwjwh4wtq45ftn2zjmnzw/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/rosewood/venvs/aging_stearic/lib/python3.11/site-packages/snakemake_interface_executor_plugins/executors/remote.py", line 180, in _wait_for_jobs
    still_active_jobs = [
                        ^
  File "/home/rosewood/venvs/aging_stearic/lib/python3.11/site-packages/snakemake_interface_executor_plugins/executors/remote.py", line 180, in <listcomp>
    still_active_jobs = [
                        ^
  File "/home/rosewood/venvs/aging_stearic/lib/python3.11/site-packages/snakemake_executor_plugin_slurm/__init__.py", line 263, in check_active_jobs
    if j.external_jobid not in status_of_jobs:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NoneType' is not iterable

Looks like there's some error handling here: https://github.com/snakemake/snakemake-executor-plugin-slurm/blob/7e3de33ab447cd3415e53464019cce8e7361bda8/snakemake_executor_plugin_slurm/__init__.py#L221

But after the loop over attempts to get the job status, the rest of the code assumes that no error occurred and treats status_of_jobs as a valid set, even though it is still None when every attempt has failed.
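To illustrate the failure mode, here is a minimal sketch of a retry loop with an explicit guard after it. It is not the plugin's actual code: query_with_retries, the query callable, and the attempts parameter are hypothetical names, and the mapping type for the result is an assumption.

```python
from typing import Callable, Dict, Optional


def query_with_retries(
    query: Callable[[], Optional[Dict[str, str]]],
    attempts: int = 5,
) -> Dict[str, str]:
    """Call `query` up to `attempts` times and fail loudly if all attempts fail.

    `query` is a hypothetical callable that returns a job-id -> state mapping,
    or None when the status command failed.
    """
    status_of_jobs = None
    for _ in range(attempts):
        status_of_jobs = query()
        if status_of_jobs is not None:
            break
    # Guard: without this check, a later `jobid not in status_of_jobs`
    # raises "TypeError: argument of type 'NoneType' is not iterable".
    if status_of_jobs is None:
        raise RuntimeError(
            f"job status query failed after {attempts} attempts"
        )
    return status_of_jobs
```

Raising a descriptive error here is only one option; returning an empty mapping and logging a warning would also avoid the TypeError.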

The slurm profile also uses sacct but falls back to scontrol if that fails, which might be a solution: https://github.com/Snakemake-Profiles/slurm/blob/c44315217d1ce36493dc7dccbd013528657747f9/%7B%7Bcookiecutter.profile_name%7D%7D/slurm-status.py#L40
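A rough sketch of that kind of fallback, for a single job, could look like this. It is not the profile's code: run_command, parse_sacct, parse_scontrol, and job_state are hypothetical helpers, and the sketch assumes that scontrol -o show job prints a JobState=... field and that a disabled accounting storage yields no pipe-separated sacct output.

```python
import re
import subprocess
from typing import Callable, List, Optional


def run_command(cmd: List[str]) -> Optional[str]:
    """Return stdout of `cmd`, or None if it cannot run or exits non-zero."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout


def parse_sacct(output: str) -> Optional[str]:
    """Extract the state from 'JobIdRaw|State' lines; None if no such line."""
    for line in output.splitlines():
        fields = line.strip().split("|")
        if len(fields) >= 2 and fields[1]:
            # "CANCELLED by 1000" -> "CANCELLED"
            return fields[1].split()[0]
    return None


def parse_scontrol(output: str) -> Optional[str]:
    """Extract the value of the JobState= field, if present."""
    match = re.search(r"JobState=(\w+)", output)
    return match.group(1) if match else None


def job_state(
    jobid: str,
    run: Callable[[List[str]], Optional[str]] = run_command,
) -> Optional[str]:
    """Ask sacct first; fall back to scontrol if sacct gives no usable answer."""
    out = run(["sacct", "-X", "--parsable2", "--noheader",
               "--format=JobIdRaw,State", "--jobs", jobid])
    state = parse_sacct(out) if out else None
    if state is None:
        out = run(["scontrol", "-o", "show", "job", jobid])
        state = parse_scontrol(out) if out else None
    return state
```

Note that scontrol only knows about jobs still held in the controller's memory, so a finished job may no longer be visible by the time it is queried.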

cmeesters commented 4 months ago

Thank you for this report. We definitely need to update the error message!

We had the fallback in the executor, but decided to drop it to be able to check the states in asynchronous mode with one command. A cluster without accounting db is pretty unusual. Re-introducing the fallback might not be so easy.
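For context, the single-command approach can be sketched roughly as follows: one sacct call, filtered by the run's job name, returns the states of all submitted jobs at once. The flags are taken from the command in the error message above; build_sacct_command and parse_job_states are hypothetical names, not the plugin's functions.

```python
from typing import Dict, List


def build_sacct_command(run_uuid: str, start: str) -> List[str]:
    """Build one sacct query covering every job submitted under `run_uuid`."""
    return [
        "sacct", "-X", "--parsable2", "--noheader",
        "--format=JobIdRaw,State",
        "--starttime", start, "--endtime", "now",
        "--name", run_uuid,
    ]


def parse_job_states(sacct_output: str) -> Dict[str, str]:
    """Turn 'JobIdRaw|State' lines into a job-id -> state mapping.

    Lines without a '|' separator (such as an error message) are skipped.
    """
    states: Dict[str, str] = {}
    for line in sacct_output.splitlines():
        jobid, sep, state = line.strip().partition("|")
        if sep and jobid:
            states[jobid] = state
    return states
```

A per-job scontrol fallback would need one call per job, which is what makes it harder to combine with this design.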

Is your particular cluster in an experimental stage?

prs513rosewood commented 4 months ago

Thanks for looking at this, I know this is a weird edge case. The cluster in question is somewhat artisanal.

I think the slurm cluster profile may be a workable fallback for me. And it looks like 6a197ae fixes the issue of status_of_jobs being invalid.

cmeesters commented 4 months ago

> I think the slurm cluster profile may be a workable fallback for me.

Perhaps. Then again, you might want to use storage plugins and/or other plugins. That would be a mess. Is there any chance your admins set up the cluster ... eh, properly?