snakemake / snakemake-executor-plugin-slurm

A Snakemake executor plugin for submitting jobs to a SLURM cluster
MIT License

fix: Handle unresponsive sacct #5

Closed fgvieira closed 8 months ago

fgvieira commented 8 months ago

Fixes snakemake/snakemake#2411 (reposting PR snakemake/snakemake#2413 on the new repo)

When sacct is unresponsive (and the query times out), snakemake currently exits with an error. This PR aims to handle the timeout properly by retrying the query. I am not sure whether it should wait a bit longer before querying sacct again.
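
To make the retry idea concrete, here is a minimal sketch and not the code in this PR. The function name `query_with_retries` and its parameters (`attempts`, `timeout`, `wait`) are hypothetical. It only assumes that the status command is run as a subprocess and that a timeout or non-zero exit should trigger another attempt rather than an exit.

```python
import subprocess
import time


def query_with_retries(cmd, attempts=3, timeout=30.0, wait=0.0):
    """Run cmd and return its stdout, retrying on timeout or non-zero exit.

    Hypothetical helper: returns None if every attempt fails, so the caller
    can decide whether to try again later or give up.
    """
    for attempt in range(1, attempts + 1):
        try:
            proc = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=timeout,  # raises TimeoutExpired if the command hangs
                check=True,       # raises CalledProcessError on non-zero exit
            )
            return proc.stdout
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            # Assumed policy: optionally pause before the next attempt.
            if attempt < attempts and wait > 0:
                time.sleep(wait)
    return None
```

With this shape, a caller would pass the sacct command as a list of arguments and treat a `None` result as "status unknown for now" instead of a fatal error.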

EDIT: some more info

The job status query failed with command: sacct -X --parsable2 --noheader --format=JobIdRaw,State --name 05969656-0e62-47f1-9008-2a189069f0a7
Error message: sacct: error: get_addr_info: getaddrinfo() failed: Name or service not known
sacct: error: slurm_set_addr: Unable to resolve "db01fl"
sacct: error: slurm_get_port: Address family '0' not supported
sacct: error: Error connecting, bad data: family = 0, port = 0
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:db01fl:6819: Resource temporarily unavailable
sacct: error: Sending PersistInit msg: Resource temporarily unavailable
sacct: error: Problem talking to the database: Resource temporarily unavailable

Traceback (most recent call last):
  File "/envs/snakemake_env/lib/python3.11/site-packages/snakemake/executors/__init__.py", line 886, in _wait_thread
    asyncio.run(self._wait_for_jobs())
  File "/envs/snakemake_env/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/envs/snakemake_env/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/envs/snakemake_env/lib/python3.11/asyncio/base_events.py", line 653, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/envs/snakemake_env/lib/python3.11/site-packages/snakemake/executors/slurm/slurm_submit.py", line 399, in _wait_for_jobs
    (status_of_jobs, sacct_query_duration) = await self.job_stati(
                                             ^^^^^^^^^^^^^^^^^^^^^
  File "/envs/snakemake_env/lib/python3.11/site-packages/snakemake/executors/slurm/slurm_submit.py", line 330, in job_stati
    return (res, query_duration)
            ^^^
UnboundLocalError: cannot access local variable 'res' where it is not associated with a value
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
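
The `UnboundLocalError` at the end of the traceback is what Python raises when a local variable is read before it was ever assigned. The body of `job_stati` is not shown in this thread, so the following is only a guess at the pattern, with hypothetical function names: the result is assigned inside a `try` block, the failure is swallowed, and the `return` then reads a name that was never bound.

```python
def job_status_buggy(run_query):
    """Assumed shape of the failure: res is only bound if run_query succeeds."""
    try:
        res = run_query()
    except RuntimeError:
        pass  # error swallowed, so res was never assigned
    return res  # raises UnboundLocalError when run_query failed


def job_status_fixed(run_query):
    """Bind res up front so a failed query yields None instead of crashing."""
    res = None
    try:
        res = run_query()
    except RuntimeError:
        pass  # the caller sees None and can retry the query
    return res
```

Initialising the variable before the `try` block lets the caller tell "no answer yet" apart from a real result and retry.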

fgvieira commented 8 months ago

Not sure if there is a wait time between two sacct queries, but it might be a good idea (especially if the DB is temporarily unavailable).

johanneskoester commented 8 months ago

> Not sure if there is a wait time between two sacct queries, but it might be a good idea (especially if the DB is temporarily unavailable).

There is one via the rate limiter. Maybe that is sufficient for now.
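
For readers unfamiliar with the idea, a rate limiter of this kind can be sketched as below. This is not Snakemake's implementation; the class name `MinIntervalLimiter` and its parameters are hypothetical, and it only illustrates how a minimum delay between consecutive status queries could be enforced.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` before each status query spaces the queries out, which also gives a temporarily unavailable database some time to recover between attempts.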