martius-lab / cluster_utils

https://cluster-utils.readthedocs.io/stable/

Restart jobs that are stopped because they run out of time #38

Closed by luator 5 months ago

luator commented 1 year ago

In Slurm you have to specify the time required for your job. Jobs that exceed this time limit may be killed by the scheduler. If this happens, it would be nice if cluster_utils detected it and, optionally, restarted the affected jobs automatically.

luator commented 10 months ago

Once !77 is merged, exit_for_resume() can be used on Slurm. With it, a job can keep track of how much time it has spent and exit for resume shortly before running out of time. This requires adding time-tracking logic to the job, but I think it is a cleaner, more explicit solution than simply restarting every job that ran out of time. I'll therefore close this as "wontfix". Feel free to complain if you think this would be a needed feature :).

By Felix Widmaier on 2024-01-22T15:42:21 (imported from GitLab)
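
The approach described in the comment above (track the elapsed time and exit for resume a bit before the limit) could look roughly like the following sketch. It is not taken from cluster_utils: `TimeBudget`, `run_steps`, and the `save_checkpoint` callback are hypothetical names used for illustration, and the `exit_for_resume` argument stands in for the cluster_utils function of that name so that the sketch runs without the library installed.

```python
import time
from typing import Callable, Iterable


class TimeBudget:
    """Tracks wall-clock time against a job's time limit (hypothetical helper)."""

    def __init__(
        self,
        limit_seconds: float,
        margin_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if margin_seconds >= limit_seconds:
            raise ValueError("margin must be smaller than the time limit")
        self._clock = clock
        # Stop `margin_seconds` before the scheduler's limit would be reached.
        self._deadline = clock() + limit_seconds - margin_seconds

    def expired(self) -> bool:
        return self._clock() >= self._deadline


def run_steps(
    steps: Iterable[Callable[[], None]],
    budget: TimeBudget,
    save_checkpoint: Callable[[int], None],
    exit_for_resume: Callable[[], None],
) -> int:
    """Run steps until they are done or the budget is used up.

    Returns the number of completed steps.
    """
    done = 0
    for step in steps:
        if budget.expired():
            save_checkpoint(done)  # persist progress so the restart can continue
            exit_for_resume()  # assumed to end the process in a real job
            return done  # only reached if the callback returns (e.g. in tests)
        step()
        done += 1
    return done
```

In a real job, the time limit passed to `TimeBudget` would have to match the one requested from Slurm, and the margin should be long enough to write a checkpoint.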

luator commented 10 months ago

Reopened based on https://gitlab.tuebingen.mpg.de/mrolinek/cluster_utils/-/merge_requests/77#note_21729

By Felix Widmaier on 2024-01-29T13:51:05 (imported from GitLab)

luator commented 11 months ago

unassigned @felixwidmaier

By Felix Widmaier on 2023-12-11T13:40:58 (imported from GitLab)

luator commented 5 months ago

This is not fully automatic, but by adding a bit of code to the job script, it can now be done using the timeout signal (#84).
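
As a rough illustration of the "bit of code" in the job script, the sketch below installs a signal handler that only records that the timeout signal arrived, so the main loop can checkpoint and exit for resume at a safe point. The thread does not say which signal is used or how it is configured; SIGUSR1 is an assumption here, and `TimeoutFlag` and `install_timeout_handler` are hypothetical names, not part of cluster_utils.

```python
import signal


class TimeoutFlag:
    """Signal handler that remembers whether the timeout signal was received."""

    def __init__(self) -> None:
        self.received = False

    def __call__(self, signum, frame) -> None:
        # Keep the handler minimal: set a flag and let the main loop react.
        self.received = True


def install_timeout_handler(signum: int = signal.SIGUSR1) -> TimeoutFlag:
    """Register a TimeoutFlag for `signum` and return it.

    Assumption: the scheduler sends SIGUSR1 shortly before the time limit.
    SIGUSR1 is not available on Windows.
    """
    flag = TimeoutFlag()
    signal.signal(signum, flag)
    return flag
```

The job's main loop would then check `flag.received` between iterations and, once it is set, save its state and call exit_for_resume(). With plain Slurm, such a signal can be requested with the `--signal` option of `sbatch` (for example `--signal=USR1@120` for roughly 120 seconds before the limit); whether and how cluster_utils sets this up is not shown in this thread.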