martius-lab / cluster_utils

https://martius-lab.github.io/cluster_utils/

Feature: Adaptive interval for job failure checks #58

Open luator opened 8 months ago

luator commented 8 months ago

The following discussion from !71 should be addressed:

luator commented 8 months ago

The purpose of this is mostly to quickly detect whether something is fundamentally wrong that makes all jobs fail, right? In that case, is it enough to do this only once at the beginning? I was wondering whether it would make sense to reset the interval when new jobs are submitted, but depending on the number and duration of jobs, this might again lead to over-polling the system.

I'd probably start with a slightly higher value (let's say 10s) but increase it a bit more slowly. In my experience so far, it sometimes takes a while until Slurm actually starts the job, so lots of checking at the very beginning might not be that useful.

By Felix Widmaier on 2024-01-11T12:21:19 (imported from GitLab)
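
For illustration, the schedule discussed above could look roughly like the following. This is only a minimal sketch, not the cluster_utils implementation: the function name `failure_check_intervals` and the `growth` and `maximum` parameters are hypothetical, and only the 10s starting value comes from the comment. The interval grows geometrically and is capped, so checks are frequent early on and rare later.

```python
from itertools import islice
from typing import Iterator


def failure_check_intervals(
    initial: float = 10.0,   # starting interval in seconds (value suggested in the comment)
    growth: float = 1.5,     # hypothetical growth factor per check
    maximum: float = 300.0,  # hypothetical upper bound in seconds
) -> Iterator[float]:
    """Yield the waiting time (in seconds) before each successive failure check."""
    if initial <= 0 or growth < 1.0 or maximum < initial:
        raise ValueError("require initial > 0, growth >= 1 and maximum >= initial")
    interval = initial
    while True:
        yield interval
        # Grow the interval, but never beyond the cap.
        interval = min(interval * growth, maximum)


if __name__ == "__main__":
    # Show the first few waiting times of the default schedule.
    print(list(islice(failure_check_intervals(), 5)))
```

Resetting on new submissions would amount to creating a fresh generator whenever jobs are submitted, which is exactly where the over-polling concern applies if submissions are frequent.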