Feature: Adaptive interval for job failure checks

martius-lab / cluster_utils

Other

8 stars 0 forks source link

The following discussion from !71 should be addressed:

[ ] discussion 1: (+1 comment)
I would start with a min interval, then double this interval on every subsequent call, up to a max interval. E.g. from 5secs min to 60secs max, it would go like:
- 5s
- 10s
- 20s
- 40s
- 60s
- 60s
- ...
[ ] discussion 2: Also use it for Condor

Yes, it could make sense to share the throttling implementation with slurm_cluster_system and then just use a much higher threshold (like once a second). It's just reading a file, but the MPI cluster also does not like if you hammer it's filesystem too much.

martius-lab / cluster_utils