Closed mimowo closed 6 months ago
/cc @tenzen-y @alculquicondor
The default base backoff should be larger than 10s to make it noticeable to end-users. Increase the default and potentially (TBD) make it configurable.
If we agree on the default of 10s we could back-port it to the 0.6 branch.
I agree with this idea. I'm curious about whether we can override the backoff time with 10s when the backoff time is less than 10s. WDYT?
Do you mean max(10s, computed backoff starting from 1s)? I guess it makes the formula harder to explain to end-users (which is already hard).
Also, it would still mean 7-8 requeues with a ~10s delay.
I was thinking about changing the default here, but making it configurable.
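For reference, the "7-8 requeues with ~10s delay" point can be checked with a few lines. This is an illustrative sketch only, assuming the current defaults of a 1s base and a ~1.41 exponent:

```python
# Illustrative: backoff delay of the i-th requeue, assuming base 1s and exponent ~1.41.
delays = [1.41 ** i for i in range(12)]

# The proposed max(10s, computed backoff) override clamps every delay below 10s.
clamped = [max(10.0, d) for d in delays]

# Count how many requeues would all end up waiting exactly ~10s.
flat = sum(1 for d in delays if d < 10)
print(flat)  # 7 requeues would be clamped to the same 10s delay
```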
@tenzen-y any reason the base is 1.41...? It makes rough manual calculations unnecessarily trickier.
I'm wondering if we could consider making the defaults as for pod failure backoff: 10s base and exponent of 2?
the back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six minutes.
That makes sense. I'm ok with replacing the Duration with 10s.
How about changing the exponent to 2, as per: https://github.com/kubernetes-sigs/kueue/issues/2009#issuecomment-2066596645?
@mimowo As I described there, I defined the magic number so that we can make the duration estimatable.
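One way to read the "estimatable" rationale (my interpretation, not stated in the thread): 1.41 is close to √2, so two consecutive requeues multiply the delay by roughly 2, which keeps mental math about the total wait feasible. A quick illustrative check:

```python
# 1.41 ~= sqrt(2), so for delay(n) = base * 1.41 ** n,
# delay(n + 2) ~= 2 * delay(n): the delay roughly doubles every two requeues.
ratio = 1.41 ** 2
print(ratio)  # ~1.988, close enough to 2 for rough estimates
```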
It sounds good to me, but not restricting requeuingCount by default would be better. (As I understand it, the pod failure policy sets 6 as the default.)
/assign
What happened:
Consider the following configuration:
And let's say we have a workload which has unschedulable pods (for whatever reason).
If a job was timed out after 1min (or 10min) of waiting, then requeueing after 1s is not relevant to end-users. It takes around 10 requeues for the backoff delay to become relevant (about 1min).
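The "around 10 requeues" estimate checks out numerically. Illustrative sketch, assuming the current defaults of a 1s base and a ~1.41 exponent:

```python
# Cumulative wait with the current defaults: base 1s, exponent ~1.41.
# Count how many requeues it takes for the total delay to exceed one minute.
total, n = 0.0, 0
while total < 60:
    total += 1.41 ** n  # delay of the n-th requeue
    n += 1
print(n, round(total, 1))  # 10 requeues, ~73.3s total
```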
What you expected to happen:
The default base backoff should be larger than 10s to make it relevant to end-users. Increase the default and potentially (TBD) make it configurable.
If we agree on the default of 10s we could back-port it to 0.6 branch.
How to reproduce it (as minimally and precisely as possible):
Example global configuration:
Example cluster configuration:
Example Job:
Watching the events we can see:
Anything else we need to know?:
The algorithm is described here.