kubernetes / website

Kubernetes website and documentation repo:
https://kubernetes.io
Creative Commons Attribution 4.0 International

Unclear definition of the `--horizontal-pod-autoscaler-initial-readiness-delay` flag #12657

Open Kanshiroron opened 5 years ago

Kanshiroron commented 5 years ago

Hello. In the Horizontal Pod Autoscaler documentation, the `--horizontal-pod-autoscaler-initial-readiness-delay` flag has an unclear definition that makes comprehension very difficult:

Due to technical constraints, the HorizontalPodAutoscaler controller cannot exactly determine the first time a pod becomes ready when determining whether to set aside certain CPU metrics. Instead, it considers a Pod "not yet ready" if it's unready and transitioned to unready within a short, configurable window of time since it started. This value is configured with the --horizontal-pod-autoscaler-initial-readiness-delay flag, and its default is 30 seconds. Once a pod has become ready, it considers any transition to ready to be the first if it occurred within a longer, configurable time since it started. This value is configured with the --horizontal-pod-autoscaler-cpu-initialization-period flag, and its default is 5 minutes.

https://github.com/kubernetes/website/blob/master/content/en/docs/tasks/run-application/horizontal-pod-autoscale.md

Thank you for clarifying

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

Kanshiroron commented 5 years ago

/remove-lifecycle stale

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

Kanshiroron commented 5 years ago

/remove-lifecycle stale

sftim commented 5 years ago

/priority backlog

fernandrone commented 4 years ago

Yeah, I would just like to add that this is really confusing :| I'm specifically interested in this:

What happens if the pod is ready before the end of the delay?

Considering both delays here: --horizontal-pod-autoscaler-cpu-initialization-period and --horizontal-pod-autoscaler-initial-readiness-delay.

It might be because I'm not a native English speaker, but the paragraph seems contradictory as well?

For example:

Due to technical constraints, the HorizontalPodAutoscaler controller cannot exactly determine the first time a pod becomes ready when determining whether to set aside certain CPU metrics. Instead, it considers a Pod "not yet ready" if it's unready and transitioned to unready within a short, configurable window of time since it started. This value is configured with the --horizontal-pod-autoscaler-initial-readiness-delay flag, and its default is 30 seconds

Ok, I can kind of get that, although it's not very clear what happens in this scenario:

Technically, that's what it says in the documentation! It says a pod is "not yet ready" only if it's unready... so I should assume that if the pod is ready, even briefly, it counts as ready, which would cause all kinds of absurd scenarios, like the one above.

This doesn't make a lot of sense to me, so I assume the HPA waits until --horizontal-pod-autoscaler-initial-readiness-delay expires before considering a pod ready, even if Kubernetes considers it ready before that. But that should have been made explicit in the documentation.

Ok, moving on.

Once a pod has become ready, it considers any transition to ready to be the first if it occurred within a longer, configurable time since it started. This value is configured with the --horizontal-pod-autoscaler-cpu-initialization-period flag, and its default is 5 minutes.

So, this says that any transition to ready counts as the first if it occurs before --horizontal-pod-autoscaler-cpu-initialization-period expires. First question: what does it mean to be "the first"? I couldn't find what the importance of being the first transition to ready is.

Second, what happens if the pod never transitions to ready before --horizontal-pod-autoscaler-cpu-initialization-period expires? Say it takes 5 minutes and 1 second to become ready? To me this clearly states that, to the HPA, the pod never becomes ready. :thinking:

I've tried searching on Google and in the Kubernetes Slack and found no definitive answer on how these parameters work, although it seems many people believe --horizontal-pod-autoscaler-cpu-initialization-period sets a wait time for new pods and prevents them from being scaled until this time passes (although I myself am not convinced). I'll see if I can run some tests in my clusters to at least get some ideas.

Ok, summing up my comments in the form of questions:

sftim commented 4 years ago

/sig autoscaling

krunalnsoni commented 4 years ago

I was trying to find the same information. The relevant code for this is here if it helps anyone: https://github.com/kubernetes/kubernetes/blob/30c9f097ca4a26dab9085832e006f09cb2993dda/pkg/controller/podautoscaler/replica_calculator.go#L392

Dafnafrank commented 4 years ago

We are also trying to figure out this issue. Would be happy to know if there are answers to the above questions.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/website/issues/12657#issuecomment-650564442):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

tengqm commented 4 years ago

/reopen
/lifecycle frozen
/language en

The doc needs some improvement with help from SIG Autoscaling. We have received quite a few votes for improving the HPA docs.

k8s-ci-robot commented 4 years ago

@tengqm: Reopened this issue.

In response to [this](https://github.com/kubernetes/website/issues/12657#issuecomment-650670961):

> /reopen
> /lifecycle frozen
> /language en
> The doc needs some improvement with the help from SIG autoscaling. We have got quite some votes for improving the HPA docs.
lujia-zhang commented 4 years ago

@Dafnafrank Do you have any update on this issue? We're facing a similar issue with the HPA and would like to know if there is any solution for it.

rrajarapu commented 4 years ago

Waiting for the same. Kindly make the documentation clearer.

sftim commented 3 years ago

/retitle Unclear definition of the --horizontal-pod-autoscaler-initial-readiness-delay flag

sftim commented 3 years ago

It looks like the text from https://horizontal-pod-autoscaler.readthedocs.io/en/latest/user-guide/initial-readiness-delay/ could be useful to draw from.

sftim commented 3 years ago

/triage accepted

zt185019 commented 3 years ago

This is still an issue. Please rewrite the whole paragraph and explain it in more detail. Thanks.

eric3chang commented 2 years ago

This is how I understand it:

It's the period after pod start during which readiness changes are treated as "initial readiness", in case the Pod goes in and out of the unready state. The code adds this delay to the pod's `startTime` and doesn't start trusting the readiness state until after `startTime + initialReadinessDelay`.

eric3chang commented 2 years ago

Actually, if you look at the code, wouldn't the initial-readiness-delay flag be used only if we are outside of the `cpuInitializationPeriod`? This code and documentation are very confusing.

shannonxtreme commented 2 years ago

I’m struggling a bit to parse https://github.com/kubernetes/kubernetes/blob/30c9f097ca4a26dab9085832e006f09cb2993dda/pkg/controller/podautoscaler/replica_calculator.go#L392, mainly because I don’t think I fully get what `After()` is doing.

This is what I understand (does it sound reasonably correct @kubernetes/sig-autoscaling-bugs @kubernetes/sig-autoscaling-misc @kubernetes/sig-scalability ?):

First, check whether the Pod has been acknowledged by the kubelet or has a Ready PodCondition. If yes, it’s added to the ready Pod count. If no:

Check whether `startTime + cpuInitializationPeriod` is still in the future. If yes, the Pod is still initializing. In that case, ignore the Pod if it is still not Ready, OR if the CPU metric wasn’t collected since the last time the status changed.

If `startTime + cpuInitializationPeriod` has already passed, ignore the Pod if it isn’t currently Ready and has not been Ready since the readiness delay period ended.

For the questions in this comment:

  1. What happens if the pods are ready before the end of --horizontal-pod-autoscaler-initial-readiness-delay?

Counts as Ready, as long as the Pod is still ready after the end of the initial readiness delay.

  2. What happens if the pods are ready before the end of --horizontal-pod-autoscaler-cpu-initialization-period?

Counts as ready. The CPU init period provides a window of time after the Pod start time in which the Pod has a chance to become ready.

  3. What is really the importance of being the "first" transition to ready, from the HPA's perspective?

I don't think there's any importance to the word "first". I think when the HPA loops and checks for the Ready state, if the Pod was Ready within the CPU init period, that's all that matters.

  4. If my Java pod takes 3 minutes to start up and uses a lot of CPU, and I want to make sure this CPU burst is not taken into account by the HPA for scale-up, which value should I set to 3 minutes? --horizontal-pod-autoscaler-initial-readiness-delay? --horizontal-pod-autoscaler-cpu-initialization-period? Both? Does it also matter when the readinessProbe returns successfully?

I'm not sure.

shannonxtreme commented 2 years ago

/assign

a-mccarthy commented 2 years ago

Hey @shannonxtreme! Can you share an update on this issue? Are you still willing to work on it?

mehabhalodiya commented 1 year ago

@shannonxtreme I don't see any updates, so I'm unassigning you. Please feel free to assign yourself if you come back and are still willing to work on this 🙂 /unassign @shannonxtreme

shinebayar-g commented 1 year ago

For 4 years there haven't been any updates to that paragraph. 😢 I wish the documentation provided some concrete examples.

sftim commented 1 year ago

/retitle Unclear definition of the --horizontal-pod-autoscaler-initial-readiness-delay flag

Contributions are welcome.

boatrainlsz commented 1 year ago

It's been 4 years. I googled a lot, and no answer demystified this flag until I saw this question, which led me here.

aude commented 1 year ago

In order to understand this flag, it could help to read and understand the original PR: https://github.com/kubernetes/kubernetes/pull/68068

matthewvalentine commented 1 year ago

I have just read through the source code, and I'm going to post what I think it does here. (This is essentially just a rephrasing of posts above, but maybe if enough people describe it in their own words, one of those will make sense to whoever is reading this in the future.)

First, neither of these settings has any effect on non-CPU metrics. For non-CPU metrics, the behavior seems to be that all Running pods are included in the calculation, regardless of Readiness.

For CPU metrics, during the cpuInitializationPeriod after pod start, a pod is included in metrics calculations if

  1. That pod is currently Ready
  2. AND, its most recent metrics sample only covers the time during which it was Ready. (You don't use an old sample from back when it was still Unready.)

After the cpuInitializationPeriod, a pod is included in metrics calculations if

  1. That pod is currently Ready
  2. OR, it was ever Ready in the past at some time after the initialReadinessDelay.

So:

  • No matter what these settings are, if you have a pod that is reporting Ready, and it has a metric sample from the time that it is Ready, that pod will be included in the scaling calculation, no matter how early into its startup it becomes Ready. These settings cannot be used to require it to have been Ready for a certain amount of time before being used. Instead, you have to configure the pod to not be Ready until the startup high-CPU phase is over (for example via initialDelaySeconds in the readiness probe).
  • Only cpuInitializationPeriod prevents old, Unready metrics samples from being used. So no matter how Readiness is configured, you still want cpuInitializationPeriod to be long enough to cover that startup phase.
  • initialReadinessDelay has no effect whatsoever if your pod never switches from Ready back to Unready. So it seems like it's only meaningful for a pod that is liable to flip-flop inconsistently between Ready and Unready during startup. (Possibly because of the high CPU usage?)

Roughly speaking, the behavior during cpuInitializationPeriod is actually what I'd expected the behavior to always be: only Ready pods matter in scaling. But that would be bad: if high CPU usage caused your pods to become Unready, the workload would never scale up (because all pods with high usage would be Unready and thus excluded from the calculation).

This is the source code in question:

```go
// Pod still within possible initialisation period.
if pod.Status.StartTime.Add(cpuInitializationPeriod).After(time.Now()) {
    // Ignore sample if pod is unready or one window of metric wasn't collected since last state transition.
    unready = condition.Status == v1.ConditionFalse || metric.Timestamp.Before(condition.LastTransitionTime.Time.Add(metric.Window))
} else {
    // Ignore metric if pod is unready and it has never been ready.
    unready = condition.Status == v1.ConditionFalse && pod.Status.StartTime.Add(delayOfInitialReadinessStatus).After(condition.LastTransitionTime.Time)
}
```

Xieql commented 4 months ago

I appreciate the discussion above. After reading through the source code myself, I'd like to offer my take on it: much like the comment above, I believe that rephrasing complex concepts in our own terms can help others trying to grasp this in the future. Here's how I understand --horizontal-pod-autoscaler-cpu-initialization-period and --horizontal-pod-autoscaler-initial-readiness-delay:

  1. Influence on CPU metric collection: both flags affect only the gathering of CPU metrics.
  2. Exclusion of unready pod metrics: metrics from pods that are considered "unready" are excluded, so those pods do not impact HPA scaling decisions during that period.
  3. Thoughts based on reading the source code:

    • --horizontal-pod-autoscaler-cpu-initialization-period (cpuInitializationPeriod): throughout this interval, CPU metrics are collected only under the stringent conditions that the pod is ready and the most recent measurement window is complete. To put it another way, during this period the CPU variability of pods that are not in a ready state will not affect HPA scaling.

    • --horizontal-pod-autoscaler-initial-readiness-delay: after the end of the cpuInitializationPeriod, this delay permits CPU metrics collection under the more relaxed condition that "the pod was previously in a ready state." To put it another way, any CPU variability from before the pod's initial readiness will not affect HPA scaling.