knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Autoscaler pod calculation when transitioning off activator drops activator metrics #3281

Closed: k4leung4 closed this issue 5 years ago

k4leung4 commented 5 years ago

In what area(s)?

/area autoscale

What version of Knative?

HEAD


Expected Behavior

When there is significant load on the activator and the autoscaler transitions from using activator metrics to pod metrics, the desired pod count should stay relatively smooth.

Actual Behavior

I have seen it suggest 1000 desired pods while using activator metrics and then drop down to 1 pod when switching over to pod metrics.

Steps to Reproduce the Problem

The autoscaler decides to use pod metrics the moment the pod metrics report concurrency greater than epsilon. At that point the metrics from the activator are ignored, and since the pods have barely reported any data yet, the autoscaler suggests a desired pod count of 1. The pod metrics then need time to ramp up before the desired pod calculation becomes accurate again: https://github.com/knative/serving/blob/master/pkg/autoscaler/autoscaler.go#L151
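
To make the behavior concrete, here is a minimal, self-contained sketch of the switch-over as described above. The names, the target concurrency, and the epsilon value are illustrative assumptions, not the actual autoscaler code:

```go
// Hypothetical sketch of the switch-over described above; names and values
// are illustrative, not taken from autoscaler.go.
package main

import "fmt"

// desiredPods is a simplified stand-in for the KPA calculation:
// desired = observed concurrency / target concurrency per pod.
func desiredPods(observedConcurrency, targetConcurrency float64) int {
	if observedConcurrency <= 0 {
		return 0
	}
	pods := int(observedConcurrency/targetConcurrency + 0.5)
	if pods < 1 {
		pods = 1
	}
	return pods
}

func main() {
	const target = 10.0
	activatorConcurrency := 10000.0 // heavy load observed at the activator
	podConcurrency := 0.5           // pods just started reporting, metrics still sparse

	// While only activator metrics exist, the recommendation is large.
	fmt.Println("from activator:", desiredPods(activatorConcurrency, target)) // 1000

	// The moment pod concurrency exceeds epsilon, pod metrics are used
	// exclusively and the activator's view is dropped, so the
	// recommendation collapses to 1 until pod metrics ramp up.
	const epsilon = 1e-4
	if podConcurrency > epsilon {
		fmt.Println("from pods:", desiredPods(podConcurrency, target)) // 1
	}
}
```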

In the issue that @mattmoor filed a while back, https://github.com/knative/serving/issues/1623, it was suggested that we keep using the activator metrics when calculating the desired pod count, possibly with a discount, rather than dropping them entirely.
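
As an illustration of that idea (again a hypothetical sketch, not a Knative API; the 0.5 discount factor is an arbitrary assumption), blending the two signals would look roughly like this:

```go
package main

import "fmt"

// blendedConcurrency keeps the activator's signal, discounted, instead of
// dropping it the moment pod metrics appear. The discount value here is an
// arbitrary illustration, not a value from the Knative code base.
func blendedConcurrency(podConcurrency, activatorConcurrency, discount float64) float64 {
	return podConcurrency + discount*activatorConcurrency
}

func main() {
	// Pods have barely reported yet, but the activator still sees load,
	// so the blended value avoids collapsing the recommendation to 1.
	fmt.Println(blendedConcurrency(0.5, 10000, 0.5)) // 5000.5
}
```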

@markusthoemmes Do you have any thoughts on how best to address this? Not sure if there is a greater concept on how the KPA should react in these scenarios.

I will experiment with using the activator metrics along with the pod metrics and see whether the resulting behavior is reasonable. I am also looking to add a unit test to verify that when pod metrics come in, the recommendation does not change too much.

markusthoemmes commented 5 years ago

@k4leung4 Oh yeah, good catch. I completely forgot about that.

There's #2379 to do that specifically. My "plan" was to implement #2977 first, which I think makes this task way easier to achieve in a "good" way.

With #2977 implemented we'd look at the total system concurrency at a given time (vs. over the whole window). The activator metrics would not need special treatment in that model because they simply add to the overall system concurrency (at that point in time).
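
Roughly, the model in #2977 would amount to something like this (a hedged sketch with made-up names, not the actual implementation):

```go
package main

import "fmt"

// systemConcurrencyAt sums the concurrency reported by every pod and by the
// activator at a single point in time. The activator needs no special
// treatment in this model; it is just another reporter of concurrency.
func systemConcurrencyAt(podConcurrencies []float64, activatorConcurrency float64) float64 {
	total := activatorConcurrency
	for _, c := range podConcurrencies {
		total += c
	}
	return total
}

func main() {
	// Two pods plus the activator, all sampled at the same instant.
	fmt.Println(systemConcurrencyAt([]float64{8, 12}, 30)) // 50
}
```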

WDYT? Should we attack #2977 ASAP or should we come up with a bandaid?

k4leung4 commented 5 years ago

Thanks for pointing out #2977, as this is the same problem outlined in that issue. The proposal there makes sense to me.

Do you have time to work on #2977? I can take a stab at it if you are busy.

FYI @yanweiguo, who is working on the metric scraping work.

markusthoemmes commented 5 years ago

@k4leung4 I indeed had enough cycles today to throw together #3289.

yanweiguo commented 5 years ago

#2977 landed. Does it solve this issue?

markusthoemmes commented 5 years ago

It does, thanks!

/close

knative-prow-robot commented 5 years ago

@markusthoemmes: Closing this issue.

In response to [this](https://github.com/knative/serving/issues/3281#issuecomment-468566719):

> It does, thanks!
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.