knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

The importance of the Activator in autoscaling #11862

Closed adriangudas closed 2 years ago

adriangudas commented 3 years ago

Hello folks,

knative version: 0.24

We’re seeing uneven traffic distribution between replicas in a 5-pod deployment in Knative Serving, for a service handling about 300-400 requests per second. For example, 2 out of the 5 pods will show 100 rps, while the other three will show 30 rps. Our minScale is set to 5, and we would expect to see a more even distribution of traffic between the 5 pods. Here’s an example:

[Screenshot: per-pod request-rate graph, 2021-08-24]

We have some questions about the activator, specifically its role once the pods have already scaled up from zero. The documentation (https://knative.dev/docs/serving/knative-kubernetes-services/) says that:

> The activator is responsible for receiving & buffering requests for inactive revisions and reporting metrics to the autoscaler. It also retries requests to a revision after the autoscaler scales the revision based on the reported metrics.

Specifically, we're wondering: what functionality is missing if we remove the activator from the request path? Will this impact the autoscaler? For example, if we remove it, and there is a burst of traffic, does the activator never get put back into the request path after initial scale is achieved, resulting in dropped requests?

Our goal primarily is to reduce complexity of load balancing decisions that happen within the cluster, perhaps relying on Istio to balance traffic between pods instead of having the activator involved in every request (using the true client IP, not the Istio defaults which don’t obey X-Forwarded-For properly). Is that a reasonable goal, or is the activator’s role in autoscaling more important than we realize? Wondering if we're going in the right direction with this.

Thanks in advance! Much appreciated!

edit: just a couple of additional notes about our deployment:

vagababov commented 3 years ago

So it is all governed by the target-burst-capacity (TBC) and target-utilization settings, which can be set per-KSvc via annotations or cluster-wide in the config map. When the ksvc has less spare capacity than TBC, the activator will be in the path; otherwise it is removed. You can also keep it in the path always (tbc=-1) or never, for scaled-up services (tbc=0).

If you have a sudden burst of traffic and spare capacity drops below the TBC level (default 200), the activator will be brought back into the request path. It acts as a circuit breaker of sorts, to avoid overloading the service pods. The activator does perform load balancing, but for containerConcurrency (CC) values > 3 it just does power-of-two-choices random LB, and it does not honor XFF headers, though it will proxy them.
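For intuition, here is a rough illustration of that kind of two-choice random load balancing (a sketch only, with made-up pod names and in-flight bookkeeping, not the activator's actual implementation):

```python
import random

def pick_pod(pods, in_flight):
    """Power-of-two-choices: sample two pods at random and send the
    request to the one with fewer in-flight requests."""
    a, b = random.sample(pods, 2)
    return a if in_flight[a] <= in_flight[b] else b

# Simulate dispatching 1000 requests across 5 pods (no request
# completions modeled, so counts only ever grow).
pods = [f"pod-{i}" for i in range(5)]
in_flight = {p: 0 for p in pods}
for _ in range(1000):
    p = pick_pod(pods, in_flight)
    in_flight[p] += 1
```

Even though each choice only compares two random pods, the resulting distribution stays close to even, which is why it is cheap enough to run on every request.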

For CC values less than 3 we recommend keeping the activator in the path to avoid queueing in the service pods; even at CC=10 it avoids some queueing, though at a much smaller scale.

So if you want to remove the activator from the path, which might be reasonable at CC=8, just set the service's tbc to 0, or to a much smaller value if you still want some burst protection.
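Concretely, that would look something like this on the Service (the service name is a placeholder; the annotations go on the revision template):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-service   # placeholder name
spec:
  template:
    metadata:
      annotations:
        # Keep the activator out of the request path for this revision;
        # use a small positive value instead for some burst protection.
        autoscaling.knative.dev/target-burst-capacity: "0"
        autoscaling.knative.dev/minScale: "5"
```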

adriangudas commented 3 years ago

Thanks for the response and detailed advice. So it sounds like we don't want to eliminate the activator entirely, because we want to retain the benefits of the burst protection that it provides, but perhaps only during the times that we really need it.

I think what we're going to do, then, is try to choose a setting that will keep the activator out of the path during regular operation, with some reasonable threshold that won't be met unless things get really bursty.

> If you have a sudden burst of traffic and spare capacity drops below the TBC level (default 200),

So one last question, then: can you clarify what is meant by spare capacity? Is this the excess_burst_capacity reported by the autoscaler?

Basically this would inform our decision about what to set as the TBC - I'm thinking 50 as a ballpark, but obviously we want to inform the decision with metrics from our app if possible. We just started scraping activator and autoscaler into Prometheus so we probably have this data, but I'm a bit confused by some of the metrics and which one I need to be looking at.
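If I'm reading the autoscaler right, spare capacity works out to roughly the following (a sketch with made-up function and parameter names, based on my understanding rather than Knative's actual code):

```python
import math

def excess_burst_capacity(ready_pods, container_concurrency,
                          observed_panic_concurrency, tbc):
    # Illustrative approximation: total serving capacity, minus the
    # concurrency observed over the panic window, minus the configured
    # target-burst-capacity. Negative => activator goes into the path.
    total_capacity = ready_pods * container_concurrency
    return math.floor(total_capacity - observed_panic_concurrency - tbc)

# 5 pods at CC=80 handling ~300 concurrent requests, with TBC=50:
print(excess_burst_capacity(5, 80, 300, 50))  # 50 -> activator stays out
```

Under that reading, picking TBC=50 means the activator only re-enters the path once we're within 50 requests' worth of concurrency of total capacity.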

Thanks!

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.