kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

VPA: Add support for a custom post processor based on VPA annotations #7533

Open zmalik opened 4 days ago

zmalik commented 4 days ago

Which component are you using?: VerticalPodAutoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.: Inject

Describe the solution you'd like.: I would like the VerticalPodAutoscaler (VPA) to support customization of its resource recommendations through annotations added to the VPA object. For instance:

  1. Annotations to specify a multiplier for CPU and memory recommendations, allowing the actual recommended values to be scaled dynamically (e.g., a vpa-post-processor.kubernetes.io/kafka_multiplier: "1.5" annotation to increase recommendations by 50%).
  2. The ability to outsource the computation of recommendations to an external service or logic, while still allowing the VPA webhook to enforce these externally computed recommendations.

This would provide fine-grained control over VPA recommendations without requiring changes to the core logic or the deployment of another controller.
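To make the idea concrete, here is a rough sketch of what an annotation-driven multiplier post-processor could look like. The helper, the annotation key format, and the function signature are illustrative assumptions for this proposal, not an existing recommender API:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// annotationPrefix is the prefix proposed in this issue (illustrative).
const annotationPrefix = "vpa-post-processor.kubernetes.io/"

// multiplyRecommendation scales a recommended resource list by the factor
// found in the VPA object's annotations, e.g.
// vpa-post-processor.kubernetes.io/kafka_multiplier: "1.5".
func multiplyRecommendation(annotations map[string]string, container string, recommended corev1.ResourceList) corev1.ResourceList {
	raw, ok := annotations[annotationPrefix+container+"_multiplier"]
	if !ok {
		return recommended // no annotation: leave the recommendation untouched
	}
	factor, err := strconv.ParseFloat(strings.TrimSpace(raw), 64)
	if err != nil || factor <= 0 {
		return recommended // ignore malformed annotations
	}

	out := corev1.ResourceList{}
	for name, qty := range recommended {
		// Scale in milli-units so small CPU values keep their precision.
		scaled := int64(float64(qty.MilliValue()) * factor)
		out[name] = *resource.NewMilliQuantity(scaled, qty.Format)
	}
	return out
}

func main() {
	rec := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("400m"),
		corev1.ResourceMemory: resource.MustParse("512Mi"),
	}
	ann := map[string]string{annotationPrefix + "kafka_multiplier": "1.5"}
	scaled := multiplyRecommendation(ann, "kafka", rec)
	cpu, mem := scaled[corev1.ResourceCPU], scaled[corev1.ResourceMemory]
	fmt.Printf("cpu=%s memory=%s\n", cpu.String(), mem.String()) // cpu=600m memory=768Mi
}
```

An externally-computed-recommendation variant would follow the same shape: instead of scaling, the post-processor (or webhook) would replace the recommended ResourceList with values fetched from the external service.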

Describe any alternative solutions you've considered.:

Additional context.:

The primary goal is to maintain the usability of VPA while allowing flexibility in how recommendations are calculated and applied. In my case, I use PSI (pressure stall information) to tune requests in a way that CPU usage alone won't reflect.

zmalik commented 4 days ago

I would love to work on this feature as well, and would be happy to join the next SIG sync to explain the proposal in more detail and gather feedback.

adrianmoisey commented 4 days ago

/area vertical-pod-autoscaler

omerap12 commented 3 days ago

I’m curious about the reasoning behind this request—why would we want to adjust VPA’s recommendations?

VPA is designed to analyze how your application actually uses resources and provide tailored suggestions. If there’s a need to consistently increase these recommendations by 50%, it raises a couple of questions:

  1. Are VPA’s recommendations not meeting your needs? If so, what’s falling short?
  2. Is there something unique about your use case that VPA isn’t accounting for?

zmalik commented 3 days ago

hey @omerap12, valid questions!

That's why I wanted to jump on the SIG sync as well: to explain it and get feedback.

I appreciate the capabilities of the VPA in optimizing resource allocations based on usage metrics or external metrics. However, we've encountered specific scenarios where VPA's default behavior doesn't fully meet our needs due to how the Linux kernel schedules CPU cycles and how this impacts our workloads.

> Are VPA’s recommendations not meeting your needs? If so, what’s falling short?

Yes, in certain cases VPA's recommendations are not fully addressing our performance requirements. The core issue, as I said, lies in how the Linux kernel allocates CPU cycles to containers and in how VPA interprets usage metrics without considering the underlying CPU scheduling dynamics.

The kernel uses CFS to allocate CPU time to processes based on their CPU shares, which are basically translated from the CPU requests specified in Kubernetes. However, during periods of high CPU demand, even if a container is using only a fraction of its CPU request—say 40%—it might still not receive enough CPU cycles promptly due to the way the kernel schedules processes. This results in increased latency and reduced performance for the container's applications.
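For reference, this is roughly the request-to-shares conversion involved. It is a simplified sketch of the well-known kubelet formula for cgroup v1 cpu.shares (cgroup v2 cpu.weight is derived from the same proportionality), not a copy of the kubelet's code:

```go
package main

import "fmt"

const (
	sharesPerCPU  = 1024 // cgroup v1 cpu.shares corresponding to one full CPU
	milliCPUToCPU = 1000
	minShares     = 2 // kernel-enforced minimum
)

// milliCPUToShares mirrors the conversion applied when turning a container's
// CPU request into cgroup v1 cpu.shares: shares are proportional to the
// request, so relative weight under contention follows requests.
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

func main() {
	for _, m := range []int64{100, 400, 1000, 1300} {
		fmt.Printf("%4dm request -> %4d cpu.shares\n", m, milliCPUToShares(m))
	}
}
```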

We've been utilizing Pressure Stall Information (PSI) metrics to monitor CPU scheduling delays. PSI metrics measure the time that tasks are ready to run but are stalled because CPU resources are not immediately available. High CPU PSI metrics indicate that processes are frequently waiting for CPU time, which can significantly impact application performance.
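A minimal sketch of how those numbers can be read, assuming the node-level /proc/pressure/cpu file (per-container pressure lives in the container's cgroup cpu.pressure file on cgroup v2); thresholds and error handling are left out:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuSomeAvg10 returns the "some avg10" value from a PSI file, i.e. the
// percentage of the last 10s in which at least one task was runnable but
// stalled waiting for CPU.
func cpuSomeAvg10(path string) (float64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Lines look like: "some avg10=1.23 avg60=0.87 avg300=0.44 total=123456"
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 || fields[0] != "some" {
			continue
		}
		for _, field := range fields[1:] {
			if strings.HasPrefix(field, "avg10=") {
				return strconv.ParseFloat(strings.TrimPrefix(field, "avg10="), 64)
			}
		}
	}
	return 0, fmt.Errorf("no 'some avg10' entry found in %s", path)
}

func main() {
	avg10, err := cpuSomeAvg10("/proc/pressure/cpu")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Printf("CPU some avg10: %.2f%%\n", avg10)
}
```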

By increasing the container's CPU requests by, for example, 30%, we effectively increase its CPU shares in the kernel scheduler. This adjustment allows the kernel to allocate CPU cycles more promptly to the container, reducing the time it spends waiting and lowering the CPU PSI metrics. Consequently, the container experiences fewer scheduling delays and improved performance.
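For concreteness, using the cgroup v1 conversion sketched above: bumping a request from 1000m to 1300m raises the container's cpu.shares from roughly 1024 to roughly 1331, so the container gets a proportionally larger slice whenever the node's runqueue is contended.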

However, VPA currently bases its recommendations on observed CPU usage without accounting for these scheduling delays reflected in PSI metrics. As a result, it may recommend lower CPU requests, inadvertently causing containers to suffer from increased CPU stalls under contention.

This is not a use case for all containers, but for a subset of containers that suffer CPU contention even while showing low CPU usage.

This has allowed us to onboard some workloads that are spiky in nature and have them actually work fine with vertical scaling.

> Is there something unique about your use case that VPA isn’t accounting for?

I don't think we should make VPA integrate CPU pressure metrics directly. To do what I do currently, I had to build a wrapper around VPA that uses the VPA recommender but mutates pods via its own mutating webhook, after cross-checking CPU pressure metrics. I would like to use VPA directly and let other engineers also build open plugins for VPA that leverage this feature.

There are a few other use cases I have been able to leverage as well, such as enabling HPA and VPA on the same scaling dimension, but I will keep those aside for now so as not to derail from the main use case.

adrianmoisey commented 2 days ago

> The kernel uses CFS to allocate CPU time to processes based on their CPU shares, which are basically translated from the CPU requests specified in Kubernetes. However, during periods of high CPU demand, even if a container is using only a fraction of its CPU request—say 40%—it might still not receive enough CPU cycles promptly due to the way the kernel schedules processes. This results in increased latency and reduced performance for the container's applications.

I just want to confirm something here. Are you talking about the scenario where the entire node is being CPU saturated? Or is it just the Pod that is being saturated?

Under normal circumstances, the scenario you describe shouldn't happen, so I'm trying to figure out what is going on at the time of this event.

zmalik commented 2 days ago

@adrianmoisey you are right. We are talking about nodes that are bin-packed efficiently, with request-based allocation between 80% and 90%. But actual usage when this happens is not necessarily near 100%.

The issue with these momentary spikes is that node exporter or traditional scraping methods won't catch them either. Node usage won't look above 90%, but we can see a container starting to experience delays in getting CPU cycles. It happens when the container does not have the most shares on the node and has a spiky nature, where it suddenly requires X cores at the same time and then, after using those cycles, doesn't ask for them again for a while.

At the same time, yes, you are right that these are not normal circumstances, and we have seen nodes with 95% CPU usage and no container getting stalled on CPU at all. So this mostly depends on the nature of the workload.

adrianmoisey commented 2 days ago

Oh, and another question, are you setting CPU limits at all on your Pods?

I'm still just getting context around your use case, trying to figure out what options we have moving forward.

zmalik commented 2 days ago

No limits, so that spikes can actually happen.

I do add limits (setting a CPU quota at the cgroup layer) for the most offending containers if node CPU usage goes beyond a certain threshold, but that has more to do with extreme usage, like 90% and beyond. It is something that runs totally in parallel and is outside the scope of VPA.

I was mostly thinking about how I can fall back to using VPA again without asking for another source of truth to be integrated inside VPA. Another use case is stabilizing usage together with HPA, for example, where a deployment will eventually tend towards minReplicas over time when recommendations are based on the HPA's targetCPUUtilization.

adrianmoisey commented 18 hours ago

Interesting.

Part of me feels like this is operating exactly as designed.

The VPA is setting a recommendation, which sets requests. The Kubernetes scheduler uses those requests to place Pods accordingly, and those same requests also set the CPU shares/weight for each Pod.

In a perfect world where the workload is flat, there is no problem here.

However, no workload is flat. You're trying to handle the case when there is a CPU spike.

What are the options here?

  1. Add a buffer into the system, so that when a spike does happen, the node isn't CPU saturated (kube-reserved or system-reserved could do this). This option isn't ideal, since you can only define a static value, and all it really does is schedule fewer CPU requests per node, meaning you basically run fewer Pods per node and hope that no spike is large enough to saturate it.
  2. Limit the spiky workloads. Keeping a limit on the workload that spikes may help keep CPU saturation down enough that the important workloads can continue to operate.
  3. Increase requests for the important workloads. This is what you're basically asking for.

I'm wondering if the multiple recommenders feature could work here? If the workload is important, use a recommender that has a higher target-cpu-percentile set.
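For illustration, opting a workload into a separate recommender instance (one deployed with a higher --target-cpu-percentile flag) would look roughly like this; the Go types are quoted from memory of the v1 API and the recommender name is made up:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	vpa_types "k8s.io/autoscaler/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1"
)

func main() {
	// A VPA object that opts into an alternate recommender instance, which
	// would be deployed separately with a higher --target-cpu-percentile.
	// "high-percentile" is a hypothetical recommender name.
	vpa := vpa_types.VerticalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "important-workload", Namespace: "default"},
		Spec: vpa_types.VerticalPodAutoscalerSpec{
			Recommenders: []*vpa_types.VerticalPodAutoscalerRecommenderSelector{
				{Name: "high-percentile"},
			},
		},
	}
	fmt.Println("recommender:", vpa.Spec.Recommenders[0].Name)
}
```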

For some reason the multiplier idea feels wrong to me, but I can't figure out why I feel that way. But the more I look at the options, the more it may make sense to add. I've also learnt that we already have post-processors (ie: this one), so maybe it would be good to figure out whether it does make sense to add this feature.