kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

VPA daemonset recommendations per-pod based on node metadata #5928

Open jcogilvie opened 1 year ago

jcogilvie commented 1 year ago

Which component are you using?: vertical pod autoscaler

Is your feature request designed to solve a problem? If so, describe the problem this feature should solve.: Some daemonsets are composed of pods whose resource needs vary depending on the node they run on, and by their nature they cannot horizontally scale out of this problem.

Consider the case where a kube cluster is running a cluster autoscaler that provisions all manner of different node types based on cheapest-available capacity (e.g., karpenter using AWS spot).

In the case of dramatically variable node sizes, a pod belonging to the datadog agent daemonset will require more resources on an instance hosting many pods than a member of the same daemonset running on a tiny instance with only a few pods.

Describe the solution you'd like.:

I would like the VPA to (optionally) provide recommendations along an extra dimension for DaemonSets, such as ENI max pods for the host, and size DS pods individually based on this dimension. The recommender might suggest a memory configuration of any given pod based on historical memory_consumed/node_max_pods instead of a single memory value across the daemonset.
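
To make that concrete, here is a rough sketch (not existing VPA code; the linear model, data shape, and numbers are made up for illustration) of sizing along such a dimension: learn how memory grows with a node's max-pods value from history, then recommend for each pod based on its own node's value.

```go
package main

import "fmt"

// sample pairs a node's max-pods value with the memory (bytes) a daemonset pod
// actually used on that node. The data shape is hypothetical, for illustration only.
type sample struct {
	nodeMaxPods int
	memoryBytes float64
}

// recommendMemory fits a least-squares line memory = base + perPod*maxPods over
// historical samples and evaluates it for the target node's max-pods value.
// It assumes the history contains at least two distinct node sizes.
func recommendMemory(history []sample, targetMaxPods int) float64 {
	var n, sumX, sumY, sumXY, sumXX float64
	for _, s := range history {
		x, y := float64(s.nodeMaxPods), s.memoryBytes
		n++
		sumX += x
		sumY += y
		sumXY += x * y
		sumXX += x * x
	}
	perPod := (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	base := (sumY - perPod*sumX) / n
	return base + perPod*float64(targetMaxPods)
}

func main() {
	history := []sample{
		{nodeMaxPods: 17, memoryBytes: 180e6},
		{nodeMaxPods: 58, memoryBytes: 420e6},
		{nodeMaxPods: 110, memoryBytes: 750e6},
	}
	// A pod landing on a large node gets a bigger recommendation than one on a
	// small node, instead of one daemonset-wide memory value.
	fmt.Printf("recommended memory for a 234-max-pods node: %.0f bytes\n",
		recommendMemory(history, 234))
}
```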

Describe any alternative solutions you've considered.:

The alternative is to overprovision the daemonset by a large margin on small instances, or to limit cluster node variability.

Additional context.:

Running on AWS EKS 1.24 with Karpenter.

fbalicchia commented 1 year ago

Hi, thanks for opening the issue; we have the same need.

At the time of writing, I think that before we can address this at the VPA level we need the in-place pod resize capability, because when we resize a pod it is restarted and we don't know which node it will be bound to.

Avoiding the scheduler by setting nodeName in the Pod spec could be one approach, but statically assigning a pod to a node is obviously not optimal: a node can fail or become unavailable, and we would not be verifying the node's available capacity before assigning the pod to it.
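
To make the static-assignment option concrete, this is roughly what it would mean, sketched with the core/v1 Go types (the node and image names are placeholders):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "agent-abc12"},
		Spec: corev1.PodSpec{
			// Setting NodeName pins the pod to one node and bypasses the scheduler,
			// so nothing verifies the node's remaining capacity before the pod lands there.
			NodeName: "ip-10-0-1-23.ec2.internal",
			Containers: []corev1.Container{
				{Name: "agent", Image: "datadog/agent:latest"},
			},
		},
	}
	fmt.Println("pod pinned to node:", pod.Spec.NodeName)
}
```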

On the other hand, extending the default scheduler to handle binding the pod to a node could be a possible solution, but I have some concerns about that approach. WDYT?

jcogilvie commented 1 year ago

Thanks for the comment @fbalicchia. I'm not an expert in this space, so I can't really speak to your suggestions.

I don't know at what point the affinity of a pod is determined, but I do know that if I inspect the pods of a daemonset, each one has an explicit affinity for the node on which it is intended to run. If that information is available early enough, it might be usable for this case.
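
For reference, this is the affinity I mean: the DaemonSet controller injects a required node affinity that pins each pod to its designated node by name before the default scheduler binds it. Built with the core/v1 Go types it looks roughly like this (the node name is a placeholder):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Placeholder; the DaemonSet controller fills in the real target node per pod.
	nodeName := "ip-10-0-1-23.ec2.internal"

	affinity := corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					// Each daemonset pod matches exactly one node by name via a field
					// selector, so the target node is known before the pod starts.
					MatchFields: []corev1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
	fmt.Printf("%+v\n", affinity)
}
```

So a recommender that can read this affinity (or spec.nodeName once the pod is bound) could in principle look up the target node's max-pods or capacity before the pod starts running.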

jbartosik commented 1 year ago

This is one of the things I was thinking of supporting once we have in-place update support (#4016).

I'd rather do something that supports multiple similar use cases (where one workload has instances with somewhat different resource requirements), with daemonsets as one of them, than build a dedicated feature for daemonsets.

jcogilvie commented 1 year ago

I think that's a good goal @jbartosik. I'm having trouble thinking of how to generalize this to all deployments. Are you giving up on the idea of prediction and just deferring the decision until runtime?

One of the valuable elements of this suggestion is that you would know beforehand how big a specific pod is likely to be based on some external, measurable factor.

jbartosik commented 1 year ago

I don't have a specific proposal yet, just some ideas. As I wrote, I think this is something to look at after we have support for in-place updates.

We need a way to detect pods that have unusual resource usage for their deployment. Waiting for actual usage data to come in is one way we could detect that; another is using different metrics (similar to how you proposed using node size here).

bernot-dev commented 11 months ago

we need the in-place pod resize capability, because when we resize a pod it is restarted and we don't know which node it will be bound to

Why is this a blocker? If a pod is resized and then rescheduled to a different node, it seems like it just needs to respect any existing affinities.

jbartosik commented 11 months ago

we need the in-place pod resize capability, because when we resize a pod it is restarted and we don't know which node it will be bound to

Why is this a blocker? If a pod is resized and then rescheduled to a different node, it seems like it just needs to respect any existing affinities.

My guess is that it's something about the node that makes the resource usage of different pods of the same daemonset different (size of the node, number of pods running on it, amount of logging happening, ...).

So if we don't know which node a pod will live on, we don't know how many resources it will need.

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jcogilvie commented 7 months ago

/remove-lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jcogilvie commented 4 months ago

/remove-lifecycle stale

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

jcogilvie commented 1 month ago

/remove-lifecycle stale