kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

Steady memory leak in VPA recommender #6368

Open DLakin01 opened 6 months ago

DLakin01 commented 6 months ago

Which component are you using?:

vertical-pod-autoscaler, recommender only

What version of the component are you using?:

0.14.0

What k8s version are you using (kubectl version)?:

1.26

What environment is this in?:

AWS EKS, multiple clusters and accounts, multiple types of applications running on the cluster

What did you expect to happen?:

VPA recommender should run at more or less the same memory level throughout the lifetime of a particular pod

What happened instead?:

There is a steady memory leak that is especially visible over a period of days, as seen in this screen capture from our Datadog dashboard: [screenshot: recommender memory usage climbing steadily over several days]

The upper lines with the steeper slope are from our large multi-tenant clusters, but the smaller clusters also experience the leak, albeit more slowly. If left alone, memory reaches 200% of requests before the pod gets kicked. The recommender in the largest cluster was tracking 3161 PodStates at the time this issue was created.

How to reproduce it (as minimally and precisely as possible):

Not sure how reproducible the issue is outside of running VPA in a large cluster with > 3000 pods and waiting several days to see if the memory creeps up.

Anything else we need to know?:

We haven't yet created any VPA CRDs to generate recommendations, waiting until a future sprint to begin rolling those out.
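(For context, a minimal VerticalPodAutoscaler object in recommendation-only mode looks roughly like the sketch below; the metadata and target names are placeholders, and `updateMode: "Off"` means the recommender computes recommendations without evicting anything.)

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa        # placeholder name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app          # placeholder target workload
  updatePolicy:
    updateMode: "Off"     # recommendations only, no evictions
```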

vkhacharia commented 4 months ago

We also face the same issue. Our version is 0.11 with k8s version 1.24. Below is a Grafana snippet covering the period since the last restart.

[screenshot: Grafana graph of recommender memory usage]
voelzmo commented 4 months ago

Hey @vkhacharia @DLakin01 thanks for bringing this up!

To some extent, this behavior is expected, and from these graphs alone it is hard to tell whether it is normal or not. The recommender keeps metrics for each container, regardless of whether that container is under VPA control. I guess the reasoning is that you get accurate recommendations immediately if you decide to enable VPA for a container at a later point in time. You can switch off this default behavior by enabling memory saver mode.

Even with memory saver mode enabled, some memory growth is expected:

If you're rolling Pods approximately the same number of times per week, memory is expected to grow for ~2 weeks before leveling off. If you're adding Containers and don't have memory saver mode enabled, memory will grow with every Container.

If all of those parameters are controlled and you still see memory growth, I guess this really is a memory leak that shouldn't happen.
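The growth-then-plateau dynamic described above can be sketched with a toy model. Every constant below is a hypothetical placeholder for illustration, not a measured value from the recommender:

```python
# Toy model of recommender memory growth from Pod churn.
# All constants are hypothetical placeholders, chosen only to illustrate
# the "grows for ~2 weeks, then plateaus" behavior described above.

HISTORY_DAYS = 14                    # assumed retention window (the "~2 weeks")
NEW_CONTAINERS_PER_DAY = 200         # hypothetical churn from rollouts
BYTES_PER_CONTAINER_STATE = 50_000   # hypothetical per-container bookkeeping cost

def tracked_containers(day: int, live: int = 3000) -> int:
    """Live containers plus rolled (dead) ones whose history hasn't aged out."""
    lingering = NEW_CONTAINERS_PER_DAY * min(day, HISTORY_DAYS)
    return live + lingering

def est_memory_mib(day: int) -> float:
    """Estimated recommender memory in MiB on a given day after startup."""
    return tracked_containers(day) * BYTES_PER_CONTAINER_STATE / 2**20

# Memory grows while dead-container history accumulates, then plateaus
# once churn older than the retention window has aged out.
print(round(est_memory_mib(7), 1), round(est_memory_mib(14), 1), round(est_memory_mib(30), 1))
# → 209.8 276.6 276.6
```

With memory saver mode on, only containers matched by a VPA object are tracked, so the `live` baseline in this sketch shrinks accordingly.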

vkhacharia commented 4 months ago

@voelzmo Thanks for the quick response. I wanted to try it now, but noticed that I am on k8s version 1.24, which is compatible with VPA recommender 0.11. I don't see the memory-saver parameter in the code on the branch for version 0.11.

voelzmo commented 4 months ago

Hey @vkhacharia, thanks for your efforts! VPA 0.11.0 also has memory saver mode, but the parameter is defined in a different place in the code; it was moved to the section linked above by a later refactoring.

So you can still turn on --memory-saver=true and see what this does for you. Hope that helps!
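(For anyone else landing here: enabling it amounts to adding the flag to the recommender's container args, roughly as in this sketch; the image tag and names are illustrative and your deployment layout may differ.)

```yaml
# Excerpt from a vpa-recommender Deployment spec (illustrative)
containers:
  - name: recommender
    image: registry.k8s.io/autoscaling/vpa-recommender:0.14.0
    args:
      - --memory-saver=true   # only track containers that have a matching VPA object
```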

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 days ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten