Performance issues when using VPA resource recommendations

koflerm commented 5 years ago

We are using the Vertical Pod Autoscaler in Update Mode "off" or rather said only for the recommendations in the VPA resources. When customers apply the resource recommendations they face performance issues and OOM events even though we set the lower bound for CPU recommendations to 150m. We thought a possible solution would be to decrease the metrics server resolution time because it is currently set to 30 seconds which may not detect CPU and memory peaks.

What should be our next steps?

kgolab commented 5 years ago

Hi there,

could you please let us know a little bit more about the problem? Things that are of interest include:

how long the workload lives before a problematic recommendation is provided?
how long the VPA object lives before the recommendation is provided?
how stable (vs spiky) resource usage of your workload is?
what kind of workload is it (stateless vs stateful; what kind of programming language)?
are you using a hosted VPA or your own installation?

Another thing that might be helpful is to check verticalpodautoscalercheckpoint object related to the VPA object, it will contain the histogram data which might sched some light at what happened.

bskiba commented 5 years ago

Additional question, which version of VPA are you using?

koflerm commented 5 years ago

We are using Vertical Pod Autoscaler Version 0.5.0 hosted in our OpenShift cluster. The following routine is used for our resource recommendations: a cronjob is executed every hour which creates VPA objects for every scalable resource in the customer projects. Our "reporter" sends an email to the customers including their resource recommendations on wednesday. These emails are only send if the recommendations strongly differ from the current resource assignment and if the VPA object has been created 24h ago or more to ensure the recommendations are valid.

kgolab commented 5 years ago

Assuming that the cronjob creates VPA objects only for new resources (ones that don't have VPA object yet), I don't see anything fundamentally wrong with this scenario - manual actuation based on dry-run is a valid use-case.

Few additional questions, building on top of the initial set:

is the problem with too-low a recommendation a common one or rather a sporadic issue?
how many containers do the Pods have (the ones where the problem occurs)?
could you please double-check what recommendation was given vs what was manually applied? I know it sounds obvious but might be worth double-checking
is there any significant change in the load after applying recommendations (like maybe someone turns on another application afterwards)?
what is the usual utilisation of resources on Node (both actual usage & sum of requests)?

I understand that some of the questions might be not easy to answer or the answers might leak more data than you'd want to share. On the other hand the more information we have, the more we're likely to pinpoint a possible cause so please share as much as you easily can. In particular the numerical data from verticalpodautoscalercheckpoint might sched more light, akin to #1182.

koflerm commented 5 years ago

As far as we know 2 Projects in our cluster face this issue. The problem is we do not know how many customers actually applied the recommendations.
usually 1 container per pod
Checked it, a.t.m. more resources then recommended are assigned because we cannot risk a downtime
No there isn't
Node resources: CPU requested: 22%, CPU utilization: 13%, memory requested: 89%, memory utilization: 9%

The current checkpoint from an instance where OOM events occured:

apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscalerCheckpoint
metadata:
  creationTimestamp: 2019-03-29T12:50:59Z
  generation: 1
  name: xx
  namespace: x
  resourceVersion: "x"
  selfLink: /x/x/xx
  uid: x
spec:
  containerName: x
  vpaObjectName: x-vpa
status:
  cpuHistogram:
    bucketWeights:
      "0": 10000
      "1": 233
      "2": 26
      "3": 31
      "4": 1
      "5": 16
      "6": 1
      "7": 25
      "8": 34
      "9": 54
      "10": 54
      "11": 29
      "12": 40
      "13": 24
      "14": 45
      "15": 50
      "16": 11
      "17": 22
    referenceTimestamp: 2019-05-15T00:00:00Z
    totalWeight: 234.72183
  firstSampleStart: null
  lastSampleStart: 2019-05-16T08:45:30Z
  lastUpdateTime: null
  memoryHistogram:
    bucketWeights:
      "44": 243
      "45": 88
      "46": 384
      "47": 111
      "48": 93
      "49": 10
      "51": 10000
      "55": 23
      "59": 44
    referenceTimestamp: 2019-05-16T00:00:00Z
    totalWeight: 1.3683399
  totalSamplesCount: 18876
  version: v3

kgolab commented 5 years ago

From the histograms I can see that the application in question:

usually sits idle (under 10 mCPU) - this would explain why the minimal recommendation is returned; at the same time the biggest spike is around 260 mCPU and there the application might be throttled (this depends whether limits are set and how much CPU is available at the moment - on average there is a lot from what you've shown)
usually uses around 2.2 GB of RAM, with pretty stable pattern; is it a JVM by any chance?
used at most 3.5 GB of RAM but this is sporadic (possibly happened only once).

In this particular case I'd expect that VPA returns minimal CPU recommendation and recommends around 2.5-2.6 GB RAM. Is this what you observe?

Are the throttled / OOM-ing applications using resource limits? If yes - RAM, CPU, both? Do you by any chance know how much RAM the application wanted to use when an OOM happened? It's sometimes visible in the event.

koflerm commented 5 years ago

Yes recommended CPU is 150m (because we set the lower bound to it, earlier it was ~20m) and the recommended memory is 2.6 GB.

Besides, the application is written in Python sorry I forgot to mention that.

Yes we are using resource limits for CPU and RAM. The recommendation includes a recommended request (target from VPA) and a recommended limit (upper bound from VPA), which is, in the case of the application above, the following:

CPU request: 150m
CPU limit: 150m
memory request: ~2.6GB
memory limit: ~2.8GB

Below you can find the VPA object:

apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata:
  creationTimestamp: 2019-03-29T12:33:02Z
  generation: 1
  name: x-vpa
  namespace: x
  resourceVersion: "x"
  selfLink: /x/x/x
  uid: xxx
spec:
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 150m
        memory: 100Mi
  targetRef:
    apiVersion: apps.openshift.io/v1
    kind: DeploymentConfig
    name: x
  updatePolicy:
    updateMode: "Off"
status:
  conditions:
  - lastTransitionTime: 2019-03-29T12:34:33Z
    status: "True"
    type: RecommendationProvided
  recommendation:
    containerRecommendations:
    - containerName: x
      lowerBound:
        cpu: 150m
        memory: "2407137410"
      target:
        cpu: 150m
        memory: "2677845899"
      uncappedTarget:
        cpu: 25m
        memory: "2677845899"
      upperBound:
        cpu: 150m
        memory: "2880606341"

kgolab commented 5 years ago

I'd recommend dropping the CPU limit, it's actual reason the application is CPU-throttled. AFAIK the limit is enforced with 1-second granularity so if the app is busy, it will burn its CPU allocation in 0.15s and be forced to sit idle for the next 0.85s. If you feel that you really need this limit I'd gladly hear more as to why exactly.

As for memory the situation is more tricky. From the checkpoint data the memory usage looks pretty stable but apparently sometimes the application requires much more RAM. Do you know what could be causing this? Are there some rare activities that would cause sudden increase in RAM usage (maybe a periodic refresh of caches)? Maybe the moment of OOM and OOM event data could lead to finding this behaviour. Once we understand the problem better we might come up with some solution for it. At the moment I don't see anything that would help here out-of-the-box.

BTW - using upperBound as input for limit is a new idea for me, I need to mull it over.

koflerm commented 5 years ago

The problem with removing the CPU limits is that we have a lot of customers and removing quota limits is not possible in this case or rather prohibited. Also, we do not have very much CPUs left to assign. The usage above is only the usage of 1 node all in all we have a CPU limit overcommitment of 200% (this is the Grafana monitoring limit which includes completed pods so it is not that much but definitely over 100%) and a memory limit commitment of ~30% which is no problem. The CPU usage is the main reason for introducing the VPA in our cluster so removing the CPU limits is critical. Maybe you can tell me a proper value to set the CPU limit e.g. 2x the target or something like that.

In the meanwhile, I will do some research regarding the memory peaks of the python application and keep you updated on that.

kgolab commented 5 years ago

Regarding CPU limit - I'd still recommend dropping it altogether. The Pods are scheduled based on resource requests and not limits so once the requests are OK, you should be fine with regards to CPU usage and the app could use any spare cycles that the Node happens to have at the moment.

For me the problem starts with the fact that most of the time the application is not using CPU at all and thus VPA drives the recommendation towards 0. Now, if you have multiple applications like that, many of them will fit into a Node simply because of minimal CPU request. When a usage peaks for a given application, the real question is whether it's only this app that's under load (or at least the peaks are independent for any given pair of apps) or does the peak concern multiple applications at once? In the former case, if you don't set limits, the application would use spare CPU and there should be no throttling. In the latter case I think there is a more subtle problem whether you really want to set the resource requests based on average usage or more on the peak (or maybe on the acceptable performance during the peak).

koflerm commented 5 years ago

We cannot drop the limits because dropping the limit means dropping the limit in the quota. And dropping the limit in the quota means we cannot limit the CPU usage of customers which needs to be ensured. We cannot give customers the right to choose a CPU limit.

I think the next step will be to use the upper bound as request.

Another problem we are facing is that the recommender cannot retrieve resource history data from prometheus with error code 403 forbidden even though we gave the vpa-recommender service account cluster-admin rights temporarily. Do you know what can be wrong here? Isn't the request for gathering the historical data authenticated using the service account?

UPDATE: 25.05. 13:00 I suceeded in configuring VPA to use prometheus but here are some major issues:

It looks like the requests used to retrieve the history storage is not using the authentication token from the "vpa-recommender" service account which was a problem for us because we are using an oauth proxy for prometheus authetication which had to be disabled for now. It would be really nice if you use the authentication token for the history storage requests in order to work with oatuh authentication.
I had to increase the memory limit to 10 GB because that's the amount of memory the VPA recommender needed to load the data sucessfully from prometheus.

Another question occured when watching the recommender log: Why is it always printing the following error for every VPA object even though we do not use v1beta1 label selection? Can I disable v1beta1 label selection?

E0520 12:53:42.712335 1 cluster_feeder.go:421] Error while fetching legacy selector. Reason: v1beta1 selector not found

kgolab commented 5 years ago

Re: Prometheus I'm glad you've managed to get it working. I'm a little bit surprised about the memory, how many VPA objects & pods do you have in the cluster?

Re: limits If you really don't want to assign/give idle CPU cycles to anyone than indeed the limits are the way to go. The problem with coarse throttling & latency spikes will remain, though.

If you didn't care about idle CPU cycles and only wanted to make sure nobody can request more resources than they are allowed to, then quota on CPU requests would do the trick. But this leaves the possibility to get more CPU than allowed by quota when the cluster has spare resources.

Using upper bound as resource request is not likely to solve the problem here - this value is only for VPA updater, to avoid unnecessary evictions of workloads that are close to the recommended resources. With time it's very likely to be close to target recommendation anyway.

koflerm commented 5 years ago

Currently we have 284 VPA instances in our cluster, but the memory peak is no problem.

Fortunately, the performance issues seem to be fixed. Because of the prometheus data, the recommendations are now more accurate and everything seems to work perfect now so the solution was to enable prometheus as the history storage.

The only problem at the moment is, as mentioned above, the VPA recommender spams the following error message: E0520 12:53:42.712335 1 cluster_feeder.go:421] Error while fetching legacy selector. Reason: v1beta1 selector not found

Do you know any solution for that? Can I disable the v1beta1 selection?

milesbxf commented 5 years ago

Currently we have 284 VPA instances in our cluster, but the memory peak is no problem.

Fortunately, the performance issues seem to be fixed. Because of the prometheus data, the recommendations are now more accurate and everything seems to work perfect now so the solution was to enable prometheus as the history storage.

The only problem at the moment is, as mentioned above, the VPA recommender spams the following error message: E0520 12:53:42.712335 1 cluster_feeder.go:421] Error while fetching legacy selector. Reason: v1beta1 selector not found

Do you know any solution for that? Can I disable the v1beta1 selection?

+1, also seeing this - though looks like the legacy selector has now been removed on master in https://github.com/kubernetes/autoscaler/commit/4e07e1eac160759c9c515aa0c7d2db137dd7b06e#diff-a766133cfa48bb3e35b13670f45a8497.

@bskiba would it be possible to cut a new release?

bskiba commented 5 years ago

@milesbxf is the log message causing issues? I believe it has the wrong severity level, but other than that shouldn't be problematic. I'm working on having the new release cut, probably in the next couple weeks.

milesbxf commented 5 years ago

@bskiba no issues other than pure volume of logs, so if it were lower severity that would work well too 👍

Tediferous commented 5 years ago

Currently we have 284 VPA instances in our cluster, but the memory peak is no problem.

Fortunately, the performance issues seem to be fixed. Because of the prometheus data, the recommendations are now more accurate and everything seems to work perfect now so the solution was to enable prometheus as the history storage.

The only problem at the moment is, as mentioned above, the VPA recommender spams the following error message: E0520 12:53:42.712335 1 cluster_feeder.go:421] Error while fetching legacy selector. Reason: v1beta1 selector not found

Do you know any solution for that? Can I disable the v1beta1 selection?

I am also running into v1beta1 selector error

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

bskiba commented 4 years ago

I believe the v1beta1 problem has been fixed since. Please reopen if I am missing something.

bskiba commented 4 years ago

/close

k8s-ci-robot commented 4 years ago

@bskiba: Closing this issue.

In response to [this](https://github.com/kubernetes/autoscaler/issues/2016#issuecomment-575557765): >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

qhh0205 commented 2 years ago

recommendations strongly differ hi, @koflerm , i have see that you said you send emails when the recommendations strongly differ from the current resource assignment, and i want to know how do you define the differ? Are there any algorithm？thanks!

kubernetes / autoscaler

Performance issues when using VPA resource recommendations #2016