kubernetes / autoscaler

Autoscaling components for Kubernetes

Wrong CPU utilization with --ignore-daemonsets-utilization=true #6576

Closed AnhQKatalon closed 3 months ago

AnhQKatalon commented 9 months ago

Which component are you using?: Cluster Autoscaler

What version of the component are you using?: v1.27.2

Component version:

What k8s version are you using (kubectl version)?: EKS 1.27

kubectl version Output
$ kubectl version

What environment is this in?: EKS

What did you expect to happen?: CPU utilization is around 0.52

What happened instead?: Node A is not suitable for removal - cpu utilization too big (0.709220)

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Hello all,

Currently, I am using Cluster Autoscaler for our EKS Cluster. I have just noticed that the log showing the CPU utilization for my nodes does not seem correct.

I have turned on the option --ignore-daemonsets-utilization=true, but the reported CPU utilization still seems to include the DaemonSet pods. Below is the information for my node:

[screenshot: the pods running on the node and their CPU requests]

Only the final pod is created from the Deployment. All the other pods are from DaemonSets. So from the FAQ, I expect the CPU utilization to be calculated as (CPU requests excluding DaemonSet pods) / (CPU allocatable) = 1000 / 1930 ≈ 0.52

But the cluster-autoscaler pod log outputs the following: Node XXX is not suitable for removal - cpu utilization too big (0.709220)

I am not sure how the number 0.709220 is calculated. Below is the command used by the autoscaler pod:

./cluster-autoscaler --cloud-provider=aws --namespace=kube-system --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/eks-cluster-prod --balance-similar-node-groups=false --cordon-node-before-terminating=true --daemonset-eviction-for-empty-nodes=true --daemonset-eviction-for-occupied-nodes=true --ignore-daemonsets-utilization=true --ignore-mirror-pods-utilization=true --logtostderr=true --max-graceful-termination-sec=1200 --scale-down-delay-after-add=5m --scale-down-enabled=true --scale-down-unneeded-time=10m --scale-down-utilization-threshold=0.2 --skip-nodes-with-custom-controller-pods=false --skip-nodes-with-local-storage=false --status-config-map-name=cluster-autoscaler-status --stderrthreshold=info --v=4 --write-status-configmap=true

I would really appreciate your help with this case.

daimaxiaxie commented 8 months ago

In the code, the utilization is calculated as (CPU requests excluding DaemonSet pods) / (CPU allocatable - DaemonSet pod requests): https://github.com/kubernetes/autoscaler/blob/c96aa9b97087603cb6884e1af7c20fa2969fb86d/cluster-autoscaler/simulator/utilization/info.go#L125
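For illustration, here is a minimal sketch of that formula (simplified, not the upstream cluster-autoscaler code). The millicore values are hypothetical and only chosen to match the numbers in this issue: if the DaemonSet pods on this node request roughly 520m in total, the formula above gives 1000 / (1930 - 520) ≈ 0.709220, which matches the logged value.

```go
package main

import "fmt"

// utilization sketches the calculation referenced above (simplified, not the
// actual cluster-autoscaler implementation): with
// --ignore-daemonsets-utilization=true, DaemonSet pod requests are excluded
// from the numerator AND subtracted from the allocatable capacity in the
// denominator.
func utilization(totalRequests, daemonSetRequests, allocatable int64, ignoreDaemonSets bool) float64 {
	requests := totalRequests
	capacity := allocatable
	if ignoreDaemonSets {
		requests -= daemonSetRequests
		capacity -= daemonSetRequests
	}
	return float64(requests) / float64(capacity)
}

func main() {
	// Hypothetical millicore values for this node: 1000m requested by the
	// Deployment pod, ~520m requested by DaemonSet pods, 1930m allocatable.
	fmt.Printf("%.6f\n", utilization(1520, 520, 1930, true))  // 0.709220
	fmt.Printf("%.6f\n", utilization(1520, 520, 1930, false)) // 0.787565 (DaemonSets counted)
}
```

In other words, with --ignore-daemonsets-utilization=true the DaemonSet requests are also subtracted from the denominator, so the result is higher than the (requests excluding DaemonSets) / (total allocatable) ratio that the FAQ wording suggests.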

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 4 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 3 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/autoscaler/issues/6576#issuecomment-2295337819):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

whatnick commented 1 month ago

/bump