kubernetes-sigs / node-feature-discovery

Node feature discovery for Kubernetes
Apache License 2.0

Keep getting CPUThrottlingHigh alert on the `gc` pod #1724

budimanjojo closed this issue 3 months ago

budimanjojo commented 4 months ago

What happened: After updating to v0.16.0, I keep getting a CPUThrottlingHigh alert on the garbage collection pod, like this:

CPUThrottlingHigh (Info)
Description: 35.71% throttling of CPU in namespace kube-system for container gc in pod node-feature-discovery-gc-696b644f9-2rwql.
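For context, the percentage in an alert like this is typically the share of CFS scheduling periods in which the container was throttled. A rough sketch of such a rule, modeled on the CPUThrottlingHigh alert from the kubernetes-mixin project, could look like the following. The group name, the 25% threshold, the 5m window, and the `for` duration are assumptions and may differ in a given monitoring stack.

```yaml
# Sketch of a Prometheus alerting rule in the style of CPUThrottlingHigh.
# Threshold, window, and durations are assumptions, not values from this issue.
groups:
  - name: cpu-throttling-sketch   # hypothetical group name
    rules:
      - alert: CPUThrottlingHigh
        # Ratio of throttled CFS periods to all CFS periods, per container.
        expr: |
          sum(increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (container, pod, namespace)
            /
          sum(increase(container_cpu_cfs_periods_total[5m])) by (container, pod, namespace)
            > 0.25
        for: 15m
        labels:
          severity: info
        annotations:
          description: >-
            {{ $value | humanizePercentage }} throttling of CPU in namespace
            {{ $labels.namespace }} for container {{ $labels.container }}
            in pod {{ $labels.pod }}.
```

Because the ratio counts throttled periods rather than total CPU use, a container with a very small CPU limit can trip it even when the cluster is mostly idle.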

What you expected to happen: Everything should keep running as it did before. I have fairly default Helm values:

master:
  extraLabelNs:
    - gpu.intel.com

How to reproduce it (as minimally and precisely as possible): Use the latest v0.16.0 with the values above.

budimanjojo commented 4 months ago

Maybe this line is set too low https://github.com/kubernetes-sigs/node-feature-discovery/blob/560905fbee7bb8fe475831cc3b86f3a62d78d43e/deployment/helm/node-feature-discovery/values.yaml#L533

Or maybe there's a bug in the garbage collection logic that makes it use too many resources.

marquiz commented 4 months ago

Thanks @budimanjojo for reporting this. How big is your cluster (ca. how many nodes)?

In retrospect, setting the CPU limits might not have been such a good idea. We might want to remove those (and cut a patch release) 🧐

The most immediate fix for you would probably be to remove the CPU limits, i.e. run the Helm install with `--set gc.resources.limits.cpu=null`.
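The same override can also be kept in a values file. A minimal sketch follows, relying on Helm's behavior of dropping a chart default when a key is set to `null`. The file name `values-override.yaml` is hypothetical, and the `master.extraLabelNs` entry is simply carried over from the values in the report.

```yaml
# values-override.yaml (hypothetical file name)
master:
  extraLabelNs:
    - gpu.intel.com
gc:
  resources:
    limits:
      # null removes the chart's default CPU limit for the gc container;
      # any other limits and requests keep their chart defaults.
      cpu: null
```

It could then be passed with `-f values-override.yaml` to `helm install` or `helm upgrade`.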

budimanjojo commented 4 months ago

Hi @marquiz! I have a 3-node cluster, so it's a pretty small one.

Yeah, I agree with having no CPU limits set by default, at least on the gc pod. Should I open a PR, or should I just wait?

marquiz commented 4 months ago

> I have a 3-node cluster, so it's a pretty small one.

OK, not a huge one, then. 😅 Looks like we need to investigate that a bit further 🤔

> Yeah, I agree with having no CPU limits set by default, at least on the gc pod. Should I open a PR, or should I just wait?

Please do, more contributors -> better 😊 Let's remove CPU limits for all daemons. Also, we need to update the parameter tables in docs/deployment/helm.md accordingly (for the defaults).

budimanjojo commented 4 months ago

@marquiz I just created the PR, please take a look. I removed CPU limits for all daemons instead of just the garbage collection pod according to your recommendation.