feature(kubelet): add goroutines metric in the kubelet component

googs1025 commented 2 months ago

What would you like to be added?

I want to add a new metric to track the number of goroutines in kubelet. There seems to be no metric defined in kubelet to track the number of goroutines.

Why is this needed?

We often use metrics such as go_goroutines or go_sched_goroutines_goroutines to view the active goroutines of kubelet (in Prometheus, query go_goroutines{job="kubelet"} or go_sched_goroutines_goroutines{job="kubelet"}). However, these metrics do not show which operations the goroutines belong to. Therefore, I want to add a per-operation goroutine metric to the kubelet component. For example, several operations in kubelet may currently spawn many goroutines:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pod_workers.go#L945
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/images/puller.go#L55
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/kuberuntime_container.go#L843

In addition, kube-scheduler has implemented a similar feature. Tracked in: https://github.com/kubernetes/kubernetes/pull/112003
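
To make the proposal concrete, here is a minimal sketch of a per-operation goroutine gauge. It uses only the Go standard library as a stand-in for a labeled gauge metric. The type goroutineGauge, its methods, and the operation name "image_pull" are hypothetical and are not taken from kubelet or kube-scheduler code.

```go
package main

import (
	"sync"
)

// goroutineGauge is a hypothetical, stdlib-only stand-in for a gauge
// metric with an "operation" label. It is not an existing kubelet type.
type goroutineGauge struct {
	mu     sync.Mutex
	counts map[string]int
}

func newGoroutineGauge() *goroutineGauge {
	return &goroutineGauge{counts: make(map[string]int)}
}

// Go runs fn in a new goroutine and counts it under operation while fn
// is running. The returned channel is closed after fn has returned and
// the count has been decremented.
func (g *goroutineGauge) Go(operation string, fn func()) <-chan struct{} {
	done := make(chan struct{})
	g.add(operation, 1)
	go func() {
		// Deferred calls run in reverse order: decrement first, then close.
		defer close(done)
		defer g.add(operation, -1)
		fn()
	}()
	return done
}

// add changes the count for operation by delta.
func (g *goroutineGauge) add(operation string, delta int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.counts[operation] += delta
}

// Value returns how many tracked goroutines are currently running
// for operation.
func (g *goroutineGauge) Value(operation string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.counts[operation]
}
```

A call site would wrap its go statement, for example gauge.Go("image_pull", func() { ... }). Restricting operation names to a small fixed set keeps the number of label values bounded.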

googs1025 commented 2 months ago

/sig node

ffromani commented 2 months ago

Hi! This is nice, but could you please elaborate on a couple of use cases in which having the breakdown of goroutines per operation helps?

googs1025 commented 2 months ago

Sure. A few days ago, there were too many goroutine alerts on some nodes within our company's internal infrastructure. The investigation showed that the kubelet was running too many goroutines, so we needed to identify the root cause. Currently, we are using go_goroutines{job="kubelet"} for the investigation. However, I believe it would be very helpful for troubleshooting if kubelet could provide a more detailed breakdown of goroutine counts.

The examples I provided in the issue are places in the kubelet source code where I found that a significant number of goroutines could be created. These are just the parts that I personally consider relatively important. There are also other areas that launch multiple goroutines. If necessary, we can continue to add those as well. :)

https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pod_workers.go#L945 This is the part of the code in kubelet where a large number of goroutines are created when many pods exist. We should document this for monitoring metrics.
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/images/puller.go#L55 This is the section where images are asynchronously pulled, which can take a considerable amount of time. We should document this for monitoring metrics.
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/kuberuntime_container.go#L843 This is the place where containers are deleted. When a large number of pods are deleted simultaneously, it might also create multiple goroutines. We should document this for monitoring metrics.

ffromani commented 2 months ago

/sig instrumentation

Overall I think this makes sense and it's a nice addition, but let's double-check with sig instrumentation. One thing that comes to mind is that I'm not sure we have a baseline or a safe upper bound. Perhaps it's not the best example, but let's consider async image pull. How many goroutines is "too many"? Even if we have unwarranted spikes, is there any remediation besides "file an issue on k/k"?

That said, overall I think that knowing the numbers won't hurt and it's a prerequisite for any further refinement, so I like this.

dashpole commented 2 months ago

cc @logicalhan

dashpole commented 2 months ago

/triage accepted

logicalhan commented 2 months ago

I'm not sure you're going to have many helpful labels here, or else the labels will be unbounded. It's better to use the Go runtime metrics and use pprof for deeper analysis.
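
For readers unfamiliar with the pprof route, the following is a rough in-process sketch of what a goroutine profile provides, using the standard runtime/pprof package. The helper name goroutineDump is hypothetical, and how the profile would actually be collected from a running kubelet is not covered here.

```go
package main

import (
	"bytes"
	"runtime"
	"runtime/pprof"
)

// goroutineDump is a hypothetical helper. It returns the current
// goroutine count and a text goroutine profile of this process.
func goroutineDump() (int, string, error) {
	var buf bytes.Buffer
	// debug=1 writes a human-readable profile in which goroutines with
	// identical stacks are grouped together with a count.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 1); err != nil {
		return 0, "", err
	}
	return runtime.NumGoroutine(), buf.String(), nil
}
```

The grouped stacks show which call sites hold the most goroutines, without adding any new metric labels.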

googs1025 commented 2 months ago

Hey, @logicalhan. Thanks for your response, and thanks to the folks from sig instrumentation for joining the discussion. As mentioned in the issue, metrics like go_goroutines are currently helpful for troubleshooting. However, the Go runtime metrics may not clearly show which specific operations in the kubelet are causing an excessive number of goroutines. I believe that adding metrics specifically for kubelet goroutines could enhance both troubleshooting and observability (without impacting performance). This way, we could configure better displays in Grafana or set up more effective alerts. Of course, the kubelet component naturally creates and destroys many goroutines, so where to add these metrics for better insights seems worth discussing.

googs1025 commented 2 months ago

/cc @SergeyKanzhelev @bart0sh @kannon92 just friendly ping! If you don't mind, could you join the discussion? :)

SergeyKanzhelev commented 2 months ago

too many goroutine alerts

How many was too many? Are there any additional details you can share on what the downside is here and how one would use this information?
