Open googs1025 opened 2 months ago
/sig node
Why is this needed?
Although we often use metrics such as `go_goroutines` or `go_sched_goroutines_goroutines` to view the active goroutines of the kubelet (queried in Prometheus as `go_goroutines{job="kubelet"}` or `go_sched_goroutines_goroutines{job="kubelet"}`), these metrics do not show which operations the goroutines belong to. Therefore, I want to add such a metric to the kubelet component. For example, several operations in the kubelet may currently spawn many goroutines:
- https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pod_workers.go#L945
- https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/images/puller.go#L55
- https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/kuberuntime_container.go#L687

In addition, kube-scheduler has implemented a similar feature. Issue tracked: #112003
Hi! This is nice, but could you please elaborate on a couple of use cases where having the breakdown of goroutines per operation helps?
Sure. This came up because there were too many goroutine alerts on some nodes within our company's internal infrastructure a few days ago. During the investigation, we found too many goroutines in the kubelet, which prompted the need to identify their root cause. Currently, we are using `go_goroutines{job="kubelet"}` for the investigation. However, I believe it would be very helpful for troubleshooting if the kubelet could provide a more detailed breakdown of goroutine counts.
The examples I provided in the issue are places in the kubelet source code where I found that a potentially significant number of goroutines are created. These are just the parts I personally consider relatively important; there are other areas where multiple goroutines run as well. If necessary, we can add those too. :)
/sig instrumentation
Overall I think this makes sense and it's a nice addition, but let's double-check with SIG Instrumentation. One thing that comes to mind is that I'm not sure we have a baseline or a safe upper bound. Perhaps it's not the best example, but let's consider async image pull. How many goroutines is "too many"? Even if we have unwarranted spikes, is there any remediation besides "file an issue on k/k"?
That said, overall I think that knowing the numbers won't hurt and it's a prerequisite for any further refinement, so I like this.
cc @logicalhan
/triage accepted
I'm not sure you're going to have many helpful labels here, or else the labels will be unbounded. It's better to use the Go runtime metrics and use pprof for deeper analysis.
Hey, @logicalhan. Thanks for your response, and thanks to the folks from SIG Instrumentation for joining the discussion. As mentioned in the issue, metrics like `go_goroutines` are currently helpful for troubleshooting. However, the Go runtime metrics do not show which specific operations in the kubelet are causing an excessive number of goroutines. I believe that adding metrics specifically for kubelet goroutines could improve both troubleshooting and observability without impacting performance. With them, we could build better Grafana dashboards or set up more effective alerts. Of course, the kubelet naturally creates and destroys many goroutines, so where exactly to add these metrics seems like a topic worth discussing.
/cc @SergeyKanzhelev @bart0sh @kannon92 just friendly ping! If you don't mind, could you join the discussion? :)
> too many goroutine alerts
How many was too many? Are there any additional details you can share on what the downside is here and how one would use this information?
What would you like to be added?
I want to add a new metric to track the number of goroutines in kubelet. There seems to be no metric defined in kubelet to track the number of goroutines.