zoucheng2018 opened this issue 3 years ago
@zoucheng2018 Can you answer a few questions here?
- Do you have the Container Insights add-on enabled for your cluster (or any other monitoring solution)?
- What node size are you seeing this on?
- How many containers are running on the node?
- Are limits/requests configured for the pods, and how densely packed is the node?
Thanks!
Also: what version of Kubernetes, and what was the ETW command you ran?
Do you have the Container Insights add-on enabled for your cluster (or any other monitoring solution)?
I believe so. There are also Geneva agents running, but they don't appear to be the source of the RPC calls to vmcompute.
What node size are you seeing this on?
The node SKU is Standard_D32_v3.
How many containers are running on the node?
The high vmcompute CPU usage is quite pervasive across all our clusters. The clusters are generally not busy, so I'm not sure it's related to the workload. The machines I ran traces on had about 5-10 containers.
Are limits/requests configured for the pods / how densely packed is the node?
Yes, they are. Most clusters are not very dense, but some are.
Also: what version of Kubernetes? and What was the etw command you ran?
We’re running 1.18.
PerfView command to collect CPU trace: PerfView Collect PerfView-Manual /BufferSize:3072 /Circular:3072 /MaxCollectSec:120 /KernelEvents=Process+Thread+Profile+ImageLoad /ClrEvents:GC+Loader+Exception+Stack /Zip /AcceptEULA /NoView /NoNGENRundown /NoGui
RPC Trace: PerfView Collect PerfView-RPC /KernelEvents=Process+Thread+ImageLoad /providers:Microsoft-Windows-RPC:Microsoft-Windows-RPC/Debug::stack /ClrEvents:Loader /BufferSize:2048 /Circular:2048 /MaxCollectSec:120 /Zip /AcceptEULA /nogui /NoView
A quick update here (and sorry for the delay): we've found that the OS isn't as optimized as it could be when returning some memory statistics. I have a change that speeds things up a bit (https://github.com/microsoft/hcsshim/pull/1362), although I'll let @marosset or @jsturtevant speak to the Container Insights extension, as I don't know how much it adds in terms of query volume. I'm hoping we can get that change out into AKS in the next month, but that's the optimist in me haha.
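For anyone who wants to see roughly where these queries land, here is a minimal sketch (my own illustration, not code from the PR above) that drives the container statistics path through the legacy github.com/Microsoft/hcsshim API. The assumption, based on the stacks reported in this issue, is that Statistics() is the client-side call that ends up in vmcompute!HcsRpc_GetSystemProperties on the service side.

```go
// Sketch only: enumerate containers via the legacy hcsshim API and query
// their statistics, the kind of call whose cost the linked PR reduces.
// Windows-only; error handling is intentionally minimal.
package main

import (
	"fmt"
	"log"

	"github.com/Microsoft/hcsshim"
)

func main() {
	// List the compute systems (containers) that the HCS knows about.
	containers, err := hcsshim.GetContainers(hcsshim.ComputeSystemQuery{})
	if err != nil {
		log.Fatalf("listing compute systems: %v", err)
	}

	for _, props := range containers {
		c, err := hcsshim.OpenContainer(props.ID)
		if err != nil {
			log.Printf("open %s: %v", props.ID, err)
			continue
		}

		// Statistics() asks vmcompute for the container's memory, CPU,
		// and storage counters; this is the expensive query.
		stats, err := c.Statistics()
		if err != nil {
			log.Printf("statistics %s: %v", props.ID, err)
		} else {
			fmt.Printf("%s: commit bytes=%d, total runtime (100ns)=%d\n",
				props.ID, stats.Memory.UsageCommitBytes, stats.Processor.TotalRuntime100ns)
		}
		c.Close()
	}
}
```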
We've helped fix the readiness probe in Container Insights over the last several months, which should help the performance of the container that runs on Windows. We also made two perf improvements to kubelet in 1.23 that reduce overall CPU usage: https://github.com/kubernetes/kubernetes/pull/105744 and https://github.com/kubernetes/kubernetes/pull/104287
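This isn't the actual kubelet change from those PRs, but the general shape of this kind of mitigation is to cache the result of an expensive stats query for a short TTL, so a 500ms polling loop doesn't translate one-to-one into vmcompute RPCs. A minimal self-contained sketch of that technique:

```go
// Sketch of TTL-based caching for an expensive query. The fetch function
// below is a stand-in; in the real scenario it would be the vmcompute call.
package main

import (
	"fmt"
	"sync"
	"time"
)

type statsCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetched time.Time
	value   string
	fetch   func() (string, error)
}

// Get returns the cached value if it is still fresh, otherwise refreshes it.
func (c *statsCache) Get() (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.fetched.IsZero() && time.Since(c.fetched) < c.ttl {
		return c.value, nil // fresh enough: skip the expensive call
	}
	v, err := c.fetch()
	if err != nil {
		return "", err
	}
	c.value, c.fetched = v, time.Now()
	return v, nil
}

func main() {
	calls := 0
	cache := &statsCache{
		ttl: 2 * time.Second,
		fetch: func() (string, error) {
			calls++ // counts how often the "backend" is actually hit
			return fmt.Sprintf("stats snapshot #%d", calls), nil
		},
	}

	// Ten callers at 500ms intervals only hit the backend a handful of times.
	for i := 0; i < 10; i++ {
		v, _ := cache.Get()
		fmt.Println(v)
		time.Sleep(500 * time.Millisecond)
	}
	fmt.Printf("backend calls: %d\n", calls)
}
```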
For my repro, disabling Container Insights seems to have helped, although we had other changes in the cluster and still need more time to confirm whether it's indeed Container Insights specifically. Looking forward to the fix so we can re-enable Container Insights.
vmcompute.exe is taking up to 25% of a core on our AKS managed nodes; the CPU profiling data shows the time is mostly spent in vmcompute!HcsRpc_GetSystemProperties (detailed stack in the screenshot attached to the issue).
RPC ETW traces indicate the calls are made from kubelet.exe; one example trace shows it making the call every 500ms. This API is quite expensive inside vmcompute.exe; can we tune the frequency or use alternative APIs?
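To put a rough number on "quite expensive", one back-of-the-envelope check is to time the same statistics query at the same 500ms cadence and multiply the average latency by the call rate and container count. The sketch below is my own illustration, assuming the legacy github.com/Microsoft/hcsshim API and a container ID passed on the command line; it is not how kubelet actually issues the query, and client-side latency is only a proxy for vmcompute's CPU cost.

```go
// Sketch: time per-container Statistics() calls at a 500ms cadence and
// estimate how much of a core the queries alone could account for.
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/Microsoft/hcsshim"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: timestats <container-id>")
	}
	c, err := hcsshim.OpenContainer(os.Args[1])
	if err != nil {
		log.Fatalf("open container: %v", err)
	}
	defer c.Close()

	const interval = 500 * time.Millisecond // the cadence seen in the RPC trace
	const samples = 20
	var total time.Duration

	for i := 0; i < samples; i++ {
		start := time.Now()
		// This is the client-side call that corresponds to the expensive
		// properties query observed in the traces (assumption).
		if _, err := c.Statistics(); err != nil {
			log.Fatalf("statistics: %v", err)
		}
		total += time.Since(start)
		time.Sleep(interval)
	}

	avg := total / samples
	fmt.Printf("average query latency: %v\n", avg)
	fmt.Printf("rough load per container at 2 queries/s: %.1f%% of a core\n",
		float64(avg)/float64(time.Second)*2*100)
}
```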