microsoft / hcsshim

Windows - Host Compute Service Shim

High vmcompute.exe CPU due to frequent HcsGetComputeSystemProperties calls. #989

Open zoucheng2018 opened 3 years ago

zoucheng2018 commented 3 years ago

vmcompute.exe is taking up to 25% of a core on our AKS managed nodes. CPU profiling data shows the time is mostly spent in vmcompute!HcsRpc_GetSystemProperties. The image below shows the detailed stack:

[image: CPU profile with the vmcompute!HcsRpc_GetSystemProperties call stack]

RPC ETW traces indicate the calls are made from kubelet.exe; one example trace shows a call every 500ms. This API is quite expensive inside vmcompute.exe; can we tune the call frequency or use an alternative API?

[image: RPC ETW trace showing HcsGetComputeSystemProperties calls from kubelet.exe roughly every 500ms]
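
For context, here is a minimal sketch (assuming the legacy hcsshim Go API; this is not the actual kubelet code path) of what a per-container stats poll looks like. Each Statistics() call is serviced by vmcompute.exe through HcsGetComputeSystemProperties, so polling every 500ms per container translates directly into the vmcompute CPU seen above.

```go
package main

import (
	"fmt"
	"log"

	"github.com/Microsoft/hcsshim"
)

func main() {
	// Enumerate running compute systems (containers) on the node.
	containers, err := hcsshim.GetContainers(hcsshim.ComputeSystemQuery{})
	if err != nil {
		log.Fatal(err)
	}
	for _, props := range containers {
		c, err := hcsshim.OpenContainer(props.ID)
		if err != nil {
			log.Printf("open %s: %v", props.ID, err)
			continue
		}
		// Statistics() is backed by an HcsGetComputeSystemProperties RPC
		// into vmcompute.exe; a tight polling loop over many containers
		// is what shows up as vmcompute CPU in the profile above.
		stats, err := c.Statistics()
		c.Close()
		if err != nil {
			log.Printf("stats %s: %v", props.ID, err)
			continue
		}
		fmt.Printf("%s memory: %+v\n", props.ID, stats.Memory)
	}
}
```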

marosset commented 3 years ago

@zoucheng2018 Can you answer a few questions here?

Do you have the Container Insights add-on enabled for your cluster (or any other monitoring solution)?
What node size are you seeing this on?
How many containers are running on the node?
Are limits/requests configured for the pods / how densely packed is the node?

Thanks!

jsturtevant commented 3 years ago

Also: what version of Kubernetes, and what was the ETW command you ran?

zoucheng2018 commented 3 years ago

Do you have the Container Insights add-on enabled for your cluster (or any other monitoring solution)?
I believe so. There are also Geneva agents running, but they don't appear to be the source of the RPC calls to vmcompute.

What node size are you seeing this on?
The node SKU is Standard_D32_v3.

How many containers are running on the node?
The high vmcompute CPU issue is quite pervasive across all our clusters. The clusters are generally not busy, so I'm not sure it's related to the workload. The machines on which I ran traces had about 5-10 containers.

Are limits/requests configured for the pods / how densely packed is the node?
Yes, they are. Most clusters are not very dense, but some are.

Also: what version of Kubernetes, and what was the ETW command you ran?
We're running 1.18.

PerfView command to collect the CPU trace: PerfView Collect PerfView-Manual /BufferSize:3072 /Circular:3072 /MaxCollectSec:120 /KernelEvents=Process+Thread+Profile+ImageLoad /ClrEvents:GC+Loader+Exception+Stack /Zip /AcceptEULA /NoView /NoNGENRundown /NoGui

RPC Trace: PerfView Collect PerfView-RPC /KernelEvents=Process+Thread+ImageLoad /providers:Microsoft-Windows-RPC:Microsoft-Windows-RPC/Debug::stack /ClrEvents:Loader /BufferSize:2048 /Circular:2048 /MaxCollectSec:120 /Zip /AcceptEULA /nogui /NoView

dcantah commented 2 years ago

An update here (and sorry for the delay): we've found that the OS isn't as optimized as it could be when returning some memory statistics. I have a change that speeds things up a bit: https://github.com/microsoft/hcsshim/pull/1362. I'll let @marosset or @jsturtevant speak to the container-insights extension, as I don't know how much it adds in terms of query volume. I'm hoping we can get that change out into AKS in the next month, but that's the optimist in me haha.
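
Until a platform-side fix rolls out, one purely illustrative client-side mitigation (this is not what the PR above does; cachedStats and ttl are made-up names) is to cache the last properties result and only refresh it at a bounded rate, so frequent pollers don't each trigger a fresh HcsGetComputeSystemProperties RPC:

```go
package statscache

import (
	"sync"
	"time"

	"github.com/Microsoft/hcsshim"
)

// cachedStats is a hypothetical per-container wrapper that rate-limits
// Statistics() queries; the type name and ttl field are illustrative only.
type cachedStats struct {
	mu   sync.Mutex
	id   string        // container (compute system) ID
	ttl  time.Duration // how stale a cached result may be
	last time.Time
	val  hcsshim.Statistics
}

// Get returns the cached statistics if they are fresh enough; otherwise it
// opens the container and issues a single new properties query.
func (c *cachedStats) Get() (hcsshim.Statistics, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if time.Since(c.last) < c.ttl {
		return c.val, nil
	}
	container, err := hcsshim.OpenContainer(c.id)
	if err != nil {
		return hcsshim.Statistics{}, err
	}
	defer container.Close()
	stats, err := container.Statistics()
	if err != nil {
		return hcsshim.Statistics{}, err
	}
	c.val, c.last = stats, time.Now()
	return c.val, nil
}
```

With ttl set to a few seconds, a 500ms poller would mostly hit the cache, trading a little staleness for vmcompute CPU.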

jsturtevant commented 2 years ago

We've helped fix the readiness probe in Container Insights over the last several months, which should improve the performance of the container that runs on Windows. We also made two perf improvements to kubelet in 1.23 that reduce overall CPU usage: https://github.com/kubernetes/kubernetes/pull/105744 and https://github.com/kubernetes/kubernetes/pull/104287

yanrez commented 2 years ago

For my repro, disabling Container Insights seems to have helped, although we had other changes in the cluster and still need more time to confirm whether it is indeed Container Insights specifically. Looking forward to the fix so we can re-enable Container Insights.