silenceper opened this issue 2 years ago
In Windows Kubernetes, setting a liveness probe returns an error:
Windows Server version:
@silenceper Do you know if the underlying runtime is containerd or docker?
@dcantah docker
Version: 20.10.17
API version: 1.41
Go version: go1.17.11
Git commit: 100c701
Built: Mon Jun 6 23:09:02 2022
OS/Arch: windows/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.17
API version: 1.41 (minimum version 1.24)
Go version: go1.17.11
Git commit: a89b842
Built: Mon Jun 6 23:03:58 2022
OS/Arch: windows/amd64
Experimental: false
Sometimes the error message is: hcsshim::System::CreateProcess: failure in a Windows system call: The RPC server is unavailable. (0x6ba)
The liveness probe command is just ipconfig.
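The probe definition is roughly like this (a sketch; the timing values here are placeholders, only the exec command reflects what we actually use):
livenessProbe:
  exec:
    command:
    - ipconfig
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5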
@silenceper We're seeing this as well in some cases and were looking into it, so it's good we have another repro; it sounds like it's not a containerd-only issue. Do you have any info on memory/CPU usage for the nodes this is occurring on? Are you able to get onto the host to run some commands? If so, I have a profile that will collect some ETW events that will help us look into things.
Also curious: once this happens, does the machine ever get back into a normal state? Do any other containers successfully launch?
Sometimes the error message is: error during hcsshim::System::CreateProcess: failure in a Windows system call: The paging file is too small for this operation to complete. (0x5af)
But I can see that the machine has plenty of memory available.
When the error occurs, the machine itself looks healthy, and the next ipconfig probe may succeed, but the error is occurring more and more frequently.
The machine monitoring information is as follows:
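For reference, a rough way to compare page-file usage with physical memory on the node would be something like this (PowerShell sketch; these particular queries are an assumption, not commands taken from this thread):
# Page file allocation and current usage, in MB
Get-CimInstance Win32_PageFileUsage | Select-Object Name, AllocatedBaseSize, CurrentUsage, PeakUsage
# Free vs. total physical memory, in KB
Get-CimInstance Win32_OperatingSystem | Select-Object FreePhysicalMemory, TotalVisibleMemorySize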
@dcantah I have permission to execute commands on the machine; which commands do I need to run?
@silenceper Hey, sorry for the delay. If you're able to get a consistent repro, then follow these steps: https://docs.microsoft.com/en-us/virtualization/windowscontainers/troubleshooting#capturing-hcs-verbose-tracing. This captures quite a bit of data, which may include currently running processes, node info (Windows build, number of processors, etc.), and file paths accessed for some events. If that's alright, please send the trace to dcanter@microsoft.com and I can take a look.
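Roughly, the capture on the node looks like this (the profile and output file names below are placeholders; the linked page has the exact trace profile to download):
# start verbose HCS tracing with the downloaded WPR profile
wpr -start HcsTraceProfile.wprp -filemode
# ...reproduce the failing probe / exec...
# stop tracing and write the events to an .etl file
wpr -stop HcsTrace.etl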
This may be networking related; some folks on the networking side are looking at it and hope to have an answer.
Hoping for a solution
@silenceper Could you share what image you were using, if possible (and the command)? Mostly so I can repro this. I've found an odd scenario where the process in the container responsible for actually carrying out the process launch is crashing, and this is likely the reason. The error message is a bit different, but it's in the same family of "RPC thing didn't work".
Joining in here - we're also experiencing this issue with AKS on different image versions:
AKSWindows-2019-containerd-17763.3650.221202
AKSWindows-2019-containerd-17763.3287.220810
We see it with the ama-logs-windows pod (with the effect that the liveness probe no longer works, which means the pod no longer restarts even though it should).
We can also reproduce it by trying to kubectl exec into the pod while it is in that state:
kubectl exec -n kube-system ama-logs-windows-pg88k -- cmd.exe
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "778b1b4dee402ab2371bbe0af43e76df8ed4a8d56d33af02672ec61c8d640dc4": hcs::System::CreateProcess c5d961ebbb1316464add3d00f8b91034d034fa35869db7f0ff1f655883d15101: The RPC server is unavailable.: unknown
We've also created an ETW trace that contains this error, e.g.:
ProcessorNumber="2" wilActivity="{ hresult:-2147023174, fileName:"onecore\\vm\\compute\\service\\cexec\\lib\\cexeclib.cpp", lineNumber:220, module:"vmcompute.exe", failureType:1, message:"", threadId:30312, callContext:"\\HcsRpc_CreateProcess\\ComputeSystemManager_ExecuteProcess\\WindowsContainer_ExecuteProcess", originatingContextId:2105, originatingContextName:"HcsRpc_CreateProcess", originatingContextMessage:"", currentContextId:2107, currentContextName:"WindowsContainer_ExecuteProcess", currentContextMessage:"", failureId:1441, failureCount:577, function:"" }"
Can we do something to help investigate the problem?