microsoft / hcsshim

Windows - Host Compute Service Shim
MIT License

encountered an error during hcsshim::System::CreateProcess: failure in a Windows system call: The remote procedure call failed. (0x6be) #1472

Open silenceper opened 2 years ago

silenceper commented 2 years ago

On a Windows Kubernetes node, a container's liveness probe returns this error:

I0729 23:14:34.342056   75972 remote_runtime.go:402] [RemoteRuntimeService] ExecSync Response (containerID=2b8bdb47d92e98b80d5ffc8b45ac0d7ff18182838f1d0479b5582a86c132934b, ExitCode=126)
I0729 23:14:34.342056   75972 exec.go:62] Exec probe response: "container 2b8bdb47d92e98b80d5ffc8b45ac0d7ff18182838f1d0479b5582a86c132934b encountered an error during hcsshim::System::CreateProcess: failure in a Windows system call: The remote procedure call failed. (0x6be)\r\n"

Windows Server version:

  Kernel Version:             10.0.17763.2628
  OS Image:                   Windows Server 2019 Datacenter

dcantah commented 2 years ago

@silenceper Do you know if the underlying runtime is containerd or docker?

silenceper commented 2 years ago

@dcantah Docker:

Version:           20.10.17
 API version:       1.41
 Go version:        go1.17.11
 Git commit:        100c701
 Built:             Mon Jun  6 23:09:02 2022
 OS/Arch:           windows/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.17
  API version:      1.41 (minimum version 1.24)
  Go version:       go1.17.11
  Git commit:       a89b842
  Built:            Mon Jun  6 23:03:58 2022
  OS/Arch:          windows/amd64
  Experimental:     false

silenceper commented 2 years ago

Sometimes the error message is instead: hcsshim::System::CreateProcess: failure in a Windows system call: The RPC server is unavailable. (0x6ba)

silenceper commented 2 years ago

The liveness probe command is just ipconfig.

dcantah commented 2 years ago

@silenceper We're seeing this as well in some cases and were looking into it, so it's good we have another repro. It sounds like it's not only a containerd issue. Do you have any info on memory/CPU usage for the nodes this is occurring on? Are you able to get onto the host to run some commands? If so, I have a profile that will collect some ETW events that will help us look into things.

dcantah commented 2 years ago

Also curious: once this happens, does the machine ever get back into a normal state? Do any other containers launch successfully?

silenceper commented 2 years ago

Sometimes the error message is: error during hcsshim::System::CreateProcess: failure in a Windows system call: The paging file is too small for this operation to complete. (0x5af)

But the machine appears to have plenty of memory available.
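
For reference, the three codes quoted so far in this thread (0x6be, 0x6ba, 0x5af) are standard Win32 error codes. The sketch below pulls the trailing code out of an hcsshim-style error string and maps it to its winerror.h name. The helper names and the lookup table are hypothetical and cover only these three codes. Note that 0x5af is ERROR_COMMITMENT_LIMIT, which refers to the system commit limit rather than free physical memory, so it may not contradict the host showing plenty of memory; that reading is an assumption, not a diagnosis.

```python
import re

# Win32 error codes quoted in this thread, with their winerror.h names.
# Illustrative table only: it covers just these three codes.
WIN32_ERRORS = {
    0x5AF: "ERROR_COMMITMENT_LIMIT",
    0x6BA: "RPC_S_SERVER_UNAVAILABLE",
    0x6BE: "RPC_S_CALL_FAILED",
}


def win32_code_from_message(message):
    """Return the last '(0x...' code in an error string as an int, or None."""
    matches = re.findall(r"\(0x([0-9a-fA-F]+)", message)
    return int(matches[-1], 16) if matches else None


def describe(message):
    """Map an error string to 'decimal (hex): NAME' (hypothetical helper)."""
    code = win32_code_from_message(message)
    if code is None:
        return "no Win32 code found"
    name = WIN32_ERRORS.get(code, "unknown to this table")
    return f"{code} (0x{code:x}): {name}"


if __name__ == "__main__":
    print(describe(
        "failure in a Windows system call: "
        "The remote procedure call failed. (0x6be)"
    ))
    # prints: 1726 (0x6be): RPC_S_CALL_FAILED
```

Grouping probe failures by this code rather than by the full message text would make it easier to see how often each variant occurs on a node.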

silenceper commented 2 years ago

When the error occurs, the machine otherwise looks normal, and the next ipconfig probe may succeed, but the error is occurring more frequently. The machine monitoring information is as follows:

(image: machine monitoring information)

silenceper commented 2 years ago

@dcantah I have permission to run commands on the machine. Which commands do I need to run?

dcantah commented 2 years ago

@silenceper Hey, sorry for the delay. If you're able to get a consistent repro, then follow these steps: https://docs.microsoft.com/en-us/virtualization/windowscontainers/troubleshooting#capturing-hcs-verbose-tracing. This captures quite a bit of data, which may include currently running processes, node info (Windows build, number of processors, etc.) and file paths accessed for some events. If that's alright, please send the trace to dcanter@microsoft.com and I can take a look.

dcantah commented 2 years ago

This may be networking-related. Some folks on the networking side are looking at this and hope to have an answer.

silenceper commented 2 years ago

Hoping for a solution

dcantah commented 2 years ago

@silenceper Could you share what image you were using, if possible (and the command)? Anything that helps us repro this would be useful. I've found an odd scenario where the process in the container responsible for actually carrying out the process launch is crashing, and this is likely the reason. The error msg is a bit different, but in the same family of "rpc thing didn't work".

fschmied commented 1 year ago

Joining in here - we're also experiencing this issue with AKS on different image versions:

We see it with the ama-logs-windows pod (with the effect that the liveness probe stops working, which means the pod no longer restarts although it should).

We can also reproduce it by trying to kubectl exec into the pod while it is in that state:

kubectl exec -n kube-system ama-logs-windows-pg88k -- cmd.exe
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "778b1b4dee402ab2371bbe0af43e76df8ed4a8d56d33af02672ec61c8d640dc4": hcs::System::CreateProcess c5d961ebbb1316464add3d00f8b91034d034fa35869db7f0ff1f655883d15101: The RPC server is unavailable.: unknown

We've also created an ETW trace that contains this error, e.g.:

ProcessorNumber="2" wilActivity="{ hresult:-2147023174, fileName:"onecore\\vm\\compute\\service\\cexec\\lib\\cexeclib.cpp", lineNumber:220, module:"vmcompute.exe", failureType:1, message:"", threadId:30312, callContext:"\\HcsRpc_CreateProcess\\ComputeSystemManager_ExecuteProcess\\WindowsContainer_ExecuteProcess", originatingContextId:2105, originatingContextName:"HcsRpc_CreateProcess", originatingContextMessage:"", currentContextId:2107, currentContextName:"WindowsContainer_ExecuteProcess", currentContextMessage:"", failureId:1441, failureCount:577, function:"" }" 
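
The hresult field in that trace is a signed 32-bit integer, which hides the underlying code. A minimal sketch of unpacking it, assuming the standard HRESULT layout (failure bit, 11-bit facility, 16-bit code): -2147023174 is 0x800706BA, which is facility 7 (FACILITY_WIN32) wrapping Win32 code 0x6BA, the same "The RPC server is unavailable" error that kubectl exec reports. The function name decode_hresult is hypothetical.

```python
FACILITY_WIN32 = 7  # facility value used when an HRESULT wraps a Win32 error


def decode_hresult(signed_value):
    """Split a signed 32-bit HRESULT into failure bit, facility, and code."""
    unsigned = signed_value & 0xFFFFFFFF  # reinterpret as unsigned 32-bit
    return {
        "hex": f"0x{unsigned:08X}",
        "failure": bool(unsigned >> 31),       # severity bit: set on failure
        "facility": (unsigned >> 16) & 0x7FF,  # 11-bit facility field
        "code": unsigned & 0xFFFF,             # low 16 bits: the error code
    }


if __name__ == "__main__":
    decoded = decode_hresult(-2147023174)  # value from the ETW trace above
    print(decoded["hex"])                         # prints: 0x800706BA
    print(decoded["facility"] == FACILITY_WIN32)  # prints: True
    print(hex(decoded["code"]))                   # prints: 0x6ba
```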

Can we do something to help investigate the problem?