microsoft / OMS-docker

Docker image for OMS (Operations Management Suite) Linux agent.

AKS: omsagent-win pods restarts again and again #436

Open chefcook opened 2 years ago

chefcook commented 2 years ago

omsagent-win is the pod in the kube-system namespace that ships with AKS when Azure insights is enabled. I run a hybrid cluster with both Windows and Linux nodes.

Output of kubectl get nodes:

NAME                            STATUS   ROLES   AGE     VERSION
aks-nplin-21116150-vmss000002   Ready    agent   2d21h   v1.21.2
aks-nplin-21116150-vmss000003   Ready    agent   2d21h   v1.21.2
aks-nplin-21116150-vmss000006   Ready    agent   2d21h   v1.21.2
aksnpwin000003                  Ready    agent   2d20h   v1.21.2
aksnpwin000004                  Ready    agent   2d20h   v1.21.2
aksnpwin000005                  Ready    agent   2d20h   v1.21.2

On the Linux nodes everything works fine:

NAME                                    READY   STATUS    RESTARTS   AGE     IP             NODE                            NOMINATED NODE   READINESS GATES
omsagent-xscbv                          2/2     Running   0          2d21h   10.240.1.108   aks-nplin-21116150-vmss000006   <none>           <none>
omsagent-k2zlx                          2/2     Running   0          2d21h   10.240.0.137   aks-nplin-21116150-vmss000002   <none>           <none>
omsagent-pzd4s                          2/2     Running   0          2d21h   10.240.0.79    aks-nplin-21116150-vmss000003   <none>           <none>

But on the Windows nodes the pods restart constantly. I have also checked the nodeSelector.

NAME                                    READY   STATUS    RESTARTS   AGE     IP             NODE                            NOMINATED NODE   READINESS GATES
omsagent-win-2vwqd                      1/1     Running   283        2d20h   10.240.2.64    aksnpwin000005                  <none>           <none>
omsagent-win-5kz2h                      1/1     Running   73         2d20h   10.240.1.178   aksnpwin000003                  <none>           <none>
omsagent-win-gmwk6                      1/1     Running   25         2d20h   10.240.1.46    aksnpwin000004                  <none>           <none>
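To compare restart counts per node without reading the table by eye, the listing can be parsed with a few lines of Python. This is a minimal sketch, not part of the original report: the function name `restarts_by_node` and the threshold of 10 are made up for illustration, the sample rows are copied from the output above with the last two columns trimmed, and it assumes the RESTARTS column holds a bare integer.

```python
# Rows copied from the listings above (NOMINATED NODE / READINESS GATES trimmed).
# In practice the text would come from `kubectl -n kube-system get pods -o wide`.
SAMPLE = """\
NAME                 READY   STATUS    RESTARTS   AGE     IP             NODE
omsagent-xscbv       2/2     Running   0          2d21h   10.240.1.108   aks-nplin-21116150-vmss000006
omsagent-win-2vwqd   1/1     Running   283        2d20h   10.240.2.64    aksnpwin000005
omsagent-win-5kz2h   1/1     Running   73         2d20h   10.240.1.178   aksnpwin000003
omsagent-win-gmwk6   1/1     Running   25         2d20h   10.240.1.46    aksnpwin000004
"""


def restarts_by_node(table_text, threshold=10):
    """Return {node: (pod, restarts)} for pods at or above the threshold.

    Hypothetical helper: splits each row on whitespace, so it only works
    while RESTARTS is a plain integer (no "(5m ago)" style suffix).
    """
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    header = lines[0].split()
    name_i = header.index("NAME")
    restarts_i = header.index("RESTARTS")
    node_i = header.index("NODE")  # first match, i.e. the real NODE column

    flagged = {}
    for ln in lines[1:]:
        cols = ln.split()
        restarts = int(cols[restarts_i])
        if restarts >= threshold:
            flagged[cols[node_i]] = (cols[name_i], restarts)
    return flagged


if __name__ == "__main__":
    for node, (pod, count) in sorted(restarts_by_node(SAMPLE).items()):
        print(f"{node}: {pod} restarted {count} times")
```

With the sample above, only the three Windows nodes are reported. The Linux pod stays below the threshold.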

Output of kubectl -n kube-system describe pod omsagent-win-2vwqd:

Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  10m (x950 over 2d20h)  kubelet  Liveness probe failed:
  Normal   Killing    19s (x708 over 2d20h)  kubelet  Container omsagent-win failed liveness probe, will be restarted
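The Age column of these events already carries a count and a time window, so a rough event rate can be derived from it. The sketch below is illustrative only: `duration_to_hours` and `events_per_hour` are hypothetical helper names, and it assumes the "(xN over DURATION)" form shown above. The kubelet aggregates event counts, so the result is an approximation.

```python
import re

# Hours per unit for the compact durations kubectl prints (e.g. "2d20h").
_UNIT_HOURS = {"d": 24.0, "h": 1.0, "m": 1.0 / 60, "s": 1.0 / 3600}


def duration_to_hours(text):
    """Convert a duration such as '2d20h' or '10m' to hours."""
    total = 0.0
    for value, unit in re.findall(r"(\d+)([dhms])", text):
        total += int(value) * _UNIT_HOURS[unit]
    return total


def events_per_hour(age_field):
    """Return the average events per hour for an Age field like
    '19s (x708 over 2d20h)', or None if the field has no count."""
    match = re.search(r"\(x(\d+) over ([0-9dhms]+)\)", age_field)
    if match is None:
        return None
    hours = duration_to_hours(match.group(2))
    if hours == 0:
        return None
    return int(match.group(1)) / hours


if __name__ == "__main__":
    for label, age in [("Unhealthy", "10m (x950 over 2d20h)"),
                       ("Killing", "19s (x708 over 2d20h)")]:
        print(f"{label}: {events_per_hour(age):.1f} events/hour")
```

For the Killing line above, 708 events over 68 hours works out to roughly 10 kills per hour on this one pod.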

I have already tried giving the pods more CPU and RAM. That worked at first, but after a while (about 30 minutes) the limits were reset to their original values.

Any ideas on how to examine this in a different way?

Thanks in advance!

austonli commented 2 years ago

This issue has been addressed. The fix is currently being rolled out and should be available by the end of next week. If you have more questions, or if the issue persists, please feel free to contact us at askcoin@microsoft.com

vap78 commented 1 year ago

Is this issue resolved?

We are on AKS 1.23.5 now and are still seeing a lot of omsagent-win restarts. The KubeEvents log contains many events like this:

(combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "e515bf191f6b7332e8f46c608e8d6f2be1f4a52dc4a4e5f4b9711c3957c1a3e0": hcs::System::CreateProcess 09a8a8f75482e1e512d5acee249f1ce00837fdd882501ab1c927b8148d4c1800: The RPC server is unavailable.: unknown
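When there are many such events, it can help to pull out the container ID and the underlying reason so they can be grouped. The following is a sketch under the assumption that the messages follow the exact shape of the one quoted above. The pattern and the name `parse_probe_error` are made up for illustration and are not part of any Kubernetes or containerd API.

```python
import re

# Message copied verbatim from the event above.
MESSAGE = (
    '(combined from similar events): Liveness probe errored: rpc error: '
    'code = Unknown desc = failed to exec in container: failed to start exec '
    '"e515bf191f6b7332e8f46c608e8d6f2be1f4a52dc4a4e5f4b9711c3957c1a3e0": '
    'hcs::System::CreateProcess '
    '09a8a8f75482e1e512d5acee249f1ce00837fdd882501ab1c927b8148d4c1800: '
    'The RPC server is unavailable.: unknown'
)

# Assumed layout: exec ID in quotes, then the container ID after
# "hcs::System::CreateProcess", then the reason, then a trailing ": unknown".
_PATTERN = re.compile(
    r'failed to start exec "(?P<exec_id>[0-9a-f]+)": '
    r'hcs::System::CreateProcess (?P<container_id>[0-9a-f]+): '
    r'(?P<reason>.+?): unknown$'
)


def parse_probe_error(message):
    """Return a dict with exec_id, container_id and reason, or None
    if the message does not match the assumed layout."""
    match = _PATTERN.search(message)
    return match.groupdict() if match else None


if __name__ == "__main__":
    info = parse_probe_error(MESSAGE)
    print(info["container_id"][:12], "->", info["reason"])
```

Note that this message says the probe "errored" while trying to exec in the container, which is a different wording from the "Liveness probe failed" events in the original report. Grouping by reason makes it easier to tell the two cases apart.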

We are using node image version AKSWindows-2019-containerd-17763.3046.220615 with runtime containerd://1.6.6+azure