microsoft / Windows-Containers

Welcome to our Windows Containers GitHub community! Ask questions, report bugs, and suggest features -- let's work together.

Hanging Pod sandboxes that time out during creation and fail to be cleaned up #430

Closed ibabou closed 7 months ago

ibabou commented 1 year ago

Describe the bug Pod sandbox creation hangs (state UNKNOWN): the hcs::CreateComputeSystem call times out during creation, and cleanup of the already-disconnected shims keeps failing.

To Reproduce It happens on nodes after they have been running for a while; it may coincide with a point of overload, but we haven't observed severe CPU/memory starvation when it occurs. It sometimes resolves on its own, and the sandboxes get cleaned up and replaced with successfully running ones.
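Since load is a suspected factor, below is a crude way to probe for contention at pod-creation time. This is only a sketch, not a confirmed repro: the pod names, count, and image are placeholders, and the image must match the node's Windows version.

# Create many Windows pods in parallel, then watch for sandboxes stuck in ContainerCreating/UNKNOWN
1..30 | ForEach-Object {
    kubectl run "repro-win-$_" --image=mcr.microsoft.com/windows/nanoserver:ltsc2019 --restart=Never `
        --overrides='{"apiVersion":"v1","spec":{"nodeSelector":{"kubernetes.io/os":"windows"}}}' `
        -- cmd /c "ping -t localhost"
}
kubectl get pods -o wide -w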

Expected behavior Timeouts are not expected, and cleanup should happen normally if a timeout does occur.

Configuration:

Additional context We spotted a case and collected Container Platform ETW streams while it was happening; the logs have been shared with @mikezappa87 (Michael Zappa).
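For reference, equivalent traces can be captured with logman. A minimal sketch, assuming the standard Microsoft-Windows-Host-Compute-Service and Microsoft-Windows-Host-Network-Service ETW providers are registered on the node (session name and output path are placeholders):

# Create a circular ETW session covering HCS and HNS
logman create trace ContainerPlatform -o C:\traces\containerplatform.etl -f bincirc -max 512
logman update trace ContainerPlatform -p "Microsoft-Windows-Host-Compute-Service" 0xFFFFFFFFFFFFFFFF 0xFF
logman update trace ContainerPlatform -p "Microsoft-Windows-Host-Network-Service" 0xFFFFFFFFFFFFFFFF 0xFF
logman start ContainerPlatform
# ...reproduce the hang, then:
logman stop ContainerPlatform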

Here is the observation from the captured instance, where it kept failing to create the sandbox:

ADD for this sandbox @ ~2023-09-25T18:27

DEFAULT 2023-09-25T18:27:01.352845400Z [resource.labels.nodeName: gke-a55193-nlfp] {"level":"debug","msg":"[cni-net] Processing ADD command with args {ContainerID:8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e Netns:598f84b4-bb5e-4479-9def-ccd2b7129fe1 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=samba;K8S_POD_NAME=smb-server-monitoring-csi-cron-win-28261106-72b9j;K8S_POD_INFRA_CONTAINER_ID=8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e;K8S_POD_UID=a01fd019-7a48-4255-9a3b-1928f8b5958f Path:C:\\etc\\kubernetes\\cni}.","time":"2023-09-25T18:27:00Z"}
DEFAULT 2023-09-25T18:27:01.352850600Z [resource.labels.nodeName: gke-a55193-nlfp] {"level":"debug","msg":"hcn::HostComputeEndpoint::Create JSON: {\"Name\":\"8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e_l2bridge\",\"HostComputeNetwork\":\"2a259ee0-97d1-401c-b7d5-93ab6f135599\",\"Policies\":[{\"Type\":\"OutBoundNAT\",\"Settings\":{\"Exceptions\":[\"169.254.0.0/16\",\"10.0.0.0/8\",\"172.16.0.0/12\",\"192.168.0.0/16\",\"100.64.0.0/10\",\"192.0.0.0/24\",\"192.0.2.0/24\",\"192.88.99.0/24\",\"198.18.0.0/15\",\"198.51.100.0/24\",\"203.0.113.0/24\",\"240.0.0.0/4\"]}},{\"Type\":\"SDNRoute\",\"Settings\":{\"DestinationPrefix\":\"172.31.0.0/16\",\"NeedEncap\":true}},{\"Type\":\"SDNRoute\",\"Settings\":{\"DestinationPrefix\":\"172.23.160.187/32\",\"NeedEncap\":true}}],\"Dns\":{\"Search\":[\"samba.svc.cluster.local\",\"svc.cluster.local\",\"cluster.local\",\"c.cpl-sta-l-app-devlab-01.internal\",\"google.internal\",\"us-west1-c.c.cpl-sta-l-app-devlab-01.internal\"],\"ServerList\":[\"172.31.0.10\"],\"Options\":[\"ndots:5\"]},\"Routes\":[{\"NextHop\":\"172.30.2.2\",\"DestinationPrefix\":\"0.0.0.0/0\"}],\"SchemaVersion\":{\"Major\":2,\"Minor\":0}}","time":"2023-09-25T18:27:00Z"}

The timeout (after ~4 minutes):

DEFAULT 2023-09-25T18:31:00.867115200Z [resource.labels.nodeName: gke-a55193-nlfp] time="2023-09-25T18:27:01.254912500Z" level=error msg=Span cid=8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e duration=3m59.4140272s endTime="2023-09-25 18:31:00.6689397 +0000 GMT m=+239.527872601" error="hcs::CreateComputeSystem 8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e: context deadline exceeded" name="hcs::CreateComputeSystem" parentSpanID=e359bc276c74eb0d spanID=d33bf8ac720521a9 startTime="2023-09-25 18:27:01.2549125 +0000 GMT m=+0.113845401" traceID=4d12b06b9b2d3045f06e86bfe23ab12e
DEFAULT 2023-09-25T18:31:00.867116400Z [resource.labels.nodeName: gke-a55193-nlfp] time="2023-09-25T18:31:00.688577900Z" level=error msg=Span duration="57.3µs" eid= endTime="2023-09-25 18:31:00.6886352 +0000 GMT m=+239.529795401" error="rpc error: code = NotFound desc = task with id: '8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e' not found: not found" name=Delete parentSpanID=3076f3e5a3103b66 pod-id=8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e spanID=4f34474eb469ee4f startTime="2023-09-25 18:31:00.6885779 +0000 GMT m=+239.529738101" tid=8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e traceID=620a26d87a8ca6494c6d3a904601f212

DEFAULT 2023-09-25T18:31:01.118121700Z [resource.labels.nodeName: gke-a55193-nlfp] time="2023-09-25T18:31:00.691410300Z" level=info msg="shim disconnected" id=8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e
DEFAULT 2023-09-25T18:31:01.118122100Z [resource.labels.nodeName: gke-a55193-nlfp] time="2023-09-25T18:31:00.691547900Z" level=warning msg="cleaning up after shim disconnected" id=8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e namespace=k8s.io

It then errors out as it fails to delete/clean up the shim:

DEFAULT 2023-09-25T18:31:05.877264800Z [resource.labels.nodeName: gke-a55193-nlfp] time="2023-09-25T18:31:05.705075200Z" level=error msg="failed to delete shim" error="1 error occurred:\n\t* close wait error: context deadline exceeded\n\n" id=8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e
DEFAULT 2023-09-25T18:31:05.877267600Z [resource.labels.nodeName: gke-a55193-nlfp] time="2023-09-25T18:31:05.711928200Z" level=error msg="failed to delete" cmd="C:\\etc\\kubernetes\\node\\bin\\containerd-shim-runhcs-v1.exe -namespace k8s.io -address \\\\.\\pipe\\containerd-containerd -publish-binary C:\\etc\\kubernetes\\node\\bin\\containerd.exe -id 8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e -bundle C:\\ProgramData\\containerd\\state\\io.containerd.runtime.v2.task\\k8s.io\\8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e delete" error="exit status 1"
DEFAULT 2023-09-25T18:31:05.877267900Z [resource.labels.nodeName: gke-a55193-nlfp] time="2023-09-25T18:31:05.711928200Z" level=warning msg="failed to clean up after shim disconnected" error="time=\"2023-09-25T18:31:00Z\" level=info msg=Span duration=3.903ms endTime=\"2023-09-25 18:31:00.7236973 +0000 GMT m=+0.011155201\" name=HcsOpenComputeSystem parentSpanID=572166ce62a49988 spanID=209b1bf3bf0cca9d startTime=\"2023-09-25 18:31:00.7197943 +0000 GMT m=+0.007252201\" traceID=be558e7d9f1d187e3aea631df78e85c0\ntime=\"2023-09-25T18:31:00Z\" level=info msg=Span duration=0s endTime=\"2023-09-25 18:31:00.7236976 +0000 GMT m=+0.011155201\" name=HcsRegisterComputeSystemCallback parentSpanID=572166ce62a49988 spanID=e914d1171d75fab2 startTime=\"2023-09-25 18:31:00.7236976 +0000 GMT m=+0.011155201\" traceID=be558e7d9f1d187e3aea631df78e85c0\ntime=\"2023-09-25T18:31:00Z\" level=info msg=Span duration=\"978.1µs\" endTime=\"2023-09-25 18:31:00.7246757 +0000 GMT m=+0.012133301\" name=HcsGetComputeSystemProperties parentSpanID=572166ce62a49988 propertyQuery=\"{}\" spanID=7b083c61d7455bdd startTime=\"2023-09-25 18:31:00.7236976 +0000 GMT m=+0.011155201\" traceID=be558e7d9f1d187e3aea631df78e85c0\ntime=\"2023-09-25T18:31:00Z\" level=info msg=Span duration=0s endTime=\"2023-09-25 18:31:00.7246758 +0000 GMT m=+0.012133301\" name=HcsTerminateComputeSystem options= parentSpanID=572166ce62a49988 spanID=4482f18534ae8bf2 startTime=\"2023-09-25 18:31:00.7246758 +0000 GMT m=+0.012133301\" traceID=be558e7d9f1d187e3aea631df78e85c0\n: exit status 1" id=8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e namespace=k8s.io

Cleanup error:

C:\\etc\\kubernetes\\node\\bin\\containerd-shim-runhcs-v1.exe -namespace k8s.io -address \\\\.\\pipe\\containerd-containerd -publish-binary C:\\etc\\kubernetes\\node\\bin\\containerd.exe -id 8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e -bundle C:\\ProgramData\\containerd\\state\\io.containerd.runtime.v2.task\\k8s.io\\8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e delete --> exit code 1

Subsequent calls to recreate the sandbox fail because the name is still reserved (it seems the failed sandbox has still not been removed):

ERROR 2023-09-25T18:31:13.346487Z [resource.labels.nodeName: gke-a55193-nlfp] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to reserve sandbox name \"smb-server-monitoring-csi-cron-win-28261106-72b9j_samba_a01fd019-7a48-4255-9a3b-1928f8b5958f_0\": name \"smb-server-monitoring-csi-cron-win-28261106-72b9j_samba_a01fd019-7a48-4255-9a3b-1928f8b5958f_0\" is reserved for \"8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e\"" pod="samba/smb-server-monitoring-csi-cron-win-28261106-72b9j"
ERROR 2023-09-25T18:31:13.346487Z [resource.labels.nodeName: gke-a55193-nlfp] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to reserve sandbox name \"smb-server-monitoring-csi-cron-win-28261106-72b9j_samba_a01fd019-7a48-4255-9a3b-1928f8b5958f_0\": name \"smb-server-monitoring-csi-cron-win-28261106-72b9j_samba_a01fd019-7a48-4255-9a3b-1928f8b5958f_0\" is reserved for \"8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e\""
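For reference, the name reservation is held by the stuck sandbox at the CRI level. Below is a sketch of inspecting and force-removing it with crictl, assuming crictl is pointed at containerd's named pipe (in this instance the underlying delete kept failing, so these calls loop on the same error, but they narrow down where the reservation lives):

# List not-ready sandboxes; the stuck one shows up here with its ID
crictl --runtime-endpoint "npipe:////./pipe/containerd-containerd" pods --state NotReady
# Force-stop and remove it to release the reserved name (ID prefix taken from the logs above)
crictl --runtime-endpoint "npipe:////./pipe/containerd-containerd" stopp 8e8bbc9d6b1f
crictl --runtime-endpoint "npipe:////./pipe/containerd-containerd" rmp -f 8e8bbc9d6b1f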

Attempts to stop the existing sandbox fail as well; it seems it can't clean up the network:

DEFAULT 2023-09-25T18:32:05.878231700Z [resource.labels.nodeName: gke-a55193-nlfp] time="2023-09-25T18:32:05.748120200Z" level=error msg="Failed to destroy network for sandbox \"8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e\"" error="plugin type=\"sdnbridge\" name=\"l2bridge\" failed (delete): netplugin failed: \"{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[cni-net] Plugin wcn-net version .\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[net] Network interface: {Index:8 MTU:1500 Name:vEthernet (Ethernet) HardwareAddr:42:01:ac:17:a0:bb Flags:up|broadcast|multicast} with IP addresses: [fe80::b9e2:c457:141d:f89a/64 172.23.160.187/23]\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[net] Network interface: {Index:15 MTU:1500 Name:vEthernet (cbr0) HardwareAddr:00:15:5d:9e:0f:e1 Flags:up|broadcast|multicast} with IP addresses: [fe80::eb01:b4d0:6a0:4505/64 172.30.2.2/24]\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP addresses: [::1/128 127.0.0.1/8]\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[cni-net] Plugin started.\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[cni-net] Processing DEL command with args {ContainerID:8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e Netns:598f84b4-bb5e-4479-9def-ccd2b7129fe1 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=samba;K8S_POD_NAME=smb-server-monitoring-csi-cron-win-28261106-72b9j;K8S_POD_INFRA_CONTAINER_ID=8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e;K8S_POD_UID=a01fd019-7a48-4255-9a3b-1928f8b5958f Path:C:\\\\\\\\etc\\\\\\\\kubernetes\\\\\\\\cni}\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[cni-net] Read network configuration \\\\u0026{CniVersion:0.2.0 Name:l2bridge Type:sdnbridge Ipam:{Type: Environment: AddrSpace: Subnet:172.30.2.0/24 Address: QueryInterval: Routes:[{Dst:{IP:\\\\u003cnil\\\\u003e Mask:\\\\u003cnil\\\\u003e} GW:172.30.2.2}]} DNS:{Nameservers:[172.31.0.10] Domain: Search:[cluster.local] Options:[]} OptionalFlags:{LocalRoutePortMapping:false AllowAclPortMapping:false ForceBridgeGateway:false} RuntimeConfig:{PortMappings:[] DNS:{Servers:[172.31.0.10] Searches:[samba.svc.cluster.local svc.cluster.local cluster.local c.cpl-st…
DEFAULT 2023-09-25T18:32:05.878236300Z [resource.labels.nodeName: gke-a55193-nlfp] time="2023-09-25T18:32:05.748120200Z" level=error msg="encountered an error cleaning up failed sandbox \"8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e\", marking sandbox state as SANDBOX_UNKNOWN" error="plugin type=\"sdnbridge\" name=\"l2bridge\" failed (delete): netplugin failed: \"{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[cni-net] Plugin wcn-net version .\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[net] Network interface: {Index:8 MTU:1500 Name:vEthernet (Ethernet) HardwareAddr:42:01:ac:17:a0:bb Flags:up|broadcast|multicast} with IP addresses: [fe80::b9e2:c457:141d:f89a/64 172.23.160.187/23]\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[net] Network interface: {Index:15 MTU:1500 Name:vEthernet (cbr0) HardwareAddr:00:15:5d:9e:0f:e1 Flags:up|broadcast|multicast} with IP addresses: [fe80::eb01:b4d0:6a0:4505/64 172.30.2.2/24]\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP addresses: [::1/128 127.0.0.1/8]\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[cni-net] Plugin started.\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[cni-net] Processing DEL command with args {ContainerID:8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e Netns:598f84b4-bb5e-4479-9def-ccd2b7129fe1 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=samba;K8S_POD_NAME=smb-server-monitoring-csi-cron-win-28261106-72b9j;K8S_POD_INFRA_CONTAINER_ID=8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e;K8S_POD_UID=a01fd019-7a48-4255-9a3b-1928f8b5958f Path:C:\\\\\\\\etc\\\\\\\\kubernetes\\\\\\\\cni}\\\",\\\"time\\\":\\\"2023-09-25T18:31:05Z\\\"}\\n{\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"[cni-net] Read network configuration \\\\u0026{CniVersion:0.2.0 Name:l2bridge Type:sdnbridge Ipam:{Type: Environment: AddrSpace: Subnet:172.30.2.0/24 Address: QueryInterval: Routes:[{Dst:{IP:\\\\u003cnil\\\\u003e Mask:\\\\u003cnil\\\\u003e} GW:172.30.2.2}]} DNS:{Nameservers:[172.31.0.10] Domain: Search:[cluster.local] Options:[]} OptionalFlags:{LocalRoutePortMapping:false AllowAclPortMapping:false ForceBridgeGateway:false} RuntimeConfig:{PortMappings:[] DNS:{Servers:[172.31.0.10] Searches:[samba.svc.cl…
DEFAULT 2023-09-25T19:03:26.785284Z [resource.labels.nodeName: gke-a55193-nlfp] time="2023-09-25T19:03:26.658057400Z" level=info msg="StopPodSandbox for \"8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e\""
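For anyone chasing a similar loop: the DEL path above dies inside the hcn::HostComputeNamespace calls, so the leaked HNS objects can be inspected directly. A sketch, assuming the hns.psm1 helpers from https://github.com/microsoft/SDN are available on the node (the IDs are the ones from the DEL logs above):

# hns.psm1 provides Get-HnsEndpoint / Get-HnsNamespace / Remove-HnsEndpoint
Import-Module .\hns.psm1
Get-HnsEndpoint | Where-Object { $_.ID -eq '52f6681f-c61e-4dc9-b74b-35b5980e4c0c' }
Get-HnsNamespace | Where-Object { $_.ID -eq '598f84b4-bb5e-4479-9def-ccd2b7129fe1' }
# If the endpoint turns out to be orphaned, removing it by hand may let cleanup proceed (unverified here)
Get-HnsEndpoint | Where-Object { $_.ID -eq '52f6681f-c61e-4dc9-b74b-35b5980e4c0c' } | Remove-HnsEndpoint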

It stays in a loop, with StopSandbox and CreateSandbox calls both failing in a similar way, and the sandbox remains in UNKNOWN state:

msg="encountered an error cleaning up failed sandbox \"8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e\", marking sandbox state as SANDBOX_UNKNOWN

Error with timeout (@ 18:31:00):

"Syscall did not complete within operation timeout. This may indicate a platform issue. If it appears to be making no forward progress, obtain the stacks and see if there is a syscall stuck in the platform API for a significant length of time.", "2023-09-25T18:31:00.686362700Z", "f2f94b15835ed492",     240000000000, "4d12b06b9b2d3045f06e86bfe23ab12e"
ibabou commented 1 year ago

Hey @marosset, a gentle ping: are there any updates?

fady-azmy-msft commented 11 months ago

Thanks for flagging this @ibabou. How frequently have you seen this timeout issue occur? And does it affect one pod creation at a time, or multiple pod creations on the same node over a given period of time?

kiashok commented 11 months ago

Looking into this. The issue being described here seems similar to https://github.com/kubernetes/kubernetes/issues/107561

kiashok commented 11 months ago

@ibabou what CNI and version is being used? How many pods/containers does the node have?

ibabou commented 11 months ago

Thanks for looking into this!

@fady-azmy-msft, the customers see the issue maybe once every 2-3 weeks, though we might have missed occurrences that happened in other clusters and went unnoticed. Regarding the occurrences: from what I have noticed, it affects one pod, which then stays stuck as described above for a long period. The attempts to recreate the sandbox keep failing, although other pods seem to schedule fine or auto-recover on recreation.

@kiashok Agreed, the behaviour matches what James describes in (2) here. I see the same log, with the initial syscall timeout followed by the failures to delete the shim or recreate the sandbox. Tagging @jsturtevant here as well. About the number of pods: I don't have an exact number, as this cluster/nodepool is no longer available, but I noticed that other pods were being created without getting into this stuck state. There was no CPU or memory starvation (not sure whether multiple pods being created simultaneously could be a trigger).

The SDN CNI used is v0.2; it's built up to this commit.

kiashok commented 11 months ago

@ibabou we are continuing to take a look. Will keep you posted on the updates.

Primarily I'm seeing two main failures in the logs you shared with @MikeZappa87, and I suspect this could be some contention happening at scale, although I am not sure of that just yet. The networking team is also taking a look. Do you have any repro steps you could share?

[Wednesday 12:06 PM] Kirtana Ashok plugin type="sdnbridge" name="l2bridge" failed (delete): netplugin failed: "{\"level\":\"debug\",\"msg\":\"[cni-net] Plugin wcn-net version .\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"[net] Network interface: {Index:8 MTU:1500 Name:vEthernet (Ethernet) HardwareAddr:42:01:ac:17:a0:bb Flags:up|broadcast|multicast} with IP addresses: [fe80::b9e2:c457:141d:f89a/64 172.23.160.187/23]\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"[net] Network interface: {Index:15 MTU:1500 Name:vEthernet (cbr0) HardwareAddr:00:15:5d:9e:0f:e1 Flags:up|broadcast|multicast} with IP addresses: [fe80::eb01:b4d0:6a0:4505/64 172.30.2.2/24]\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"[net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP addresses: [::1/128 127.0.0.1/8]\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"[cni-net] Plugin started.\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"[cni-net] Processing DEL command with args {ContainerID:8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e Netns:598f84b4-bb5e-4479-9def-ccd2b7129fe1 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=samba;K8S_POD_NAME=smb-server-monitoring-csi-cron-win-28261106-72b9j;K8S_POD_INFRA_CONTAINER_ID=8e8bbc9d6b1f6ee414744487b6a4a20df7cb3440131061f98a3a566496080c0e;K8S_POD_UID=a01fd019-7a48-4255-9a3b-1928f8b5958f Path:C:\\etc\\kubernetes\\cni}\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"[cni-net] Read network configuration \u0026{CniVersion:0.2.0 Name:l2bridge Type:sdnbridge Ipam:{Type: Environment: AddrSpace: Subnet:172.30.2.0/24 Address: QueryInterval: Routes:[{Dst:{IP:\u003cnil\u003e Mask:\u003cnil\u003e} GW:172.30.2.2}]} DNS:{Nameservers:[172.31.0.10] Domain: Search:[cluster.local] Options:[]} OptionalFlags:{LocalRoutePortMapping:false AllowAclPortMapping:false ForceBridgeGateway:false} RuntimeConfig:{PortMappings:[] DNS:{Servers:[172.31.0.10] Searches:[samba.svc.cluster.local svc.cluster.local cluster.local c.cpl-sta-l-app-devlab-01.internal google.internal us-west1-c.c.cpl-sta-l-app-devlab-01.internal] Options:[ndots:5]}} AdditionalArgs:[{Name:EndpointPolicy Value:[123 34 83 101 116 116 105 110 103 115 34 58 123 34 69 120 99 101 112 116 105 111 110 115 34 58 91 34 49 54 57 46 50 53 52 46 48 46 48 47 49 54 34 44 34 49 48 46 48 46 48 46 48 47 56 34 44 34 49 55 50 46 49 54 46 48 46 48 47 49 50 34 44 34 49 57 50 46 49 54 56 46 48 46 48 47 49 54 34 44 34 49 48 48 46 54 52 46 48 46 48 47 49 48 34 44 34 49 57 50 46 48 46 48 46 48 47 50 52 34 44 34 49 57 50 46 48 46 50 46 48 47 50 52 34 44 34 49 57 50 46 56 56 46 57 57 46 48 47 50 52 34 44 34 49 57 56 46 49 56 46 48 46 48 47 49 53 34 44 34 49 57 56 46 53 49 46 49 48 48 46 48 47 50 52 34 44 34 50 48 51 46 48 46 49 49 51 46 48 47 50 52 34 44 34 50 52 48 46 48 46 48 46 48 47 52 34 93 125 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125]} {Name:EndpointPolicy Value:[123 34 83 101 116 116 105 110 103 115 34 58 123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 55 50 46 51 49 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 125 44 34 84 121 112 101 34 58 34 83 68 78 82 111 117 116 101 34 125]} {Name:EndpointPolicy Value:[123 34 83 101 116 116 105 110 103 115 34 58 123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 55 
50 46 50 51 46 49 54 48 46 49 56 55 47 51 50 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 125 44 34 84 121 112 101 34 58 34 83 68 78 82 111 117 116 101 34 125]}]}.\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"Substituting RuntimeConfig DNS Nameservers: [172.31.0.10]\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"Substituting RuntimeConfig DNS Search: [samba.svc.cluster.local svc.cluster.local cluster.local c.cpl-sta-l-app-devlab-01.internal google.internal us-west1-c.c.cpl-sta-l-app-devlab-01.internal]\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"Substituting RuntimeConfig DNS Options: [ndots:5]\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"Parsing port mappings from []\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"hcn::HostComputeNamespace::RemoveNamespaceEndpoint id=52f6681f-c61e-4dc9-b74b-35b5980e4c0c\",\"time\":\"2023-09-25T18:31:05Z\"}\n{\"level\":\"debug\",\"msg\":\"hcn::HostComputeNamespace::ModifyNamespaceSettings id=598f84b4-bb5e-4479-9def-ccd2b7129fe1\",\"time\":\"2023-09-25T18:31:05Z\"}\n"

failed to destroy network for sandbox "52ad56234561203338af081d24262de1a9b432456c942569f80d2459c40b831f": plugin type="sdnbridge" name="l2bridge" failed (delete): netplugin failed: "{\"level\":\"debug\",\"msg\":\"[cni-net] Plugin wcn-net version .\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"[net] Network interface: {Index:8 MTU:1500 Name:vEthernet (Ethernet) HardwareAddr:42:01:ac:17:a0:bb Flags:up|broadcast|multicast} with IP addresses: [fe80::b9e2:c457:141d:f89a/64 172.23.160.187/23]\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"[net] Network interface: {Index:15 MTU:1500 Name:vEthernet (cbr0) HardwareAddr:00:15:5d:9e:0f:e1 Flags:up|broadcast|multicast} with IP addresses: [fe80::eb01:b4d0:6a0:4505/64 172.30.2.2/24]\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"[net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP addresses: [::1/128 127.0.0.1/8]\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"[cni-net] Plugin started.\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"[cni-net] Processing DEL command with args {ContainerID:52ad56234561203338af081d24262de1a9b432456c942569f80d2459c40b831f Netns:7c806792-8457-4c13-a8c2-fc8a9c476d73 IfName:eth0 Args:K8S_POD_NAME=smb-server-monitoring-csi-cron-win-28261076-7jcsb;K8S_POD_INFRA_CONTAINER_ID=52ad56234561203338af081d24262de1a9b432456c942569f80d2459c40b831f;K8S_POD_UID=7f0ce019-37c7-4098-aee6-51936bbdfd88;IgnoreUnknown=1;K8S_POD_NAMESPACE=samba Path:C:\\etc\\kubernetes\\cni}\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"[cni-net] Read network configuration \u0026{CniVersion:0.2.0 Name:l2bridge Type:sdnbridge Ipam:{Type: Environment: AddrSpace: Subnet:172.30.2.0/24 Address: QueryInterval: Routes:[{Dst:{IP:\u003cnil\u003e Mask:\u003cnil\u003e} GW:172.30.2.2}]} DNS:{Nameservers:[172.31.0.10] Domain: Search:[cluster.local] Options:[]} OptionalFlags:{LocalRoutePortMapping:false AllowAclPortMapping:false ForceBridgeGateway:false} RuntimeConfig:{PortMappings:[] DNS:{Servers:[172.31.0.10] Searches:[samba.svc.cluster.local svc.cluster.local cluster.local c.cpl-sta-l-app-devlab-01.internal google.internal us-west1-c.c.cpl-sta-l-app-devlab-01.internal] Options:[ndots:5]}} AdditionalArgs:[{Name:EndpointPolicy Value:[123 34 83 101 116 116 105 110 103 115 34 58 123 34 69 120 99 101 112 116 105 111 110 115 34 58 91 34 49 54 57 46 50 53 52 46 48 46 48 47 49 54 34 44 34 49 48 46 48 46 48 46 48 47 56 34 44 34 49 55 50 46 49 54 46 48 46 48 47 49 50 34 44 34 49 57 50 46 49 54 56 46 48 46 48 47 49 54 34 44 34 49 48 48 46 54 52 46 48 46 48 47 49 48 34 44 34 49 57 50 46 48 46 48 46 48 47 50 52 34 44 34 49 57 50 46 48 46 50 46 48 47 50 52 34 44 34 49 57 50 46 56 56 46 57 57 46 48 47 50 52 34 44 34 49 57 56 46 49 56 46 48 46 48 47 49 53 34 44 34 49 57 56 46 53 49 46 49 48 48 46 48 47 50 52 34 44 34 50 48 51 46 48 46 49 49 51 46 48 47 50 52 34 44 34 50 52 48 46 48 46 48 46 48 47 52 34 93 125 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125]} {Name:EndpointPolicy Value:[123 34 83 101 116 116 105 110 103 115 34 58 123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 55 50 46 51 49 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 125 44 34 84 121 112 101 34 58 34 83 68 78 82 111 117 116 101 34 125]} {Name:EndpointPolicy Value:[123 34 83 101 116 116 105 110 103 115 34 58 123 34 68 101 
115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 55 50 46 50 51 46 49 54 48 46 49 56 55 47 51 50 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 125 44 34 84 121 112 101 34 58 34 83 68 78 82 111 117 116 101 34 125]}]}.\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"Substituting RuntimeConfig DNS Nameservers: [172.31.0.10]\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"Substituting RuntimeConfig DNS Search: [samba.svc.cluster.local svc.cluster.local cluster.local c.cpl-sta-l-app-devlab-01.internal google.internal us-west1-c.c.cpl-sta-l-app-devlab-01.internal]\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"Substituting RuntimeConfig DNS Options: [ndots:5]\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"Parsing port mappings from []\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"hcn::HostComputeNamespace::RemoveNamespaceEndpoint id=2a352669-639b-4d85-9b4d-7b459a39b19e\",\"time\":\"2023-09-25T18:30:26Z\"}\n{\"level\":\"debug\",\"msg\":\"hcn::HostComputeNamespace::ModifyNamespaceSettings id=7c806792-8457-4c13-a8c2-fc8a9c476d73\",\"time\":\"2023-09-25T18:30:26Z\"}\n"

kiashok commented 11 months ago

@ibabou just out of curiosity, is this issue being hit with an updated WS2019 version and containerd/1.7? From the previous GitHub issues I linked, it seems folks didn't run into these issues with containerd/1.7. We will continue to investigate the bug seen with the current configuration the customer is on, but if they have the bandwidth to test with containerd/1.7, it might be helpful!

ibabou commented 11 months ago

@kiashok We don't have an easy repro unfortunately; it seems to happen on the customer's cluster from time to time. Is there any additional debug info we can ask them to enable/collect, so we're ready in case another occurrence happens?
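One thing we could pre-stage on the affected nodes ourselves (a sketch, assuming the default containerd config path on these node images) is debug-level containerd logging, alongside an ETW session like the one mentioned above:

# In containerd's config.toml (path varies per node image), raise the log level:
[debug]
  level = "debug"

# then restart the service to pick it up
Restart-Service containerd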

With regard to the containerd version: no, all clusters are running on 1.6.x; we haven't moved to containerd 1.7 yet. Are there specific changes in containerd or hcsshim that could explain this not occurring on 1.7 compared to 1.6?

kiashok commented 11 months ago

I've seen density runs (high parallel pod creation) improve with Kubernetes 1.28 and containerd/1.7. I haven't dug into which set of commits might be helping with this.

kiashok commented 11 months ago

@ibabou by SDN CNI do you mean the Azure CNI, or do you mean sdnbridge? Which CNI is the customer using?

ibabou commented 11 months ago

@kiashok no, that's not the Azure CNI. This runs on GKE, and the sdnbridge CNI is used: https://github.com/microsoft/windows-container-networking.

The CNI config settings are as follows:

"cniVersion": "0.2.0",
"name": "l2bridge",
"type": "sdnbridge",
"master": "Ethernet",
"capabilities": { "portMappings": true, "dns": true },

ibabou commented 11 months ago

Hey @kiashok, any further updates on the investigation?

ntrappe-msft commented 11 months ago

🔖 ADO 47828142

microsoft-github-policy-service[bot] commented 9 months ago

This issue has been open for 30 days with no updates. @kiashok, please provide an update or close this issue.

microsoft-github-policy-service[bot] commented 8 months ago

This issue has been open for 30 days with no updates. @kiashok, please provide an update or close this issue.

fady-azmy-msft commented 8 months ago

Apologies for the radio silence, @ibabou. Is this still an issue you're hitting?

If so, could you share which K8s version the customer is on, and whether they can reproduce the issue on a WS2022 version? When we looked at this at some point, we suspected it might be related to HNS, to which we have shipped a lot of fixes in WS2022.

Also, as @kiashok mentioned in the thread, we have seen better pod density with 1.28 and containerd 1.7, and upgrading may reduce the overall frequency of this issue in case it's related to resource density.

ibabou commented 8 months ago

Hey @fady-azmy-msft, thanks for circling back on this! Yeah, the customer is still experiencing the issue, especially at high packing density. We have tried containerd 1.6.24 but haven't seen improvements either; containerd 1.7 is still out of scope, but we're considering testing with it.

I agree that HNS seems to be the suspect here. I'll follow up on Windows Server 2022 usage and ask our team to validate whether we see improvements compared to WS2019 LTSC.

fady-azmy-msft commented 8 months ago

That sounds great @ibabou. I'll wait to hear from you on whether WS2022 still triggers the issue.

microsoft-github-policy-service[bot] commented 7 months ago

This issue has been open for 30 days with no updates. @kiashok, please provide an update or close this issue.

fady-azmy-msft commented 7 months ago

Hey @ibabou, I'm going to close this issue since it's been a while since the repro; however, once this occurs again, please reopen the issue and we'll look at it. Appreciate you highlighting this to us.