microsoft / Windows-Containers

Welcome to our Windows Containers GitHub community! Ask questions, report bugs, and suggest features -- let's work together.
MIT License
381 stars 59 forks source link

Requesting elaboration for HCN error 2151350293 (0x803b0015) #485

Open tzifudzi opened 2 months ago

tzifudzi commented 2 months ago

Summary

I am requesting more details about the HCN error code 2151350293 (0x803b0015). In a scenario I am facing, this error is returned during the HcnDeleteNamespace operation when invoked by containerd via hcsshim.

To help me troubleshoot an ongoing issue and perhaps also benefit the broader community,

Further detail

When rapidly scaling containers (up and down) on a Kubernetes Windows node running containerd, I am sporadically encountering errors while trying to terminate Kubernetes pods. The pods get stuck in terminating status while the containerd tries to remove the network namespace using the HcnDeleteNamespace operation. The error message in the logs is as follows:

hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) 
{
  "Success": false,
  "Error": "The specified request is unsupported.",
  "ErrorCode": 2151350293
}

More context

Current theories being explored for which more error verbosity can help rule out

Environment

Similar Issues

These issues cite the same error code might benefit from this issue being responded to

grcusanz commented 2 months ago

@tzifudzi there are two likely conditions that can result in HCN_E_REQUEST_UNSUPPORTED to get returned when deleting a Namespace:

  1. The Guid specified for the compartment does not have the correct Guid syntax as required by UuidFromString.
  2. The namespace still has one or more active container attached, but in this case the error text should say it's because there are "active containers using the namespace". It's possible that the code gets returned by some other function and is passed through, but in my review I didn't see any places where that was likely.

Can you provide some more output from the logs before and after that message? That might help me narrow down exactly where this came from.

VincentVTran commented 2 months ago

Hey @grcusanz, thanks for the response! Regarding the request for additional logs, the error that we retrieved that error from was found in the containerd logs. Below is a larger snippet of the error:

time="2024-04-12T19:57:29.990801800Z" level=info msg="TearDown network for sandbox \"a4d543b8da2dc3d287983f5620467a33325bd0bc9eeb41bf3112cc4c84a237b1\" successfully"
time="2024-04-12T19:57:29.996372800Z" level=debug msg="event published" ns=k8s.io topic=/snapshot/remove type=containerd.events.SnapshotRemove
time="2024-04-12T19:57:30.001077700Z" level=debug msg="event published" ns=k8s.io topic=/containers/delete type=containerd.events.ContainerDelete
time="2024-04-12T19:57:30.001077700Z" level=info msg="RemovePodSandbox \"a4d543b8da2dc3d287983f5620467a33325bd0bc9eeb41bf3112cc4c84a237b1\" returns successfully"
time="2024-04-12T19:57:31.129178800Z" level=debug msg="schedule snapshotter cleanup" snapshotter=windows
time="2024-04-12T19:57:31.135682500Z" level=debug msg="removed snapshot" key=k8s.io/1032/41af9a400affa232040807aa5ffa7ee41dbf26fbc38948e522ae163206e51895 snapshotter=windows
time="2024-04-12T19:57:31.142322200Z" level=debug msg="removed snapshot" key=k8s.io/1039/a4d543b8da2dc3d287983f5620467a33325bd0bc9eeb41bf3112cc4c84a237b1 snapshotter=windows
time="2024-04-12T19:57:31.147351500Z" level=debug msg="removed snapshot" key=k8s.io/1042/9d1e037335545f093b77d024f52169e8f09d50b6c6c8b806fb3defb87a1420e3 snapshotter=windows
time="2024-04-12T19:57:31.153209500Z" level=debug msg="removed snapshot" key=k8s.io/1043/05a92af9763e79e99b3503cadd01776521264d301ba80adcead32693f25f3500 snapshotter=windows
time="2024-04-12T19:57:31.157864600Z" level=debug msg="removed snapshot" key=k8s.io/1044/8e360f7707b8e17294120e2d5624d59175a239cbe18a8ed6fefd20033468a3d0 snapshotter=windows
time="2024-04-12T19:57:31.163708100Z" level=debug msg="removed snapshot" key=k8s.io/1045/c46e060d19f293ed949b3961ddbe567fa06be360a18007db2d2b5593f9d1f1bb snapshotter=windows
time="2024-04-12T19:57:31.168566100Z" level=debug msg="removed snapshot" key=k8s.io/1046/3110b381d062a44a4030cec60c90fcc9ecd655699af86719b3e5766609ea2953 snapshotter=windows
time="2024-04-12T19:57:31.174401200Z" level=debug msg="removed snapshot" key=k8s.io/1048/3824d800bb125e33451564a2835f3663b4f796bdc7637fca1692d46a5fb5367d snapshotter=windows
time="2024-04-12T19:57:31.179292100Z" level=debug msg="removed snapshot" key=k8s.io/1049/2fb021b59b6cd8f6f56617b6721bf246baeb79f7cdcb90ed48665e7c58feaf8c snapshotter=windows
time="2024-04-12T19:57:31.183533900Z" level=debug msg="removed snapshot" key=k8s.io/1050/8354af5e0577bec700ef447f6e3d9b284efb9d88aae296b695361fa892b8b468 snapshotter=windows
time="2024-04-12T19:57:31.190051300Z" level=debug msg="removed snapshot" key=k8s.io/1051/5a53275dc0a6ecaecc040b355988e8bcdbf7e02cb5c52e5e727a960c663d07d8 snapshotter=windows
time="2024-04-12T19:57:31.194739900Z" level=debug msg="removed snapshot" key=k8s.io/1053/b334753367407494ed7d4bf8f11028b48a8ad7bd8aa9e1856477c56cf0cf437a snapshotter=windows
time="2024-04-12T19:57:31.199417500Z" level=debug msg="removed snapshot" key=k8s.io/1056/39b901dd6757e26bcf3f08606df943fa772641b5b405b98fa85cd999dff46e46 snapshotter=windows
time="2024-04-12T19:57:31.204255000Z" level=debug msg="removed snapshot" key=k8s.io/1057/ce33f44c105046b0336c7aa698673d1820163ca9c02e853cc13d663f844bc936 snapshotter=windows
time="2024-04-12T19:57:31.204255000Z" level=debug msg="snapshot garbage collected" d=75.0709ms snapshotter=windows
time="2024-04-12T19:57:31.204255000Z" level=debug msg="garbage collected" d=3.5852ms
time="2024-04-12T19:57:33.423847500Z" level=error msg="failed to delete" cmd="C:\\Program Files\\containerd\\containerd-shim-runhcs-v1.exe -namespace k8s.io -address \\\\.\\pipe\\containerd-containerd -publish-binary C:\\Program Files\\containerd\\containerd.exe -id ad19ecce43501e00c9a1223d6315a5b1536aee0bec7a487ffa96a92d7ec961ae -bundle C:\\ProgramData\\containerd\\state\\io.containerd.runtime.v2.task\\k8s.io\\ad19ecce43501e00c9a1223d6315a5b1536aee0bec7a487ffa96a92d7ec961ae delete" error="exit status 1"
time="2024-04-12T19:57:33.423847500Z" level=warning msg="failed to clean up after shim disconnected" error="time=\"2024-04-12T19:57:28Z\" level=info msg=Span duration=2.7429ms endTime=\"2024-04-12 19:57:28.4162491 +0000 UTC m=+0.009291601\" name=HcsOpenComputeSystem parentSpanID=99953dd45ccca59b spanID=aa9324eee8cb09e3 startTime=\"2024-04-12 19:57:28.4135062 +0000 UTC m=+0.006548701\" traceID=c226c1b664ab32cedda1765c4b9fece7\ntime=\"2024-04-12T19:57:28Z\" level=info msg=Span duration=1.5648ms endTime=\"2024-04-12 19:57:28.4300825 +0000 UTC m=+0.023124001\" name=HcsRegisterComputeSystemCallback parentSpanID=99953dd45ccca59b spanID=bb910b0875cb6d2a startTime=\"2024-04-12 19:57:28.4285177 +0000 UTC m=+0.021559201\" traceID=c226c1b664ab32cedda1765c4b9fece7\ntime=\"2024-04-12T19:57:28Z\" level=info msg=Span duration=0s endTime=\"2024-04-12 19:57:28.4300826 +0000 UTC m=+0.023124001\" name=HcsGetComputeSystemProperties parentSpanID=99953dd45ccca59b propertyQuery=\"{}\" spanID=dd8dd93b8dca35b9 startTime=\"2024-04-12 19:57:28.4300826 +0000 UTC m=+0.023124001\" traceID=c226c1b664ab32cedda1765c4b9fece7\ntime=\"2024-04-12T19:57:28Z\" level=info msg=Span duration=\"50.7µs\" endTime=\"2024-04-12 19:57:28.4307711 +0000 UTC m=+0.023812401\" name=HcsTerminateComputeSystem options= parentSpanID=99953dd45ccca59b spanID=ee8bc05519ca9900 startTime=\"2024-04-12 19:57:28.4307204 +0000 UTC m=+0.023761701\" traceID=c226c1b664ab32cedda1765c4b9fece7\n: exit status 1" id=ad19ecce43501e00c9a1223d6315a5b1536aee0bec7a487ffa96a92d7ec961ae namespace=k8s.io
time="2024-04-12T19:57:33.553381900Z" level=debug msg="hcn::HostComputeNamespace::Delete id=2a5a8d11-7d97-4f33-8460-b43ef9a82720"
time="2024-04-12T19:57:33.553457600Z" level=debug msg="hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}"
time="2024-04-12T19:57:33.553457600Z" level=error msg="StopPodSandbox for \"ad19ecce43501e00c9a1223d6315a5b1536aee0bec7a487ffa96a92d7ec961ae\" failed" error="failed to remove network namespace for sandbox \"ad19ecce43501e00c9a1223d6315a5b1536aee0bec7a487ffa96a92d7ec961ae\": hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}"
time="2024-04-12T19:57:34.294856400Z" level=info msg="StopPodSandbox for \"ad19ecce43501e00c9a1223d6315a5b1536aee0bec7a487ffa96a92d7ec961ae\""
time="2024-04-12T19:57:34.294923000Z" level=info msg="Container to stop \"303309f8aa9925bb11538019de0e86f0585f5471bd9a5acf4dec3de0db7a102f\" must be in running or unknown state, current state \"CONTAINER_EXITED\""
time="2024-04-12T19:57:34.342050600Z" level=debug msg="hcn::HostComputeNamespace::Delete id=2a5a8d11-7d97-4f33-8460-b43ef9a82720"
time="2024-04-12T19:57:34.342050600Z" level=debug msg="hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}"
time="2024-04-12T19:57:34.342050600Z" level=error msg="StopPodSandbox for \"ad19ecce43501e00c9a1223d6315a5b1536aee0bec7a487ffa96a92d7ec961ae\" failed" error="failed to remove network namespace for sandbox \"ad19ecce43501e00c9a1223d6315a5b1536aee0bec7a487ffa96a92d7ec961ae\": hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}"
time="2024-04-12T19:57:35.301475100Z" level=info msg="StopPodSandbox for \"ad19ecce43501e00c9a1223d6315a5b1536aee0bec7a487ffa96a92d7ec961ae\""
time="2024-04-12T19:57:35.301475100Z" level=info msg="Container to stop \"303309f8aa9925bb11538019de0e86f0585f5471bd9a5acf4dec3de0db7a102f\" must be in running or unknown state, current state \"CONTAINER_EXITED\""
time="2024-04-12T19:57:35.360882800Z" level=debug msg="hcn::HostComputeNamespace::Delete id=2a5a8d11-7d97-4f33-8460-b43ef9a82720"
time="2024-04-12T19:57:35.360882800Z" level=debug msg="hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}"
time="2024-04-12T19:57:35.360882800Z" level=error msg="StopPodSandbox for \"ad19ecce43501e00c9a1223d6315a5b1536aee0bec7a487ffa96a92d7ec961ae\" failed" error="failed to remove network namespace for sandbox \"ad19ecce43501e00c9a1223d6315a5b1536aee0bec7a487ffa96a92d7ec961ae\": hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}"
time="2024-04-12T19:57:37.277468700Z" level=debug msg=ImageFsInfo
time="2024-04-12T19:57:37.277468700Z" level=debug msg="ImageFsInfo returns filesystem info [&FilesystemUsage{Timestamp:1712951854017475500,FsId:&FilesystemIdentifier{Mountpoint:C:\\ProgramData\\containerd\\root\\io.containerd.snapshotter.v1.windows,},UsedBytes:&UInt64Value{Value:12546299132,},InodesUsed:&UInt64Value{Value:0,},}]"
time="2024-04-12T19:57:41.663276900Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:sch-repro-2-5f57b78546-j6xsn,Uid:61468f83-ebe5-4ff4-9275-aad74afe9f2f,Namespace:default,Attempt:0,}"
time="2024-04-12T19:57:41.663276900Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:sch-repro-2-5f57b78546-xrp9z,Uid:de789ada-682b-4f66-95c3-a2702f3aaac3,Namespace:default,Attempt:0,}"
time="2024-04-12T19:57:41.663276900Z" level=debug msg="Sandbox config &PodSandboxConfig{Metadata:&PodSandboxMetadata{Name:sch-repro-2-5f57b78546-j6xsn,Uid:61468f83-ebe5-4ff4-9275-aad74afe9f2f,Namespace:default,Attempt:0,},Hostname:sch-repro-2-5f57b78546-j6xsn,LogDirectory:\\var\\log\\pods\\default_sch-repro-2-5f57b78546-j6xsn_61468f83-ebe5-4ff4-9275-aad74afe9f2f,DnsConfig:&DNSConfig{Servers:[172.20.0.10],Searches:[default.svc.cluster.local svc.cluster.local cluster.local],Options:[ndots:5],},PortMappings:[]*PortMapping{&PortMapping{Protocol:TCP,ContainerPort:80,HostPort:0,HostIp:,},},Labels:map[string]string{app: sch-repro-2,io.kubernetes.pod.name: sch-repro-2-5f57b78546-j6xsn,io.kubernetes.pod.namespace: default,io.kubernetes.pod.uid: 61468f83-ebe5-4ff4-9275-aad74afe9f2f,pod-template-hash: 5f57b78546,},Annotations:map[string]string{kubernetes.io/config.seen: 2024-04-12T19:57:41.360661100Z,kubernetes.io/config.source: api,},Linux:&LinuxPodSandboxConfig{CgroupParent:,SecurityContext:&LinuxSandboxSecurityContext{NamespaceOptions:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,UsernsOptions:nil,},SelinuxOptions:nil,RunAsUser:nil,ReadonlyRootfs:false,SupplementalGroups:[],Privileged:false,SeccompProfilePath:,RunAsGroup:nil,Seccomp:&SecurityProfile{ProfileType:RuntimeDefault,LocalhostRef:,},Apparmor:nil,},Sysctls:map[string]string{},Overhead:nil,Resources:nil,},Windows:&WindowsPodSandboxConfig{SecurityContext:&WindowsSandboxSecurityContext{RunAsUsername:,CredentialSpec:{\"ActiveDirectoryConfig\":{\"GroupManagedServiceAccounts\":[{\"Name\":\"foouser1\",\"Scope\":\"gmsa.tests.com\"},{\"Name\":\"foouser1\",\"Scope\":\"tests\"}],\"HostAccountConfig\":{\"PluginGUID\":\"{859E1386-BDB4-49E8-85C7-3070B13920E1}\",\"PluginInput\":\"{\\\"credentialArn\\\":\\\"arn:aws:ssm:us-west-2:671234669241:parameter/gmsa-plugin-input\\\"}\",\"PortableCcgVersion\":\"1\"}},\"CmsPlugins\":[\"ActiveDirectory\"],\"DomainJoinConfig\":{\"DnsName\":\"gmsa.tests.com\",\"DnsTreeName\":\"gmsa.tests.com\",\"Guid\":\"4e73610f-5f22-4298-a287-3906361e5346\",\"MachineAccountName\":\"foouser1\",\"NetBiosName\":\"tests\",\"Sid\":\"S-1-5-21-491546681-1899325268-1126201760\"}},HostProcess:false,},},}"
time="2024-04-12T19:57:41.663276900Z" level=debug msg="Sandbox config &PodSandboxConfig{Metadata:&PodSandboxMetadata{Name:sch-repro-2-5f57b78546-xrp9z,Uid:de789ada-682b-4f66-95c3-a2702f3aaac3,Namespace:default,Attempt:0,},Hostname:sch-repro-2-5f57b78546-xrp9z,LogDirectory:\\var\\log\\pods\\default_sch-repro-2-5f57b78546-xrp9z_de789ada-682b-4f66-95c3-a2702f3aaac3,DnsConfig:&DNSConfig{Servers:[172.20.0.10],Searches:[default.svc.cluster.local svc.cluster.local cluster.local],Options:[ndots:5],},PortMappings:[]*PortMapping{&PortMapping{Protocol:TCP,ContainerPort:80,HostPort:0,HostIp:,},},Labels:map[string]string{app: sch-repro-2,io.kubernetes.pod.name: sch-repro-2-5f57b78546-xrp9z,io.kubernetes.pod.namespace: default,io.kubernetes.pod.uid: de789ada-682b-4f66-95c3-a2702f3aaac3,pod-template-hash: 5f57b78546,},Annotations:map[string]string{kubernetes.io/config.seen: 2024-04-12T19:57:41.356008100Z,kubernetes.io/config.source: api,},Linux:&LinuxPodSandboxConfig{CgroupParent:,SecurityContext:&LinuxSandboxSecurityContext{NamespaceOptions:&NamespaceOption{Network:POD,Pid:CONTAINER,Ipc:POD,TargetId:,UsernsOptions:nil,},SelinuxOptions:nil,RunAsUser:nil,ReadonlyRootfs:false,SupplementalGroups:[],Privileged:false,SeccompProfilePath:,RunAsGroup:nil,Seccomp:&SecurityProfile{ProfileType:RuntimeDefault,LocalhostRef:,},Apparmor:nil,},Sysctls:map[string]string{},Overhead:nil,Resources:nil,},Windows:&WindowsPodSandboxConfig{SecurityContext:&WindowsSandboxSecurityContext{RunAsUsername:,CredentialSpec:{\"ActiveDirectoryConfig\":{\"GroupManagedServiceAccounts\":[{\"Name\":\"foouser1\",\"Scope\":\"gmsa.tests.com\"},{\"Name\":\"foouser1\",\"Scope\":\"tests\"}],\"HostAccountConfig\":{\"PluginGUID\":\"{859E1386-BDB4-49E8-85C7-3070B13920E1}\",\"PluginInput\":\"{\\\"credentialArn\\\":\\\"arn:aws:ssm:us-west-2:671234669241:parameter/gmsa-plugin-input\\\"}\",\"PortableCcgVersion\":\"1\"}},\"CmsPlugins\":[\"ActiveDirectory\"],\"DomainJoinConfig\":{\"DnsName\":\"gmsa.tests.com\",\"DnsTreeName\":\"gmsa.tests.com\",\"Guid\":\"4e73610f-5f22-4298-a287-3906361e5346\",\"MachineAccountName\":\"foouser1\",\"NetBiosName\":\"tests\",\"Sid\":\"S-1-5-21-491546681-1899325268-1126201760\"}},HostProcess:false,},},}"
time="2024-04-12T19:57:41.663276900Z" level=debug msg="generated id for sandbox name \"sch-repro-2-5f57b78546-xrp9z_default_de789ada-682b-4f66-95c3-a2702f3aaac3_0\"" podsandboxid=0f97569863c0ac1c313dbed3892a08aa3bc3f0460db31c541f45dc5493970075
time="2024-04-12T19:57:41.663276900Z" level=debug msg="generated id for sandbox name \"sch-repro-2-5f57b78546-j6xsn_default_61468f83-ebe5-4ff4-9275-aad74afe9f2f_0\"" podsandboxid=848d3847af968f27cf6aef38370054c7451c0409243904dd97111ac161aa4ddc
time="2024-04-12T19:57:41.663276900Z" level=debug msg="use OCI runtime {Type:io.containerd.runhcs.v1 Path: Engine: PodAnnotations:[] ContainerAnnotations:[] Root: Options:map[Debug:true DebugType:2] 

I also tried looking through the WinEvent logs for HNS, but there wasn't anything that I saw of value. If there is any additional log that you would like me to fetch, please do let me know!

tzifudzi commented 2 months ago

@grcusanz Thanks for responding with early thoughts about this.

The Guid specified for the compartment does not have the correct Guid syntax...

I think we can rule out having an invalid Guid as I confirmed the syntax is correct and also is shown in the logs @VincentVTran shared. The GUID is 2a5a8d11-7d97-4f33-8460-b43ef9a82720 which meets the required format. i.e. xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.

The namespace still has one or more active container attached...

Interesting theory. I think we can also rule this out based on your the fact that as you suggest we would have likely got the explicit message as described. Perhaps some HCS/HNS traces can help rule this out

VincentVTran commented 2 months ago

I had to make another reproduce environment, but was able to get the HNS log for the namespace delete request:


TimeCreated     : 4/17/2024 9:10:52 PM
Id              : 1055
Message         : RPC request received. Type: 'Delete', Entity: 'Namespace', Id: '{932331a3-b265-4fce-b4c8-079e70982379}', Access Level: 'Administrator', Data: ''
ExpandedMessage : RPC request received. Type: 'Delete', Entity: 'Namespace', Id: '{932331a3-b265-4fce-b4c8-079e70982379}', Access Level: 'Administrator', Data: ''

TimeCreated     : 4/17/2024 9:10:52 PM
Id              : 1055
Message         : RPC request received. Type: 'Query', Entity: 'Namespace', Id: '{932331a3-b265-4fce-b4c8-079e70982379}', Access Level: 'Administrator', Data: '{"ID":"932331a3-b265-4fce-b4c8-079e70982379","Version":55834574851,"Health":{"Extra":{"Resources":{"Additional
                  Params":{},"AllocationOrder":1,"Allocators":[{"AdditionalParams":{},"AllocationOrder":0,"CompartmentGuid":"E569301B-7B5C-4554-B067-22A255632754","CompartmentId":51,"CompartmentType":0,"Flags":0,"Health":{"LastErrorCode":0,"LastUpdateTime":1335785543376
                  01682},"ID":"CD46223C-7C6D-4AC2-BD1E-9D040B1EC3C5","IsPolicy":false,"State":3,"Tag":"Network Compartment"}],"CompartmentOperationTime":32,"Flags":0,"Health":{"LastErrorCode":0,"LastUpdateTime":133578554329309128},"ID":"AD089CEF-2BEA-4E04-B440-5554B64D7
                  C06","PortOperationTime":0,"State":1,"SwitchOperationTime":0,"VfpOperationTime":0}}},"SchemaVersion":{"Major":2,"Minor":0},"NamespaceId":51,"NamespaceGuid":"e569301b-7b5c-4554-b067-22a255632754"}'
ExpandedMessage : RPC request received. Type: 'Query', Entity: 'Namespace', Id: '{932331a3-b265-4fce-b4c8-079e70982379}', Access Level: 'Administrator', Data: '{"ID":"932331a3-b265-4fce-b4c8-079e70982379","Version":55834574851,"Health":{"Extra":{"Resources":{"Additional
                  Params":{},"AllocationOrder":1,"Allocators":[{"AdditionalParams":{},"AllocationOrder":0,"CompartmentGuid":"E569301B-7B5C-4554-B067-22A255632754","CompartmentId":51,"CompartmentType":0,"Flags":0,"Health":{"LastErrorCode":0,"LastUpdateTime":1335785543376
                  01682},"ID":"CD46223C-7C6D-4AC2-BD1E-9D040B1EC3C5","IsPolicy":false,"State":3,"Tag":"Network Compartment"}],"CompartmentOperationTime":32,"Flags":0,"Health":{"LastErrorCode":0,"LastUpdateTime":133578554329309128},"ID":"AD089CEF-2BEA-4E04-B440-5554B64D7
                  C06","PortOperationTime":0,"State":1,"SwitchOperationTime":0,"VfpOperationTime":0}}},"SchemaVersion":{"Major":2,"Minor":0},"NamespaceId":51,"NamespaceGuid":"e569301b-7b5c-4554-b067-22a255632754"}'

TimeCreated     : 4/17/2024 9:10:52 PM
Id              : 1055
Message         : RPC request received. Type: 'Query', Entity: 'Namespace', Id: '{932331a3-b265-4fce-b4c8-079e70982379}', Access Level: 'Administrator', Data: '{"ID":"932331a3-b265-4fce-b4c8-079e70982379","Version":55834574851,"Health":{"Extra":{"Resources":{"Additional
                  Params":{},"AllocationOrder":1,"Allocators":[{"AdditionalParams":{},"AllocationOrder":0,"CompartmentGuid":"E569301B-7B5C-4554-B067-22A255632754","CompartmentId":51,"CompartmentType":0,"Flags":0,"Health":{"LastErrorCode":0,"LastUpdateTime":1335785543376
                  01682},"ID":"CD46223C-7C6D-4AC2-BD1E-9D040B1EC3C5","IsPolicy":false,"State":3,"Tag":"Network Compartment"}],"CompartmentOperationTime":32,"Flags":0,"Health":{"LastErrorCode":0,"LastUpdateTime":133578554329309128},"ID":"AD089CEF-2BEA-4E04-B440-5554B64D7
                  C06","PortOperationTime":0,"State":1,"SwitchOperationTime":0,"VfpOperationTime":0}}},"SchemaVersion":{"Major":2,"Minor":0},"NamespaceId":51,"NamespaceGuid":"e569301b-7b5c-4554-b067-22a255632754"}'
ExpandedMessage : RPC request received. Type: 'Query', Entity: 'Namespace', Id: '{932331a3-b265-4fce-b4c8-079e70982379}', Access Level: 'Administrator', Data: '{"ID":"932331a3-b265-4fce-b4c8-079e70982379","Version":55834574851,"Health":{"Extra":{"Resources":{"Additional
                  Params":{},"AllocationOrder":1,"Allocators":[{"AdditionalParams":{},"AllocationOrder":0,"CompartmentGuid":"E569301B-7B5C-4554-B067-22A255632754","CompartmentId":51,"CompartmentType":0,"Flags":0,"Health":{"LastErrorCode":0,"LastUpdateTime":1335785543376
                  01682},"ID":"CD46223C-7C6D-4AC2-BD1E-9D040B1EC3C5","IsPolicy":false,"State":3,"Tag":"Network Compartment"}],"CompartmentOperationTime":32,"Flags":0,"Health":{"LastErrorCode":0,"LastUpdateTime":133578554329309128},"ID":"AD089CEF-2BEA-4E04-B440-5554B64D7
                  C06","PortOperationTime":0,"State":1,"SwitchOperationTime":0,"VfpOperationTime":0}}},"SchemaVersion":{"Major":2,"Minor":0},"NamespaceId":51,"NamespaceGuid":"e569301b-7b5c-4554-b067-22a255632754"}'

In this reproduction, the containerd error log is seen as below:

time="2024-04-17T20:58:07.196535100Z" level=info msg="StopPodSandbox for \"4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d\""
time="2024-04-17T20:58:07.196721100Z" level=info msg="Container to stop \"17b76d581bc8ccf200529f1fb13cd2e9517a0fd038195e7252097beecb18bc18\" must be in running or unknown state, current state \"CONTAINER_EXITED\""
time="2024-04-17T20:58:07.243511200Z" level=debug msg="hcn::HostComputeNamespace::Delete id=932331a3-b265-4fce-b4c8-079e70982379"
time="2024-04-17T20:58:07.243511200Z" level=debug msg="hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}"
time="2024-04-17T20:58:07.243511200Z" level=error msg="StopPodSandbox for \"4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d\" failed" error="failed to remove network namespace for sandbox \"4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d
\": hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}"
VincentVTran commented 2 months ago

Also here is the namespace log using Get-HnsNamespace:

ActivityId       : AD089CEF-2BEA-4E04-B440-5554B64D7C06
AdditionalParams :
CompartmentGuid  : E569301B-7B5C-4554-B067-22A255632754
CompartmentId    : 51
Containers       : {4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d}
Flags            : 0
Health           : @{LastErrorCode=0; LastUpdateTime=133578554329309128}
ID               : 932331A3-B265-4FCE-B4C8-079E70982379
IsDefault        : False
Policies         : {}
ResourceList     : {}
SchemaVersion    : @{Major=0; Minor=0}
State            : 4
Version          : 55834574851
Resources        : @{AdditionalParams=; AllocationOrder=1; Allocators=System.Object[]; CompartmentOperationTime=32; Flags=0; Health=; ID=AD089CEF-2BEA-4E04-B440-5554B64D7C06; PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0}
tzifudzi commented 2 months ago

After seeing the logs @VincentVTran shared it suggests that what you said when you mentioned below is likely whats happening.

"The namespace still has one or more active container attached..."

@grcusanz Can you please confirm this based on the log output? In the above log you can clearly see a container still attached to the namespace.

...
Containers       : {4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d}
...

When @VincentVTran co-debugged and tried to kill the container forcibly its still failing to delete the namespace and the container always shows up as attached. @VincentVTran Will be taking a deeper dive to find more clues about why the namespace fails to release the container attachment.

As a future improvement it would be nice if HCN operation response returned something more explicit in the error message.

VincentVTran commented 2 months ago

Also when running .\hcsdiag.exe kill 4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d to kill the pod, shell remains in a stuck state. When listing all the containers, the following output is seen - 4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d, Windows Server Container, Stopped, containerd-shim-runhcs-v1.exe. Can you also confirm that this is expected behavior from HCS? I suspect this may also be the reason why the namespace can't be deleted. Shouldn't the container be released from the HNS namespace if the container is stopped?

VincentVTran commented 2 months ago

Below is the HCS logs:

TimeCreated : 4/17/2024 11:35:28 PM Id : 2003 Message : [4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d] Terminate compute system, result 0xC0370103 ExpandedMessage : [4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d] Terminate compute system, result 0xC0370103

TimeCreated : 4/17/2024 11:33:47 PM Id : 2003 Message : [4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d] Terminate compute system, result 0xC0370103 ExpandedMessage : [4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d] Terminate compute system, result 0xC0370103

TimeCreated : 4/17/2024 11:31:15 PM Id : 2003 Message : [4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d] Terminate compute system, result 0xC0370103 ExpandedMessage : [4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d] Terminate compute system, result 0xC0370103


To confirm, the process does not exist anymore. Confirmed with `Get-Process -Id 61132`.
grcusanz commented 2 months ago

Hi All, thanks for the additional information, in particular the HNS snippet. Sorry I was out for a few days which delayed my response. I traced the return path from the two places where this exception occurs and matched it with the message in the log. It looks like the detailed message does get lost as the exception works its way through the call stack, so if there is an active container you will still get a generic message in the logs. Given your further investigations this is almost certainly the case.

Detach-HnsEndpoint should remove the container ID from the namespace, and then you should be able to delete the namespace. Try using that cmdlet to do the cleanup, then try to delete the namespace. Report back here with your findings.

VincentVTran commented 2 months ago

Hey @grcusanz , after cross-checking the endpoint IP with the remaining container IP - it seems like the HNSEndpoint for that container is already deleted. Using kubectl get pods -A -o=wide, I got the the following result:

default        sch-repro-2-5f57b78546-nmw9p    0/1     Terminating   0             6d    10.0.12.34    ip-10-0-14-181.us-west-2.compute.internal   <none>           <none>
default        sch-repro-2-5f57b78546-rbd4n    0/1     Terminating   0             6d    10.0.13.111   ip-10-0-14-181.us-west-2.compute.internal   <none>  

When fetching the HNS-Endpoint using Get-HNSEndpoint, I failed to see any endpoints with the following ip address of 10.0.12.34 for the following containers. Yet the containers are still attached to the network namespace.

In contrast to normal working pods, the endpoints with the corresponding IP addresses appear.

VincentVTran commented 2 months ago

Also before the endpoint was deleted, it has the name of "Endpoint name "cid-932331a3-b265-4fce-b4c8-079e70982379". So when trying to rerun the Detach-HnsEndpoint command, the following output was given:

Cmdlet Detach-HnsEndpoint at command pipeline position 1
Supply values for the following parameters:
EndpointID: 932331a3-b265-4fce-b4c8-079e70982379
ContainerID: 4b57fac484487d41925f27601c210d4a7634eb5407f9fffe28037f5eba3bbd0d
Invoke-HnsRequest : @{Success=False; Error=The endpoint was not found. }
At C:\Program Files\WindowsPowerShell\Modules\HNS\0.2.4\HNS.V2.psm1:2344 char:12
+     return Invoke-HnsRequest -Method $Method -Type endpoints -Action  ...
+            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Invoke-HnsRequest
syedthasan commented 2 months ago

Hi grcusanz, do you have any feedback about the potential next steps for this issue? As confirmed in the earlier logs, HNSEndpoint for the container is already deleted as Get-HNSEndpoint does not return any container with 10.0.12.34, yet the containers are attached to the namespace, as confirmed by kubectl get pods -A -o=wide and HcnDeleteNamespace fails with the error:

hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) 
{
  "Success": false,
  "Error": "The specified request is unsupported.",
  "ErrorCode": 2151350293
}
grcusanz commented 2 months ago

Hi All, I've been experimenting with this and managed to figure a few things out. First, run this command to get detailed info on the namespace: hnsdiag list namespaces <namespace id> -d

For example:

PS C:\Users\Administrator> hnsdiag list namespaces 7CB720C0-AC4C-4266-92E1-DC277B2A9B94 -d
{
    "ActivityId":"3B27B61E-0B57-4185-9ACC-4D7D23AA7580",
    "AdditionalParams":{

    },
    "CompartmentGuid":"4E95F905-60C7-4585-96FC-7964B546428B",
    "CompartmentId":6,
    "Containers":[
        "291cd4892b99",
        "291cd4892b99",
        "0f70f0bcccaf"
    ],
    "Flags":0,
    "Health":{
        "LastErrorCode":0,
        "LastUpdateTime":133590635864554881
    },
    "ID":"7CB720C0-AC4C-4266-92E1-DC277B2A9B94",
    "IsDefault":false,
    "Policies":[

    ],
    "ResourceList":[
        {
            "Data":{
                "Id":"7CC03AF7-2978-449F-9968-595862B80F8B"
            },
            "Type":"Endpoint"
        }
    ],
    "State":1,
    "Version":55834574851,
    "Resources":{
        "AdditionalParams":{

        },
        "AllocationOrder":1,
        "Allocators":[
            {
                "AdditionalParams":{

                },
                "AllocationOrder":0,
                "CompartmentGuid":"4E95F905-60C7-4585-96FC-7964B546428B",
                "CompartmentId":6,
                "CompartmentType":0,
                "Flags":0,
                "Health":{
                    "LastErrorCode":0,
                    "LastUpdateTime":133590636156275966
                },
                "ID":"29944F60-D8A7-4561-9733-6D36BF4A443D",
                "IsPolicy":false,
                "State":3,
                "Tag":"Network Compartment"
            }
        ],
        "CompartmentOperationTime":0,
        "Flags":0,
        "Health":{
            "LastErrorCode":0,
            "LastUpdateTime":133590635864554881
        },
        "ID":"3B27B61E-0B57-4185-9ACC-4D7D23AA7580",
        "PortOperationTime":0,
        "State":1,
        "SwitchOperationTime":0,
        "VfpOperationTime":0
    }
}

As long as you see items in the 'Containers' section you will not be able to delete the namespace. Fortunately, there is a way to clean that up.
First, check the resourcelist to see if there are any endpoints listed:

        {
            "Data":{
                "Id":"7CC03AF7-2978-449F-9968-595862B80F8B"
            },
            "Type":"Endpoint"
        }

If the endpoint still exists when you run get-hnsendpoint, then delete it using remove-hnsendpoint.

If the endpoint no longer exists, then follow this example to remove the reference from the namespace (replace \<namespace id> and \<endpoint id> with the appropriate guids):

PS C:\Users\Administrator> invoke-hnsrequest -method POST namespaces -Action removeResource -id <namespace id> -Data '{ "Data": { "Id": "<endoint id>"}, "Type": "Endpoint" }' -Verbose
VERBOSE: Invoke-HnsRequest Type[namespaces] Method[POST] Path[/namespaces/7CB720C0-AC4C-4266-92E1-DC277B2A9B94/removeResource] Data[{ "Data": { "Id": "7CC03AF7-2978-449F-9968-595862B80F8B"}, "Type": "Endpoint" }]
VERBOSE: Result : {"Success":true}

The Result should say Success:true. Note if you re-run hnsdiag it may still show up there. At this point ignore it. Repeat for any other namespace ids that no longer exist.

Now you should be able to remove the container references using a similar command (replace \<namespace id> and \<container id> with the appropriate Ids).:

PS C:\Users\Administrator> invoke-hnsrequest -method POST namespaces -Action detach -id <namespace id> -Data '{ "ContainerId": "<container id>", "SystemType": "Container" }' -Verbose
VERBOSE: Invoke-HnsRequest Type[namespaces] Method[POST] Path[/namespaces/7CB720C0-AC4C-4266-92E1-DC277B2A9B94/detach] Data[{ "ContainerId": "0f70f0bcccae", "SystemType": "Container" }]
VERBOSE: Result : {"Success":true}

Repeat for each container id.

If all of the above is successful, then you should be able to re-run: hnsdiag list namespaces <namespace id> -d And the container list should be empty.

Now you can delete the namespace: Get-HnsNamespace -id <namespace id> | remove-hnsnamespace

Let me know if that helps. It's a littl tricky to get right. If you get stuck, please share your output and the output of:

hnsdiag list all
get-hnsnamespace

Thanks!

KlwntSingh commented 2 months ago

Hi @grcusanz, Thanks for sharing the detailed debug steps, I followed them but still not able to delete HNS namespace. I am sharing output of each step to help you with analysis.

  1. Description of HNS Namespace, We can see container being mapped to namespace even when container is not running. We don't have any endpoint in resourceList. I will delete container from namespace in next command.

    hnsdiag list namespaces 18975570-D312-49D6-884B-59D780E77976 -d
    {
    "ActivityId":"776C0E9A-D1D9-43D4-A1C8-CA7CE377A521",
    "AdditionalParams":{
    
    },
    "CompartmentGuid":"CE9D7B9C-6094-4382-B1C8-6A1230E06FA6",
    "CompartmentId":44,
    "Containers":[
        "74f617c66d9be0e07d8ada45e831f157d6814ef54a83be85e83e9f58805fcb99"
    ],
    "Flags":0,
    "Health":{
        "LastErrorCode":0,
        "LastUpdateTime":133578557334505589
    },
    "ID":"18975570-D312-49D6-884B-59D780E77976",
    "IsDefault":false,
    "Policies":[
    
    ],
    "ResourceList":[
    
    ],
    "SchemaVersion":{
        "Major":0,
        "Minor":0
    },
    "State":4,
    "Version":55834574851,
    "Resources":{
        "AdditionalParams":{
    
        },
        "AllocationOrder":1,
        "Allocators":[
            {
                "AdditionalParams":{
    
                },
                "AllocationOrder":0,
                "CompartmentGuid":"CE9D7B9C-6094-4382-B1C8-6A1230E06FA6",
                "CompartmentId":44,
                "CompartmentType":0,
                "Flags":0,
                "Health":{
                    "LastErrorCode":0,
                    "LastUpdateTime":133578557934959281
                },
                "ID":"1D26065C-544F-4EB6-A7B9-2AA7941F2C15",
                "IsPolicy":false,
                "State":3,
                "Tag":"Network Compartment"
            }
        ],
        "CompartmentOperationTime":0,
        "Flags":0,
        "Health":{
            "LastErrorCode":0,
            "LastUpdateTime":133578557334505589
        },
        "ID":"776C0E9A-D1D9-43D4-A1C8-CA7CE377A521",
        "PortOperationTime":0,
        "State":1,
        "SwitchOperationTime":0,
        "VfpOperationTime":0
    }
    }
  2. Output of command when deleting container from namespace. I got following error but internally it did delete container mapping as shown in step#3.
    
    invoke-hnsrequest -method POST namespaces -Action detach -id 18975570-D312-49D6-884B-59D780E77976 -Data '{ "ContainerId": "74f617c66d9be0e07d8ada45e831f157d6814ef54a83be85e83e9
    f58805fcb99", "SystemType": "Container" }' -Verbose
    VERBOSE: Invoke-HnsRequest Type[namespaces] Method[POST] Path[/namespaces/18975570-D312-49D6-884B-59D780E77976/detach] Data[{ "ContainerId":
    "74f617c66d9be0e07d8ada45e831f157d6814ef54a83be85e83e9f58805fcb99", "SystemType": "Container" }]
    VERBOSE: Result : {"Success":false,"Error":"An internal error occurred. Unable to delete compartment"}
    invoke-hnsrequest : @{Success=False; Error=An internal error occurred. Unable to delete compartment}
    At line:1 char:1
    + invoke-hnsrequest -method POST namespaces -Action detach -id 18975570 ...
    + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Invoke-HnsRequest
3. Description of hns namespace. Container is no longer associated with namespace and it should be good for deletion. I will try deleting namespace in next step.

C:> hnsdiag list namespaces 18975570-D312-49D6-884B-59D780E77976 -d { "ActivityId":"776C0E9A-D1D9-43D4-A1C8-CA7CE377A521", "AdditionalParams":{

},
"CompartmentGuid":"CE9D7B9C-6094-4382-B1C8-6A1230E06FA6",
"CompartmentId":44,
"Containers":[

],
"Flags":0,
"Health":{
    "LastErrorCode":0,
    "LastUpdateTime":133578557334505589
},
"ID":"18975570-D312-49D6-884B-59D780E77976",
"IsDefault":false,
"Policies":[

],
"ResourceList":[

],
"SchemaVersion":{
    "Major":0,
    "Minor":0
},
"State":5,
"Version":55834574851,
"Resources":{
    "AdditionalParams":{

    },
    "AllocationOrder":1,
    "Allocators":[
        {
            "AdditionalParams":{

            },
            "AllocationOrder":0,
            "CompartmentGuid":"CE9D7B9C-6094-4382-B1C8-6A1230E06FA6",
            "CompartmentId":44,
            "CompartmentType":0,
            "Flags":0,
            "Health":{
                "LastErrorCode":0,
                "LastUpdateTime":133578557934959281
            },
            "ID":"1D26065C-544F-4EB6-A7B9-2AA7941F2C15",
            "IsPolicy":false,
            "State":3,
            "Tag":"Network Compartment"
        }
    ],
    "CompartmentOperationTime":0,
    "Flags":0,
    "Health":{
        "LastErrorCode":0,
        "LastUpdateTime":133578557334505589
    },
    "ID":"776C0E9A-D1D9-43D4-A1C8-CA7CE377A521",
    "PortOperationTime":0,
    "State":1,
    "SwitchOperationTime":0,
    "VfpOperationTime":0
}

}

4. output of command to delete hns namespace, it threw following error and failed to delete namespace.

Get-HnsNamespace -id 18975570-D312-49D6-884B-59D780E77976 | remove-hnsnamespace ReportErrors : HcnDeleteNamespace -- HRESULT: 2147943759. Result: {"Success":false,"Error":"An internal error occurred. ","ErrorCode":2147943759} At C:\Program Files\WindowsPowerShell\Modules\HNS\0.2.4\HNS.V2.psm1:596 char:13

ActivityId : 776C0E9A-D1D9-43D4-A1C8-CA7CE377A521 AdditionalParams : CompartmentGuid : CE9D7B9C-6094-4382-B1C8-6A1230E06FA6 CompartmentId : 44 Containers : {} Flags : 0 Health : @{LastErrorCode=0; LastUpdateTime=133578557334505589} ID : 18975570-D312-49D6-884B-59D780E77976 IsDefault : False Policies : {} ResourceList : {} SchemaVersion : @{Major=0; Minor=0} State : 5 Version : 55834574851 Resources : @{AdditionalParams=; AllocationOrder=1; Allocators=System.Object[]; CompartmentOperationTime=0; Flags=0; Health=; ID=776C0E9A-D1D9-43D4-A1C8-CA7CE377A521; PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0}



We are still not able to delete HNS namespace as can be seen in command output of step#4. We can also see that Step#2 was successfully in deleting container from hns namespace even though output shows failure.
KlwntSingh commented 2 months ago

Hi @grcusanz , Can you please provide feedback with next steps? I followed your suggestion as mentioned in previous comment but it still didn't help with deleting HNS namespace. Please refer my previous comment with more details on output of each step. Please let me know any more information is required from my side.

jiechen0826 commented 1 month ago

Hi @grcusanz, just follow up to see if you have any updates on this? Thanks!

SovietFrontier commented 2 weeks ago

Hi @grcusanz,

Is there any updates on this yet that you can share? Thank you!