loft-sh / vcluster

vCluster - Create fully functional virtual Kubernetes clusters - Each vcluster runs inside a namespace of the underlying k8s cluster. It's cheaper than creating separate full-blown clusters and it offers better multi-tenancy and isolation than regular namespaces.
https://www.vcluster.com
Apache License 2.0

Namespace deletion stuck due to pod out of sync? #119

Closed Fabian-K closed 2 years ago

Fabian-K commented 2 years ago

Hi,

we are currently facing the following issue when running an integration test against a vcluster (version 0.3.1). The test deploys some workload, in this case Istio, to the cluster. After executing the actual test, it deletes all resources one after another.

During this deletion phase, it gets stuck when deleting a namespace (istio-system). The namespace cannot be deleted because a pod still remains ("Some resources are remaining: pods. has 1 resource instances").

According to vcluster, the pod is in the "Running" phase; however, on the host cluster the pod is already gone.
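
To illustrate the mismatch, here is a rough sketch of how we checked it (the kubeconfig context names and the host namespace are placeholders for our setup):

# inside the vcluster: the pod still reports phase Running
kubectl --context my-vcluster get pod ingressgateway-6d8b9cf54c-j6lfl -n istio-system

# on the host cluster: the corresponding synced pod is already gone from the vcluster namespace
kubectl --context host-cluster get pods -n vcluster-namespace

# the namespace status shows the remaining resource blocking the deletion
kubectl --context my-vcluster get namespace istio-system -o yaml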

The logs of the vcluster controller show the following lines over and over again:

08:32:06.985 E0819 08:32:06.985112       1 controller.go:302] controller-runtime: manager: reconciler group  reconciler kind Pod: controller: pod-forward: name ingressgateway-6d8b9cf54c-j6lfl namespace istio-system: Reconciler error get service account ingressgateway-service-account: ServiceAccount "ingressgateway-service-account" not found
08:32:08.377 E0819 08:32:08.377566       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
08:32:11.602 E0819 08:32:11.602302       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
08:32:12.717 E0819 08:32:12.717352       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
08:32:16.002 E0819 08:32:16.002687       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
08:32:21.238 E0819 08:32:21.238534       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
08:32:23.981 E0819 08:32:23.980936       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
08:32:24.142 E0819 08:32:24.142580       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
08:32:25.381 E0819 08:32:25.381046       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
08:32:27.038 E0819 08:32:27.038698       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
08:32:33.481 E0819 08:32:33.481459       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
08:32:35.854 E0819 08:32:35.853780       1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: the server could not find the requested resource
08:32:39.094 E0819 08:32:39.094549       1 namespace_controller.go:162] deletion of namespace istio-system failed: unexpected items still remain in namespace: istio-system for gvr: /v1, Resource=pods

Could it be that the missing service account (most likely deleted in the meantime) leads to the pods no longer being synced? Let me know if you need more information!
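
In case it helps with reproducing: whether the service account from the error message still exists in the vcluster can be checked with something along the lines of the following (context name is again a placeholder):

kubectl --context my-vcluster get serviceaccount ingressgateway-service-account -n istio-system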

Thanks, Fabian

FabianKramm commented 2 years ago

@Fabian-K thanks for creating this issue! Could you also provide the logs of the syncer container when this issue occurs? Does manually deleting the pod have an effect?

Fabian-K commented 2 years ago

Thanks for looking into this! Sure, the syncer log mostly contains the following line over and over: controller.go:302] controller-runtime: manager: reconciler group  reconciler kind Pod: controller: pod-forward: name ingressgateway-6d8b9cf54c-j6lfl namespace istio-system: Reconciler error get service account ingressgateway-service-account: ServiceAccount "ingressgateway-service-account" not found

I've attached the complete logs of the syncer and the vcluster container. In the files, the local time is also prepended to each line.

If I remember correctly, force-deleting the pod does resolve the issue. I think I did that around 9:28 but I'll try to reproduce and double-check!

log-syncer.txt log-vcluster.txt

Update: Yes, force-deleting the "ghost" pod on vcluster side does help!
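
In case anyone else runs into this, the workaround was roughly the following force deletion from the vcluster context (context name is a placeholder); after that, the istio-system namespace deletion went through:

# force-remove the stale "ghost" pod inside the vcluster
kubectl --context my-vcluster delete pod ingressgateway-6d8b9cf54c-j6lfl -n istio-system --grace-period=0 --force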

FabianKramm commented 2 years ago

@Fabian-K thanks a lot for the information! I think we found the issue and included the fix in the newest beta (v0.4.0-beta.2). It would be great if you could test with this version and confirm that the problem is fixed. (We just released the new version, so it will probably take around an hour until it is available in the helm registry.)
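
For testing, upgrading an existing vcluster release to the beta should look roughly like this (the release name and host namespace are placeholders, and this assumes the chart version matches the release tag):

# upgrade the vcluster helm release to the beta chart, keeping the existing values
helm upgrade my-vcluster vcluster --repo https://charts.loft.sh -n host-namespace --version 0.4.0-beta.2 --reuse-values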

Fabian-K commented 2 years ago

Cool, thanks a lot! I'll give it a try - most likely tomorrow :)

Fabian-K commented 2 years ago

Yes, this is fixed in v0.4.0-beta.2. Thanks a lot @FabianKramm !