kumahq / kuma

🐻 The multi-zone service mesh for containers, Kubernetes and VMs. Built with Envoy. CNCF Sandbox Project.
https://kuma.io/install
Apache License 2.0

Pod start delayed by 30-60s on GKE when oslogin enabled with CNI #11038

bartsmykla closed this issue 1 month ago

bartsmykla commented 1 month ago

What happened?

Kuma is susceptible to the same issue described in https://github.com/istio/istio/issues/48416. Quoting that report:

With this setup, all pods take ~60s to start up.

The root cause is a bad interaction between these components, which creates a circular dependency. When a pod starts, istio-cni is invoked. At this point the Pod has been created, but no containers are running yet. istio-cni more or less calls nsenter -- iptables-restore. Our iptables commands use the xt_owner module (-m owner), which in turn calls getpwnam (https://git.netfilter.org/iptables/tree/extensions/libxt_owner.c#n150). This triggers PAM for authentication. When OSLogin is enabled on the GCE machine, a PAM module is loaded: https://github.com/GoogleCloudPlatform/guest-oslogin. When triggered, it calls the metadata server (with a request like /computeMetadata/v1/oslogin/users?username=...).
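The user lookup at the center of this chain can be timed on its own. The following is a minimal diagnostic sketch, not code from Kuma or Istio: it relies on Python's standard pwd module, which resolves names through the system passwd database (the same libc lookup family as getpwnam). The helper name time_passwd_lookup and the example user name are assumptions made for illustration.

```python
from __future__ import annotations

import pwd
import time


def time_passwd_lookup(name: str) -> tuple[bool, float]:
    """Resolve a user name via the system passwd database and time the call.

    Returns (found, elapsed_seconds). The helper name is hypothetical;
    it only wraps the standard-library pwd.getpwnam call.
    """
    start = time.monotonic()
    try:
        pwd.getpwnam(name)
        found = True
    except KeyError:
        # pwd.getpwnam raises KeyError when no entry matches the name.
        found = False
    return found, time.monotonic() - start


if __name__ == "__main__":
    # "root" is only an example name chosen for this sketch.
    found, elapsed = time_passwd_lookup("root")
    print(f"found={found} elapsed={elapsed:.3f}s")
```

A lookup that takes tens of seconds instead of milliseconds would be consistent with the delay described above, although this sketch alone does not show which lookup backend is responsible.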

The GKE Metadata Server will detect this as a request coming from the pod. The pod, however, has not yet started, so the request is denied. The PAM module will retry this request a number of times before giving up. Once it gives up, execution can continue as usual.
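The retry-until-give-up behaviour explains why the delay is roughly constant. Here is a minimal sketch of that pattern, assuming a fixed number of attempts and a fixed wait between them. The report does not state the real retry count or interval, so the values 6 and 10.0 below are purely hypothetical, as are the names lookup_with_retries and fetch.

```python
from __future__ import annotations

import time
from typing import Callable


def lookup_with_retries(
    fetch: Callable[[], bool],
    attempts: int,
    delay_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[bool, float]:
    """Call fetch until it succeeds or attempts run out.

    Returns (succeeded, seconds_spent_waiting). All names and parameters
    are hypothetical; this is not the guest-oslogin implementation.
    """
    waited = 0.0
    for _ in range(attempts):
        if fetch():
            return True, waited
        # The request was denied: wait, then try again.
        sleep(delay_s)
        waited += delay_s
    # Every attempt was denied; the caller can now continue as usual.
    return False, waited


if __name__ == "__main__":
    # Simulate a metadata server that denies every request. A no-op sleep
    # is injected so the example finishes immediately.
    result = lookup_with_retries(lambda: False, 6, 10.0, sleep=lambda s: None)
    print(result)  # (False, 60.0)
```

With every request denied, the total wait is simply attempts × delay, which is why each pod start pays the same fixed penalty before execution continues.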

This impacts all Istio versions, but only recent GKE versions (later patch releases in GKE 1.25+). It does not impact Autopilot, which cannot use oslogin.

slonka commented 2 weeks ago

@bartsmykla did you check kuma on GKE autopilot?