is_gke() function does not always return true on some nodes

EddieX64 commented 6 months ago

Hello everyone,

I recently deployed Sysbox on GKE 1.27 using Ubuntu with containerd image type according to steps described in sysbox user guide and encountered an issue where the is_gke() function does not always return true on some nodes when the cluster autoscaler is scaling up nodes. This is peculiar because the curl command should return HTTP 200 every time. To investigate this issue, I modified the DaemonSet by including init container with the following content to confirm that there are no problems with the GKE metadata server:

initContainers:
      - name: check-gke-metadata-api
        image: curlimages/curl:latest
        command: ['sh', '-c', 'until curl -s -o /dev/null -w "%{http_code}" --connect-timeout 1 -H "Metadata-Flavor: Google" 169.254.169.254/computeMetadata/v1/instance/attributes/cluster-name; do echo Waiting for GKE metadata to become available; sleep 2; done; echo -e "\nGKE metadata is available"']

Init container always return HTTP 200 and i can see it in the DaemonSet pod logs on a newly created node:

2024-04-23 16:21:22.171 UTC 200
2024-04-23 16:21:22.171 UTC GKE metadata is available

However, after CRI-O installation has started, i dont see the message "Configuring CRI-O for GKE" in the same log, which should be the case as GKE metadata server answered with HTTP 200 in init container:

2024-04-23 16:37:25.060 UTC 200
2024-04-23 16:37:25.060 UTC GKE metadata is available
2024-04-23 16:39:01.805 UTC Detected Kubernetes version v1.27
2024-04-23 16:39:01.823 UTC Adding K8s taint "sysbox-runtime=not-running:NoSchedule" to node ...
2024-04-23 16:39:01.924 UTC node/gke-agent-pool-e4525f26-8ws4 modified
2024-04-23 16:39:01.927 UTC Adding K8s label "crio-runtime=installing" to node ...
2024-04-23 16:39:02.021 UTC node/gke-agent-pool-e4525f26-8ws4 labeled
2024-04-23 16:39:02.024 UTC Deploying CRI-O installer agent on the host (v1.27) ...
2024-04-23 16:39:02.659 UTC Running CRI-O installer agent on the host (may take several seconds) ...
2024-04-23 16:39:09.623 UTC Removing CRI-O installer agent from the host ...
2024-04-23 16:39:09.623 UTC Configuring CRI-O ...
2024-04-23 16:39:09.623 UTC Adding K8s label "sysbox-runtime=installing" to node ...
2024-04-23 16:39:09.623 UTC node/gke-agent-pool-e4525f26-8ws4 labeled
2024-04-23 16:39:09.623 UTC Installing Sysbox dependencies on host ...

I'm not observing a direct correlation with the GKE configuration, as the message "Configuring CRI-O for GKE" still appears on some of the nodes that have been auto-scaled. This leads me to believe that the issue might not be directly related to the GKE configuration. Any insights or suggestions would be greatly appreciated.

EddieX64 commented 5 months ago

So i went to check the gke-metadata-server pod logs on the problematic node at the timing when i see message Configuring CRI-O ... and found something interesting in the logs. During the curl request from sysbox-deploy-k8s pod to gke-metadata-api pod:

2024-04-27 06:47:39.977 UTC "[conn-id:a6244d77fb660d44] Unable to find pod: generic::unavailable: connection error: desc = "transport: Error while dialing: dial unix /var/run/containerd/containerd.sock: connect: no such file or directory""

2024-04-27 06:47:39.977 UTC "[conn-id:a6244d77fb660d44 rpc-id:2b705477a92aab1c] Caller is not authenticated"

I can even see the previous successful reply from the init container around 30 seconds ago:

2024-04-27 06:47:10.533 UTC "[conn-id:aa3fbe043a67920d ip:100.64.4.2 pod:kube-system/sysbox-deploy-k8s-clpjl] Found calling pod; storing in context"

2024-04-27 06:47:10.533 UTC "[conn-id:aa3fbe043a67920d ip:100.64.4.2 pod:kube-system/sysbox-deploy-k8s-clpjl rpc-id:9d336f826f9e0108] "/computeMetadata/v1/instance/attributes/cluster-name" HTTP/200, started at 2024-04-27 06:47:10.533484842 +0000 UTC m=+8.723607231"

Indeed, there is something wrong with the gke-metadata-api as the request caller is not authenticated. For the moment i have no idea how to deal with it because previously it already replied with HTTP 200 in init container. It might be worth opening a support ticket to GCP support for additional investigation on the problem with GKE metadata api.

ctalledo commented 5 months ago

Fixed via https://github.com/nestybox/sysbox-pkgr/pull/127.

Thanks @EddieX64 for helping us catch and fix this!

nestybox / sysbox-pkgr

is_gke() function does not always return true on some nodes #126