Closed EddieX64 closed 5 months ago
So I went to check the gke-metadata-server pod logs on the problematic node at the time I see the "Configuring CRI-O ..." message, and found something interesting. During the curl request from the sysbox-deploy-k8s pod to the gke-metadata-server pod:
2024-04-27 06:47:39.977 UTC "[conn-id:a6244d77fb660d44] Unable to find pod: generic::unavailable: connection error: desc = "transport: Error while dialing: dial unix /var/run/containerd/containerd.sock: connect: no such file or directory""
2024-04-27 06:47:39.977 UTC "[conn-id:a6244d77fb660d44 rpc-id:2b705477a92aab1c] Caller is not authenticated"
I can even see the previous successful reply to the init container around 30 seconds earlier:
2024-04-27 06:47:10.533 UTC "[conn-id:aa3fbe043a67920d ip:100.64.4.2 pod:kube-system/sysbox-deploy-k8s-clpjl] Found calling pod; storing in context"
2024-04-27 06:47:10.533 UTC "[conn-id:aa3fbe043a67920d ip:100.64.4.2 pod:kube-system/sysbox-deploy-k8s-clpjl rpc-id:9d336f826f9e0108] "/computeMetadata/v1/instance/attributes/cluster-name" HTTP/200, started at 2024-04-27 06:47:10.533484842 +0000 UTC m=+8.723607231"
Indeed, something is wrong on the gke-metadata-server side: the caller is not authenticated, likely because the metadata server can no longer identify the calling pod (note the "Unable to find pod" error above, where dialing /var/run/containerd/containerd.sock fails with "no such file or directory"). For the moment I have no idea how to deal with it, since the same request already got an HTTP 200 reply in the init container. It might be worth opening a ticket with GCP support for additional investigation into the GKE metadata API behavior.
Fixed via https://github.com/nestybox/sysbox-pkgr/pull/127.
Thanks @EddieX64 for helping us catch and fix this!
Hello everyone,
I recently deployed Sysbox on GKE 1.27 (Ubuntu with containerd node image type) following the steps in the Sysbox user guide, and encountered an issue where the is_gke() function does not always return true on some nodes while the cluster autoscaler is scaling up. This is peculiar because the curl command should return HTTP 200 every time. To investigate, I modified the DaemonSet to include an init container that confirms there are no problems reaching the GKE metadata server. The init container always returns HTTP 200, and I can see it in the DaemonSet pod logs on a newly created node:
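The init container spec itself isn't reproduced here, but the kind of probe involved can be sketched as follows. This is a hypothetical reconstruction, not the actual sysbox-deploy-k8s code: the function name is_gke comes from the issue text, and the URL path and Metadata-Flavor header are taken from the metadata-server log lines quoted in this thread; the exact curl flags are assumptions.

```shell
#!/bin/sh
# Hedged sketch of an is_gke()-style probe against the GKE metadata server.
# The real sysbox-deploy-k8s script may differ.
is_gke() {
  # "%{http_code}" prints only the HTTP status code; --max-time avoids
  # hanging when the metadata server is unreachable (curl prints "000").
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 \
    -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/cluster-name" \
    2>/dev/null)
  [ "$code" = "200" ]
}

if is_gke; then
  echo "GKE detected"
else
  echo "not GKE"
fi
```

If the metadata server replies with "Caller is not authenticated" (an HTTP error rather than 200), a check like this would return false, which would explain the missing "Configuring CRI-O for GKE" message despite the init container's earlier success.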
However, after the CRI-O installation has started, I don't see the message "Configuring CRI-O for GKE" in the same log, which should appear given that the GKE metadata server answered with HTTP 200 in the init container:
I'm not observing a direct correlation with the GKE configuration, as the message "Configuring CRI-O for GKE" still appears on some of the auto-scaled nodes. This leads me to believe the issue might not be tied to the GKE configuration itself. Any insights or suggestions would be greatly appreciated.