Closed — fbozic closed this issue 1 year ago
Thanks for reporting this @fbozic. The fix for the issue should be part of today's releases: https://github.com/kubernetes/kops/releases/tag/v1.24.5 https://github.com/kubernetes/kops/releases/tag/v1.25.3
That was really fast @hakman, thank you.
/kind bug
I'm trying to set up a new cluster on GCE with cilium-etcd networking.
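For context, enabling the etcd-backed Cilium kvstore in a kops cluster spec looks roughly like this (a minimal sketch based on the kops Cilium documentation; the cluster name, zone, and instance group names are illustrative placeholders, not taken from this report):

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: my-fake-name.k8s.local
spec:
  cloudProvider: gce
  etcdClusters:
    - name: main
      etcdMembers:
        - name: a
          instanceGroup: master-us-central1-a
    - name: events
      etcdMembers:
        - name: a
          instanceGroup: master-us-central1-a
    - name: cilium          # dedicated etcd cluster used as the Cilium kvstore
      etcdMembers:
        - name: a
          instanceGroup: master-us-central1-a
  networking:
    cilium:
      etcdManaged: true     # point Cilium at the kops-managed cilium etcd cluster
```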
1. What kops version are you running? The command `kops version` will display this information.
Client version: 1.25.2 (git-v1.25.2)
2. What Kubernetes version are you running? `kubectl version` will print the version if a cluster is running or provide the Kubernetes version specified as a `kops` flag.
3. What cloud provider are you using?
gce
4. What commands did you run? What is the simplest way to reproduce this issue?
kops create -f kops.yaml
kops update cluster --name my-fake-name.k8s.local --yes
kops export kubecfg --admin
kops validate cluster --wait 10m
5. What happened after the commands executed?
After 10 minutes the cluster becomes healthy (from the kops point of view), but cilium pods running on worker nodes keep restarting due to a timeout while retrieving the initial list of objects from the kvstore. Cilium pods running on master nodes do not experience the same issue.
6. What did you expect to happen?
Cilium pods running on worker nodes should not continuously restart.
7. Please provide your cluster manifest. Execute `kops get --name my.example.com -o yaml` to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the `-v 10` flag. Paste the logs into this report, or in a gist and provide the gist link here.
During cluster provisioning, everything looks okay in the kops output, so I will not attach it. Cilium pod status after ~1 hour.
Logs from `cilium-9pdwk` contain:
msg="Unable to initialize local node. Retrying..." error="timeout while retrieving initial list of objects from kvstore"
9. Anything else we need to know?
I've inspected the `node-to-master-<>` firewall rule and port 4003 is not included. I added port 4003 to the firewall rule by hand. After that, pods on worker nodes stopped restarting and there were no more error messages in the logs.
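For anyone hitting the same symptom before upgrading to a fixed kops release, the manual workaround can be sketched with `gcloud` roughly as follows (the firewall rule name and the existing allow-list are placeholders; check your actual rule first, since `--allow` replaces the entire list rather than appending to it):

```shell
# List firewall rules to find the actual node-to-master rule name and its
# current allow-list (names below are illustrative).
gcloud compute firewall-rules list --filter="name~node-to-master"

# Re-apply the rule with tcp:4003 (the cilium etcd client port) added.
# <existing-allows> must be replaced with the ports the rule already permits.
gcloud compute firewall-rules update node-to-master-my-fake-name-k8s-local \
  --allow <existing-allows>,tcp:4003
```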