Also seeing this after upgrading a 1.21 cluster to 1.22. Our 1.21 cluster has the expected number of endpoints, but the 1.22 cluster does indeed have additional endpoints. We are experiencing the same issue with timeouts, which causes extremely slow pod startup due to network timeouts while setting up the sandbox.
@fvasco your issue above got me on the right track to fix my problem. I had been able to reproduce the timeouts but hadn't quite figured out why yet, as everything looked okay on the cluster and infra.
I had 5 control plane nodes listed in etcd, which is where the kubernetes default svc populates the endpoints.
During my upgrade from 1.21 to 1.22, the etcd upgrade was stuck going to 3.5 and I terminated all 3 masters and let them come back (effectively performing the etcd restore process). This allowed etcd to proceed but then I was hit with the random network failures.
Following the docs to find and delete the additional master leases in etcd resolved the issue for me.
Happy to hear that, @erismaster. Can you share with us some useful links or the steps to delete the stale master entries from our cluster? That would greatly improve our situation.
Thank you again, Francesco
I can confirm that we detected this issue after a failed upgrade from kops 1.21.2 to 1.22.1; a full rollback did not fix it for us.
As far as I remember, our endpoint list has contained a lot of stale IPs for a while, but it was not an issue, or at least not a noticeable one.
This issue is pretty hard to pin down because the failures are fairly random: changing the CNI, the availability zone, or some other component can work or not work in a random fashion for each node, and working nodes can start failing over time.
The guide https://kops.sigs.k8s.io/operations/etcd_administration/ does not work for us; many of the pod commands don't work.
We logged into the right container using
CONTAINER=$(kubectl get pods -n kube-system | grep etcd-manager-main | head -n 1 | awk '{print $1}')
kubectl exec -it -n kube-system "$CONTAINER" -- bash
then we deleted the stale IPs using commands like:
# DIRNAME=/opt/etcd-v3.4.13-linux-amd64/etcd
# ETCDCTL_API=3
# alias etcdctl='$DIRNAME/etcdctl --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key --endpoints=https://127.0.0.1:4001'
# etcdctl get --prefix /registry/masterleases
# etcdctl del /registry/masterleases/172.31.20.192
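For reference, a rough cleanup sketch along the same lines (my own illustration, not taken from the kOps docs): it assumes DIRNAME is set as above inside the etcd-manager-main container, and that LIVE_IPS is filled in with the internal IPs of the control-plane nodes that actually exist, so only stale leases get deleted.

```bash
LIVE_IPS="172.31.20.192"   # example value; replace with your real control-plane internal IPs

ETCDCTL="$DIRNAME/etcdctl \
  --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt \
  --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
  --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key \
  --endpoints=https://127.0.0.1:4001"

# Walk every masterlease key and delete the ones whose IP is not in LIVE_IPS.
for KEY in $($ETCDCTL get --prefix /registry/masterleases --keys-only); do
  IP=${KEY##*/}
  case " $LIVE_IPS " in
    *" $IP "*) echo "keeping  $KEY" ;;
    *)         echo "deleting $KEY"; $ETCDCTL del "$KEY" ;;
  esac
done
```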
We keep looking at this page to understand if an upgrade to kops 1.22 is safe for us.
I suspect you are hitting https://github.com/kubernetes/kubernetes/issues/86812 This problem is also mentioned here: https://kops.sigs.k8s.io/operations/troubleshoot/#api-server-hangs-after-etcd-restore
We know this often happens when recovering from a backup, but we have not seen it when upgrading etcd, even though upgrading etcd is not too different from a backup restore.
I am afraid there is not much we can do from the kOps side here.
Is it possible to remove the master IPs proactively during a cluster rolling update? Any terminated master IP could be removed from the pool immediately after the node is removed, without waiting for the timeout.
I don't think that is exactly what's happening here. Upgrading etcd is something along the lines of "take a backup, create a new cluster from the backup, and when all nodes are upgraded, hot-swap the etcd cluster". So the backup probably contains the old leases, the new apiservers write new leases, and the old leases are never cleaned up.
So kOps could add something to kops-controller that cleans up old leases from etcd. But I really do not understand why they are considered valid... and the mentioned bug has been there for a long time without any real fix.
We found these entries in etcd
# etcdctl get --prefix /registry/masterleases
/registry/masterleases/172.31.20.192
(binary-encoded v1 Endpoints value for 172.31.20.192)
/registry/masterleases/172.31.23.72
(binary-encoded v1 Endpoints value for 172.31.23.72)
/registry/masterleases/172.31.31.210
(binary-encoded v1 Endpoints value for 172.31.31.210)
and so on; many of them are invalid. This bug affected our production environment for a week: masters and nodes weren't able to communicate, health checks failed, and our services were down. I understand that handling this in kOps would be a workaround; unfortunately, the issue is hard to detect and can hurt the entire cluster, so I hope it can be given appropriate consideration.
What kind of lease TTL do you see on the incorrect items? Technically these should have timed out long ago.
Yes @olemarkus, they should have expired days ago. We permanently deleted them, so I don't know how to check the TTLs now.
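For anyone hitting this later, a hedged sketch of how the TTL could be inspected while the stale entries still exist (it reuses the etcdctl alias and flags from earlier in this thread; the key and lease IDs below are purely illustrative):

```bash
etcdctl get /registry/masterleases/172.31.20.192 -w json
# The "lease" field in the JSON output is the lease ID in decimal; convert it to hex:
printf '%x\n' 1234567890123456789          # replace with the decimal lease ID you saw
# Then ask etcd how much time is left on that lease (and its granted TTL):
etcdctl lease timetolive 112210f47de98115  # replace with the hex ID printed above
```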
Can you add your input to the upstream issue? Meanwhile, I can try to reproduce this, but it may not be that easy.
Thank you for your effort, @olemarkus! As I said, it is possible the issue was already present from previous upgrades.
However, I have to admit that I am not a Kubernetes champion, so I reached this issue the hard way (by digging into a Linux networking problem).
In our experience, in the absence of a better proposal, an informational message in kops validate cluster would be very helpful.
A cluster with more Kubernetes endpoints than configured masters isn't really valid.
So, I can imagine the rule as (a rough shell sketch of such a check follows below):
IF kubernetes endpoints count > configured master count
PRINT "kubernetes" service contains too many endpoints, this can cause connection issues to the kubernetes service host:port. If the problem persists, see permalink
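A minimal shell sketch of that rule, assuming kubectl access and that control-plane nodes carry the node-role.kubernetes.io/control-plane label (older clusters may use node-role.kubernetes.io/master); this only illustrates the proposed check, it is not something kops validate cluster does today:

```bash
ENDPOINT_COUNT=$(kubectl get endpoints kubernetes \
  -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}' | wc -l)
MASTER_COUNT=$(kubectl get nodes -l node-role.kubernetes.io/control-plane -o name | wc -l)

if [ "$ENDPOINT_COUNT" -gt "$MASTER_COUNT" ]; then
  echo "\"kubernetes\" service contains too many endpoints ($ENDPOINT_COUNT > $MASTER_COUNT);"
  echo "this can cause connection issues to the kubernetes service host:port"
fi
```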
Moreover, I suggest modifying the first sentence of "API Server hangs after etcd restore" to: "After resizing an etcd cluster, restoring a backup, or updating kOps"
Finally, if this issue can occur during manual kOps operations in general, a tool like kops toolbox cleanup-etcd could greatly improve the user experience.
I have the same issue: after upgrading a cluster from 1.21 to 1.22, my kubernetes endpoint has more servers than configured masters.
I want to apply the troubleshooting steps to delete the masterleases, but my etcd-manager-main pod has different certificates configured:
# ls /rootfs/etc/kubernetes/pki/
etcd-manager-events etcd-manager-main
and the etcdctl connection fails with these certificates:
# ./etcdctl --cacert=/rootfs/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt --cert=/rootfs/etc/kubernetes/pki/etcd-manager-main/etcd-manager-client-etcd-a.crt --key=/rootfs/etc/kubernetes/pki/etcd-manager-main/etcd-manager-client-etcd-a.key --endpoints=https://127.0.0.1:4001 del --prefix /registry/masterleases/
{"level":"warn","ts":"2021-11-05T07:05:01.755Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002e0a80/#initially=[https://127.0.0.1:4001]","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: authentication handshake failed: remote error: tls: bad certificate\""}
Finally, I found the kube-apiserver certificates on the control-plane machine, and the troubleshooting steps worked perfectly.
Hi, @olemarkus, any news on this issue?
Per Office Hours, will cut a new etcd-manager and cherrypick it to 1.22 branch.
Great news, @johngmyers, we hope it will fix the issue!
> Per Office Hours, will cut a new etcd-manager and cherrypick it to 1.22 branch.
Was this fixed in kops v.1.22.2?
In my case it was not; I had to apply the troubleshooting steps from the control-plane machine rather than from the etcd-manager-main pod, because the kube-apiserver certificates are not present in the pod (or I couldn't find them).
Yes, this should have been fixed in 1.22.2, at least for the known causes of why this happens.
The kube-apiserver certificates are not present in the etcd-manager pods. But you can connect using the certificates that should be in /etc/kubernetes/pki.
I've also spent several hours battling this problem yesterday - very happy to have found this thread, but it took me a while to find the right certificates to use, as I guess the naming has changed.
I went into the etcd-manager-main pod - the certs were there for me in /etc/kubernetes/pki/etcd-manager. This was the right combination of paths and certs for me:
alias etcdctl='$DIRNAME/etcdctl --cacert=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.crt --cert=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.crt --key=/etc/kubernetes/pki/etcd-manager/etcd-clients-ca.key --endpoints=https://127.0.0.1:4001'
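Usage is then the same as earlier in the thread, assuming DIRNAME points at the etcdctl binary shipped inside the pod (the exact version directory may differ):

```bash
etcdctl get --prefix /registry/masterleases --keys-only
etcdctl del /registry/masterleases/172.31.20.192   # example key; delete only the stale IPs you actually see
```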
If the troubleshooting docs on this are incorrect, can you do a PR to update them with the correct paths?
/kind bug
1. What kops version are you running?
2. What Kubernetes version are you running?
3. What cloud provider are you using? AWS
Hello, we got a connectivity issue with our pods.
We currently see too many IPs in the kubernetes endpoints; many of them refer to terminated masters.
The iptables nat table lists all of them.
The first IP is currently unavailable, and we detect errors when a pod is starting:
Get https://100.64.0.1:443/api?timeout=32s: dial tcp 100.64.0.1:443: i/o timeout
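One way to see what the nat table holds for that service is sketched below (a hedged illustration, assuming an iptables-mode kube-proxy; the KUBE-SVC chain name is the hash kube-proxy normally generates for default/kubernetes:https, but read it off the KUBE-SERVICES output if yours differs):

```bash
# endpoints currently published for the in-cluster kubernetes service
kubectl get endpoints kubernetes -o wide

# NAT rules behind the service VIP (100.64.0.1 in our cluster)
sudo iptables -t nat -L KUBE-SERVICES -n | grep 100.64.0.1
sudo iptables -t nat -L KUBE-SVC-NPX46M4PTMTKRN6Y -n
```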
Is it possible to optimize the service and shrink the list to the available masters only?
7. Please provide your cluster manifest.
Thank you in advance for any help, Francesco