etcd-manager restore leads to incorrect addresses in "kubernetes" endpoint

trondhindenes commented 5 years ago

We're testing out a procedure for a full master refresh using kops/etcd-manager (described here: https://hindenes.com/2019-08-09-Kops-Restore/). In short, we wipe the masters, let kops set up new masters, and use etcd-manager-ctl to restore the last known backup. This seems to work very well.

However, we're noticing that in-cluster apps that need access to the Kubernetes API sometimes fail. This seems to be because old (deleted) masters are still present in the "kubernetes" endpoint (kubectl -n default get endpoints kubernetes -o=yaml).
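For anyone wanting to confirm the same symptom, here is a minimal sketch of how the stale addresses could be picked out. It assumes you save the endpoint as JSON (the same kubectl command with -o=json) and that you already know the IPs of the masters that currently exist. The function names and the sample IPs are made up for illustration.

```python
import json
from typing import Iterable, List


def endpoint_ips(endpoints_json: str) -> List[str]:
    """Return every IP listed under subsets[].addresses[] of an Endpoints object."""
    doc = json.loads(endpoints_json)
    ips = []
    for subset in doc.get("subsets") or []:
        for addr in subset.get("addresses") or []:
            ips.append(addr["ip"])
    return ips


def stale_ips(endpoints_json: str, live_master_ips: Iterable[str]) -> List[str]:
    """Return endpoint IPs that do not belong to any currently running master."""
    live = set(live_master_ips)
    return [ip for ip in endpoint_ips(endpoints_json) if ip not in live]


# Hypothetical sample: two live masters plus one address left over from a wiped master.
SAMPLE = json.dumps({
    "kind": "Endpoints",
    "metadata": {"name": "kubernetes", "namespace": "default"},
    "subsets": [{
        "addresses": [{"ip": "10.0.1.10"}, {"ip": "10.0.2.10"}, {"ip": "10.0.3.99"}],
        "ports": [{"name": "https", "port": 443, "protocol": "TCP"}],
    }],
})

if __name__ == "__main__":
    print(stale_ips(SAMPLE, ["10.0.1.10", "10.0.2.10"]))
```

Any address this reports is one that in-cluster clients may still be routed to even though no API server is listening there.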

This is probably not an etcd-manager problem at all, but I'm at a loss regarding how to get rid of references to old (non-existing) masters, so any pointers would be deeply appreciated.

trondhindenes commented 5 years ago

Also probably worth mentioning: if I do a "regular" kops rolling-upgrade that replaces master nodes, we're not seeing the problem with left-behind master IP addresses. It only happens if we do an etcd-manager restore.

dzoeteman commented 5 years ago

I've added some documentation in #251 on how to solve this issue. Basically, if a master doesn't get deleted normally, its IP will stick around under /registry/masterleases in etcd. This is more of a Kubernetes thing than a kops or etcd-manager thing IMO. We could add a restore step that fixes this automatically, but it doesn't seem quite right to have a manager edit data. Might be nice to have it in kops though (kops restore)?
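As a rough illustration of the manual cleanup (the documentation in #251 is the reference, not this), the sketch below picks out leftover lease keys and builds the matching etcdctl delete commands. It assumes each lease is stored as /registry/masterleases/<ip>, that the keys were listed with the etcdctl v3 API (etcdctl get --prefix --keys-only /registry/masterleases/), and that you supply the live master IPs yourself. Connection and TLS flags are left out because they depend on the cluster, and the helper names are hypothetical.

```python
from typing import Iterable, List

# Assumed key layout: one lease per API server, keyed by its IP.
MASTER_LEASE_PREFIX = "/registry/masterleases/"


def stale_lease_keys(lease_keys: Iterable[str], live_master_ips: Iterable[str]) -> List[str]:
    """Return lease keys whose IP suffix is not one of the live master IPs."""
    live = set(live_master_ips)
    stale = []
    for key in lease_keys:
        key = key.strip()
        if not key.startswith(MASTER_LEASE_PREFIX):
            continue  # ignore blank lines and unrelated keys
        ip = key[len(MASTER_LEASE_PREFIX):]
        if ip and ip not in live:
            stale.append(key)
    return stale


def delete_commands(keys: Iterable[str]) -> List[List[str]]:
    """Build (but do not run) one 'etcdctl del' argument list per stale key."""
    return [["etcdctl", "del", key] for key in keys]


if __name__ == "__main__":
    # Hypothetical key listing as it might come back from etcd.
    listed = [
        "/registry/masterleases/10.0.1.10",
        "/registry/masterleases/10.0.3.99",
    ]
    for cmd in delete_commands(stale_lease_keys(listed, ["10.0.1.10"])):
        print(" ".join(cmd))
```

The commands are only printed so they can be reviewed before anything is deleted from etcd.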

Interested to hear what others think.

kopeio / etcd-manager

etcd-manager restore leads to incorrect addresses in "kubernetes" endpoint #248