Workarounds for the time before kubeadm HA becomes available

mbert commented 6 years ago

The planned HA features in kubeadm are not going to make it into v1.9 (see #261). So what can be done to make a cluster setup by kubeadm sufficiently HA?

This is what it looks like now:

Worker nodes can be scaled up to achieve acceptable redundance.
Without a working active/active or at least active/passive master setup, master failures are likely to cause significant disruptions.

Hence an active/active or active/passive master setup needs to be created (i.e. mimic what kubeadm would supposedly be doing in the futue):

Replace the local etcd pod by an etcd cluster of min. 2 x number-of-masters size. This cluster could running on the OS rather than in K8s.
Set up more master instances. That's the interesting bit. The Kubernetes guide for building HA clusters (https://kubernetes.io/docs/admin/high-availability/) can be of help to understand what needs to be done. Here I'd like to have simple step-by-step instructions taking into consideration kubeadm-setup particularities in the end.
Not sure whether this is necessary: Probably set up haproxy/keepalived on the master hosts, move the original master's IP address plus SSL termination to it.

This seems achievable if converting the existing master instance to a cluster of masters (2) can be done (the Kubernetes guide for building HA clusters seems to indicate so). Active/active would be not more expensive than active/passive.

I am currently working on this. If I succeed I shall share what I find out here.

petergardfjall commented 6 years ago

@jamiehannaford The problem when using Amazon's ELB is that it doesn't provide a single, stable IP address, so there is no such LB IP that I can make use of (see https://stackoverflow.com/a/35317682/7131191).

So for now the workers join via the ELB's FQDN, which will forward it to one of the apiservers, which, since it advertises its own IP address, makes the worker configure its kubelet to use that IP address (and not the ELB FQDN). Therefore, to make sure that the kubelet goes through the apiserver load-balancer the kubelet.conf needs to be patched afterwards with the ELB FQDN and the kubelet restarted.

discordianfish commented 6 years ago

I've just open sourced out stab on HA kubeadm. Comes with a few caveats and ugly workaround (especially the kube-proxy hack is ugly). But it works: https://github.com/itskoko/kubecfn

mbert commented 6 years ago

I have done some work on the HA setup guide on google docs:

took over some changes proposed in comments
support for 'calico' network plugin
took over etcd SSL setup from the official guide document written by @jamiehannaford

Those changes have been implemented in my ansible-based automation of the described process, plus some more:

automatic setup of etcd-operator for applications running in the cluster (not the cluster itself)
prefetching of images needed for Kubernetes operation and copying them to the cluster hosts
Dashboard setup (insecure, without SSL) with port 30990 on NodePort (if no LB is configured)

petergardfjall commented 6 years ago

I've published the kubeadm-based HA kubernetes installer script I've been working on lately. It will hopefully put my prior comments into context and serve as one concrete example of how to automate the steps of @jamiehannaford's HA guide, which it follows fairly closely.

It's a python script that executes in two phases: render which creates "cluster assets" in the form of SSH keys, certs, and bootscripts, and an install phase which executes those bootscripts over SSH.

The scripts have been tried out on a local Vagrant cluster and against AWS. Two "infrastructure provider scripts" are included in the repo (vagrant and AWS via Terraform) to provision the necessary cluster load-balancer and VMs.

Feel free to try it out. https://github.com/elastisys/hakube-installer

mbert commented 6 years ago

I have not yet found a way to upgrade a HA cluster installed using kubeadm and the manual steps described in my HA setup guide on google docs.

What I have tried so far is the following:

Shut down keepalived on the secondary masters, run kubeadm upgrade on the primary master, apply the same changes in /etc/kubernetes/manifests on the secondary masters as there were on the primary master and start keepalived on the secondary masters.
Same like (1), but in addition to keepalived, also shut down (and later start) kubelet and docker on the secondary masters.
Same like (2), but before applying the upgrade on the primary master, cordon (and later uncordon) all secondary masters.

This did not work, and the result was pretty much the same in all cases. What I get in the secondary masters' logs looks like this:

Unable to register node "master-2.mylan.local" with API server: nodes "master-2.mylan.local" is forbidden: node "master-1.mylan.local" cannot modify node "master-2.mylan.local"

Failed to update status for pod "kube-apiserver-master-2.mylan.local_kube-system(6d84ab47-0008-11e8-a558-0050568a9775)": pods "kube-apiserver-master-2.mylan.local" is forbidden: node "master-1.mylan.local" can only update pod status for pods with spec.nodeName set to itself

Failed to update status for pod "kube-controller-manager-master-2.mylan.local_kube-system(665da2db-0008-11e8-a558-0050568a9775)": pods "kube-controller-manager-master-2.mylan.local" is forbidden: node "master-1.mylan.local" can only update pod status for pods with spec.nodeName set to itself

Failed to update status for pod "kube-scheduler-master-2.mylan.local_kube-system(65c6a0b3-0008-11e8-a558-0050568a9775)": pods "kube-scheduler-master-2.mylan.local" is forbidden: node "master-1.mylan.local" can only update pod status for pods with spec.nodeName set to itself

Failed to update status for pod "kube-flannel-ds-ch8gq_kube-system(47cccaea-0008-11e8-b5b5-0050568a9e45)": pods "kube-flannel-ds-ch8gq" is forbidden: node "master-1.mylan.local" can only update pod status for pods with spec.nodeName set to itself

Failed to update status for pod "kube-proxy-htzg7_kube-system(47cc9d00-0008-11e8-b5b5-0050568a9e45)": pods "kube-proxy-htzg7" is forbidden: node "master-1.mylan.local" can only update pod status for pods with spec.nodeName set to itself

Deleting mirror pod "kube-controller-manager-master-2.mylan.local_kube-system(665da2db-0008-11e8-a558-0050568a9775)" because it is outdated

Failed deleting a mirror pod "kube-controller-manager-master-2.mylan.local_kube-system": pods "kube-controller-manager-master-2.mylan.local" is forbidden: node "master-1.mylan.local" can only delete pods with spec.nodeName set to itself

Failed creating a mirror pod for "kube-controller-manager-master-2.mylan.local_kube-system(78432ebfe5d8dfbb93f8173decf3447e)": pods "kube-controller-manager-master-2.mylan.local" is forbidden: node "master-1.mylan.local" can only create pods with spec.nodeName set to itself

[... and so forth, repeats itself ...]

Has anybody got a hint how to proceed in getting the secondary masters upgraded cleanly?

jamiehannaford commented 6 years ago

@mbert This seems like an RBAC issue. Did you ensure the node name matches the hostname-override?

Also, did you reset etcd for each step? That probably explains why you saw the same result.

mbert commented 6 years ago

@jamiehannaford I am not using any hostname override, neither in kubelet nor in the kubeadm init configuration. And, yes, I am resetting etcd, i.e. tear down the cluster, install a new one from the scratch, then try to upgrade it.

I'll give setting a hostname-override for kubelet a shot and see whether this leads to any other result.

mbert commented 6 years ago

It seems like setting hostname-override when setting up the cluster helps, i.e., makes the secondary masters upgradable. Once this has become a standardised procedure I will document it in the HA setup guide in google docs.

andybrucenet commented 6 years ago

Hi @mbert and others - From the past year or so, I have several k8s clusters (kubeadm and otherwise) driven from Cobbler / Puppet on CoreOS and CentOS. However, none of these has been HA.

My next task is to integrate K8s HA and I want to use kubeadm. I'm unsure whether to go with the @mbert's HA setup guide or @jamiehannaford's HA guide.

Also - this morning I read @timothysc's Proposal for a highly available control plane configuration for ‘kubeadm’ deployments. and I like the "initial etcd seed" approach he outlines. However, I don't see that same approach in either @mbert or @jamiehannaford's work. @mbert appears to use a single, k8s-hosted etcd while @jamiehannaford's document documents the classic approach of external etcd (which is exactly what I have used for my other non-HA POC efforts).

What do you all recommend? External etcd, single self-hosted, or locating and using the "seed" etcd (with pivot to k8s-hosted)? If the last - what guide or documentation do you suggest?

TIA!

jamiehannaford commented 6 years ago

@andybrucenet External etcd is recommended for HA setups (at least at this moment in time). CoreOS has recently dropped support for any kind of self-hosted. It should only really be used for dev, staging or casual clusters.

mbert commented 6 years ago

@andybrucenet Not quite - I am using an external etcd cluster just like @jamiehannaford proposes in his guide. Actually the approaches described in our respective documents should be fairly similar. It is based on setting up the etcd cluster you feel you need and then have kubeadm use it when bootstrapping the Kubernetes cluster.

I am currently more or less about to finish my guide and the ansible-based implementation by documenting and implementing a working upgrade procedure - that (and some bugfixes) should be done sometime next week.

Not quite sure whether there will be any need to further transfer my guide into yours, @jamiehannaford what do you think?

mbert commented 6 years ago

Actually the hostname-override was unnecessary. When running kubeadm upgrade apply, some default settings overwrite my adaptations, e.g. NodeRestriction gets re-activated (also my scaling of Kube DNS instances gets reset, but this was of course not a show stopper here). Patching the NodeRestriction admission rule out of /etc/kubernetes/manifests/kube-apiserver.yaml did the trick.

mbert commented 6 years ago

I have now written a chapter on upgrading HA clusters to my HA setup guide.

Also I have added code for automating this process to my ansible project on github. Take a look into the README.md file there for more information.

mattkelly commented 6 years ago

@mbert for the upgrade process you've outlined, what are the exact reasons for manually copying the configs and manifests from /etc/kubernetes on the primary master to the secondary masters rather than simply running kubeadm upgrade apply <version> on the secondary masters as well?

mbert commented 6 years ago

@mattkelly It seemed rather dangerous to me. Since the HA cluster's masters use an active/passive setup, but kubeadm knows about only one master I found running it again on a different master risky. I may be wrong though.

mbert commented 6 years ago

Replying to myself: Having looked at Jamie's guide on kubernetes.io, running kubeadm on the masters may work, even when setting up the cluster. I'll try this out next week and probably make some changes to my documents accordingly.

mattkelly commented 6 years ago

FWIW, running kubeadm on the secondary masters seems to have worked just fine for me (including upgrade) - but I need to better understand the exact risks at each stage. I've been following @jamiehannaford's guide which is automated by @petergardfjall's hakube-installer (no upgrade support yet though, so I tested that manually).

Edit: Also important to note is that I'm only testing on v1.9+. Upgrade was from v1.9.0 to v1.9.2.

mbert commented 6 years ago

I have now followed the guide on kubernetes.io that @jamiehannaford created, i.e. ran kubeadm init on the all master machines (after having copied /etc/kubernetes/pki/ca.* to the secondary masters). This works just fine for setting up the cluster. In order to be able to upgrade to v1.9.2 I am setting up v1.8.3 here.

Now I am running into trouble when trying to upgrade the cluster: Running kubeadm upgrade apply v1.9.2 on the first master fails:

[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests872757515/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests872757515/kube-scheduler.yaml"
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests647361774/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]

This step fails reproducably (I always start from the scratch, i.e. remove all configuration files plus etcd data from all nodes before starting a new setup).

I tried out several variations, but no success:

Have kubelet use the local API Server instance or the one pointed to by the virtual IP
have kube-proxy use the local API Server instance or the one pointed to by the virtual IP

I have attached some logs. However I cannot really find any common pattern that would explain this problem to me. Maybe it is something I just don't know?

upgrade-failed-proxy-on-vip.log upgrade-failed-proxy-and-kubelet-on-vip.log upgrade-failed-proxy-and-kubelet-on-local-ip.log

mbert commented 6 years ago

Having tried out another few things it boils down to the following:

Updating the master which was setup last (i.e. the one on which kubeadm init was run last when setting up the cluster) works.
I can get the other nodes working, too, if I edit configmap/kubeadm-config and change the value for MasterConfiguration.nodeName in there to the respective master's host name or simply delete that line.

Others like @mattkelly have been able to perform the upgrade without editing configmap/kubeadm-config, hence the way I set things up must be somehow different.

Anybody got a clue what I should change, so that upgrading works without this (rather dirty) trick?

I have tried upgrading from both 1.8.3 and 1.9.0 to 1.9.2, with the same result.

mattkelly commented 6 years ago

@mbert I'm now reproducing your issue from a fresh v1.9.0 cluster created using hakube-installer. Trying to upgrade to v1.9.3. I can't think of anything that has changed with my workflow. I'll try to figure it out today.

I verified that deleting the nodeName line from configmap/kubeadm-config for each subsequent fixes the issue.

mbert commented 6 years ago

Thank you, that's very helpful. I have now added patching configmap/kubeadm-config to my instructions.

mattkelly commented 6 years ago

@mbert oops, I figured out the difference :). For previous upgrades I had been providing the config generated during setup via --config (muscle memory I guess). This is why I never needed the workaround. I believe that your workaround is more correct in case the cluster has changed since init time. It would be great to figure out how to avoid that hack, but it's not too bad in the meantime - especially compared to all of the other workarounds.

ReSearchITEng commented 6 years ago

Hello, Will kubeadm 1.10 remove any of the pre-steps/workarounds currently required for HA in 1.9 ? E.g. the manual creation of a bootstrap etcd, generation of etcd keys, etc?

timothysc commented 6 years ago

Closing this item as 1.10 doc is out and we will be moving to further the HA story in 1.11

/cc @fabriziopandini

kubernetes / kubeadm

Workarounds for the time before kubeadm HA becomes available #546