Closed mbert closed 6 years ago
See also https://github.com/cookeem/kubeadm-ha - this seems to cover what I want to achieve here.
@mbert we started implementing the HA features and chopped wood on the underlying dependency stack now in v1.9, but it's a short cycle for a big task, so the work will continue in v1.10 as you pointed out.
For v1.9, we will document what you're describing here in the official docs though; how to achieve HA with external deps like setting up a LB
Excellent. I am digging through all this right now. I am currently stuck at bootstrapping masters 2 and 3, in particular how to configure kubelet and the apiserver (how much can I reuse from master 1?) and etcd (I am thinking of using a bootstrap etcd on a separate machine for discovery). The guide from the docs is a bit terse when it comes to this.
@mbert I have been following your comments here and I just want to let you know I followed the guide in docs and was able to stand up a working HA k8s cluster using kubeadm (v1.8.x).
If you are following this setup and you need to bootstrap master 2 and 3, you can reuse almost everything from the first master. You then need to fix up the following configuration files on master 2 and 3 to reflect the current host: /etc/kubernetes/manifests/kube-apiserver.yaml, /etc/kubernetes/kubelet.conf, /etc/kubernetes/admin.conf, and /etc/kubernetes/controller-manager.conf
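For illustration, fixing up those files mostly amounts to swapping master 1's address for the current host's address. A rough sketch (the IP values are placeholders, not from this thread):

```shell
# Placeholder addresses -- substitute your actual master 1 IP and this host's IP.
MASTER1_IP="10.0.0.11"
THIS_IP="10.0.0.12"

# Rewrite master 1's address in each of the copied configuration files.
for f in /etc/kubernetes/manifests/kube-apiserver.yaml \
         /etc/kubernetes/kubelet.conf \
         /etc/kubernetes/admin.conf \
         /etc/kubernetes/controller-manager.conf; do
  sudo sed -i "s/${MASTER1_IP}/${THIS_IP}/g" "$f"
done
```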
Regarding etcd, if you follow this guide in the docs you should stand up an external 3-node etcd cluster that spans the 3 k8s master nodes.
There is also one 'gotcha' item that has NOT yet been covered in the guide. You can see this issue for details: https://github.com/cookeem/kubeadm-ha/issues/6
I also asked a few questions related to kubeadm HA in this post: https://github.com/cookeem/kubeadm-ha/issues/7
I really hope you can give me some thoughts on these.
Thank you in advance for your time.
This is great - definitely needed, as I am sure 99% of kubeadm users have a nagging paranoia in the back of their heads about the HA of their master(s).
@kcao3 thank you. I will look into this all on coming Monday. So I understand that it is OK to use identical certificates on all three masters?
If yes, I assume the next thing I'll try will be bringing up kubelet and the apiserver on masters 2 and 3 using the configuration from master 1 (with modified IPs and host names, of course) and then bootstrapping the etcd cluster by putting a modified etcd.yaml into /etc/kubernetes/manifests.
Today I ran into problems because the running etcd on master 1 already had cluster information in its data dir, which I had to remove first, but I was still running into problems. I guess a few good nights of sleep will help.
Once I've got this running I shall document the whole process and publish it.
@srflaxu40 yep, and in particular if you have an application that indirectly requires apiserver at runtime (legacy application and service discovery in my case) you cannot afford to lose the only master at any time.
I have been able to replace the single etcd instance with a cluster in a fresh K8s cluster. The steps are roughly these:
Step 5 is somewhat awkward, and I have found that if I miss the right time here or need too much time to get the other two masters to join (step 6) my cluster gets into a state from which it can hardly recover. When this happened, the simplest solution I found was to shut down kubelet on master 2 and 3, run kubeadm reset on all masters and minions, clear the /var/lib/etcd directories on all masters and set up a new cluster using kubeadm init.
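The tear-down sequence described above could be sketched like this (a rough outline of the steps named in the comment; run each part on the hosts indicated):

```shell
# On masters 2 and 3: stop kubelet so they no longer interfere.
sudo systemctl stop kubelet

# On ALL masters and minions: wipe the kubeadm-generated state.
sudo kubeadm reset

# On ALL masters: clear the etcd data left over from the broken cluster.
sudo rm -rf /var/lib/etcd/*

# On master 1 only: bootstrap a fresh cluster.
sudo kubeadm init
```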
While this works, I'd be interested in possible improvements: Is there any alternative, more elegant and robust approach to this (provided that I still want to follow the approach of running etcd in containers on the masters)?
This comment aims to collect feedback and hints at an early stage. I will post updates on the next steps in a similar way before finally documenting this as a followable guide.
@mbert Why don't you use an independent etcd cluster instead of creating one inside k8s?
@KeithTt Thank you for your feedback. I was thinking about these here:
If an independent etcd cluster's advantages outweigh the above list, I shall be happy to be convinced otherwise.
@mbert Please make sure you sync with @jamiehannaford on this effort, he's also working on this / committed to making these docs a thing in v1.9
@mbert are you available to join our SIG meeting today 9PT or the kubeadm implementation PR tomorrow 9PT? I'd love to discuss this with you in a call :+1:
@luxas actually it was @jamiehannaford who asked me to open this issue. Once I have got things running and documented I hope to get lots of feedback from him. 9PT, that's in an hour, right? That would be fine. Just let me know how to connect with you.
Following guides here and there I managed to do it. Here are my final steps:
/cc @craigtracey
@mbert
Created - not converted - a 3-master-node cluster using kubeadm, with a 3-node etcd cluster deployed on Kubernetes
Here's what I needed to do:
Problems:
The way I did it was using kubeadm alpha phase steps; a short list follows:
on all master nodes:
on masternode1:
This is a really short list of what I did, and it can be automated and reproduced in 5 minutes. Also, for me the greatest bonus was that I was able to set a non-standard pod-network CIDR, as I had the restriction of not being able to spare a class B IP address range.
If you're interested in more detailed version, please let me know and I'll try and create some docs on how this was done.
@dimitrijezivkovic thank you for your comment. I think it would make sense to put all the relevant information together so that one piece of documentation comes out.
I plan to set up a Google Docs document and start documenting what I did (which is pretty bare-bones). I would then invite others to join and contribute extensions, corrections and comments.
I have now "documented" a very simple setup in form of a small ansible project: https://github.com/mbert/kubeadm2ha
It is of course still a work in progress, but it already allows setting up a multi-master cluster without any bells and whistles. I have tried to keep it as simple as possible, so that by reading it one should be able to find out pretty easily what needs to be done in which order.
Tomorrow I will start writing this up as a simple cooking recipe in a google docs document and invite others to collaborate.
Just to call it out explicitly, there's a bunch of orthogonal issues mashed together in the above conversation/suggestions. It might be useful to break these out separately, and perhaps prioritise some above others:
kubeadm upgrade
support for multi-apiserver/cm-scheduler (varies depending on self-hosted vs non-self-hosted)

Imo the bare minimum we need is etcd durability (or perhaps availability), and the rest can wait. That removes the "fear" factor, while still requiring some manual intervention to recover from a primary master failure (i.e. an active/passive setup of sorts).
I think the details of the rest depend hugely on self-hosted vs "legacy", so I feel like it would simplify greatly if we just decided now to assume self-hosted (or not?) - or we clearly fork the workarounds/docs into those two buckets so we don't confuse readers by chopping and changing.
Aside: One of the challenges here is that just about everything to do with install+upgrade changes if you assume a self-hosted+HA setup (it mostly simplifies everything because you can use rolling upgrades, and in-built k8s machinery). I feel that by continually postponing this setup we've actually made it harder for ourselves to reach that eventual goal, and I worry that we're just going to keep pushing the "real" setup back further while we work on perfecting irrelevant single-master upgrades :( I would rather we addressed the HA setup first, and then worked backwards to try to produce a single-host approximation if required (perhaps by packing duplicate jobs temporarily onto the single host), rather than trying to solve single-host and then somehow think that experience will help us with multi-host.
@mbert I have achieved the HA proposal by generating the certs manually for each node, and without deleting NodeRestriction. I use haproxy+keepalived as the loadbalancer now; maybe lvs+keepalived would be better. I will document the details this weekend and hope to share them with you.
FYI all, @mbert has started working on a great WIP guide for kubeadm HA manually that we'll add to the v1.9 kubeadm docs eventually: https://docs.google.com/document/d/1rEMFuHo3rBJfFapKBInjCqm2d7xGkXzh0FpFO0cRuqg/edit
Please take a look at the doc everyone, and provide your comments. We'll soon-ish convert this into markdown and send as a PR to kubernetes/website.
Thank you @mbert and all the others that are active in thread, this will be a great collaboration!
@mbert / @luxas: that doc doesn't allow comments (for me at least :cry:)
Done, I had the wrong setting in the doc.
@mbert I have a question for you. Following your approach, assuming I have a functioning HA k8s cluster: do you know how to add new k8s masters to this existing cluster? The issue I am facing now is that the certs were generated based on the FIXED set of k8s master hosts at the time the cluster was bootstrapped. This now prevents any new master from joining the cluster. In the kubelet's log on the new master, you would see something like this: "... x509: certificate is valid for 192.168.1.x, 192.168.1.y, 192.168.1.z, not 192.168.1.n" (where .x, .y, .z are the IP addresses of the current masters, and .n is the address of the new master). Do you know how to resolve this issue? Do the master nodes have to use the same certificates in this case?
@kcao3 I am not very familiar with this particular aspect. Maybe @jamiehannaford can tell you more about this?
@kcao3 Each master join will generate TLS assets using the specific IPv4 for that server. The config also accepts additional SANs, which should include the LB IPv4 which sits in front of the masters. I have a HA guide in review, so check that out if you have time.
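As a quick diagnostic (an assumption on my part, not from the guide), you can inspect which SANs a kubeadm-generated apiserver cert actually contains with plain openssl:

```shell
# List the Subject Alternative Names baked into the apiserver certificate;
# the new master's IP (or the LB IP) must appear here, otherwise the kubelet
# will log the x509 "certificate is valid for ..." error quoted above.
# Path assumes the default kubeadm layout.
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text \
  | grep -A1 'Subject Alternative Name'
```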
I have just pushed a new commit to https://github.com/mbert/kubeadm2ha
@mbert I just read the HA guide from @jamiehannaford: https://github.com/jamiehannaford/kubernetes.github.io/blob/3663090ea9b9a29a00c79dd2916e11737ccf1802/docs/setup/independent/high-availability.md. Is it possible that on each master node we have kubeadm generate and sign separate certificates using the same ca.crt and ca.key?
So the only things that need to be copied from the primary master to the secondary masters are the ca.crt and ca.key. With this approach, on each master (primary and secondary) we will run 'kubeadm init' using a kubeadm configuration file generated from a template like the following:
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
kubernetesVersion: v{{ KUBERNETES_VERSION }}
networking:
  podSubnet: {{ POD_NETWORK_CIDR }}
api:
  advertiseAddress: {{ MASTER_VIP }}
apiServerCertSANs:
- {{ MASTER_VIP }}
etcd:
  endpoints:
{% for host in groups['masters'] %}
  - http://{{ hostvars[host]['ansible_default_ipv4']['address'] }}:2379
{% endfor %}
If this approach works, it will allow k8s admins to add any new master to their existing multi-masters cluster down the road.
Any thoughts?
@kcao3 That's what I'm trying to do. I figured out I also need to pre-generate proxy CA cert+keys which are different.
But now when I run kubeadm init on my masters, all components come up properly, but kube-proxy still fails due to authentication issues, even though front-proxy-client.crt is now signed by the same CA on all nodes.
@discordianfish I also ran into auth issues but when deploying Flannel. Wonder if it's related to what you're seeing.
In the meantime I figured out that the 'proxy CA' (front-proxy-*) isn't related to kube-proxy. Still trying to figure out what is going on; it looks though like there is no system:node-proxier role, but I don't know what is supposed to create it.
Since the frontend-proxy stuff was a red herring, I'm starting over with a clean slate now. But it would be great if someone could confirm that it should work to create the CA credentials and just run init on all masters, given the right advertiseAddress, SANs and etcd endpoints of course? Because I'm most worried that kubeadm still somehow generates local secrets that other masters don't know about.
When my masters come up, kube-proxy works at first, but kube-proxy on the last master fails. When I recreated the pods, all of them failed. So running kubeadm init multiple times from different hosts against the same etcd somehow breaks the authentication.
The service account looks correct and has a secret:
$ kubectl -n kube-system get ds kube-proxy -o yaml|grep serviceAccount
serviceAccount: kube-proxy
serviceAccountName: kube-proxy
$ kubectl -n kube-system get sa kube-proxy -o yaml|grep -A1 secrets
secrets:
- name: kube-proxy-token-5ll9k
$ kubectl -n kube-system get secret kube-proxy-token-5ll9k
NAME TYPE DATA AGE
kube-proxy-token-5ll9k kubernetes.io/service-account-token 3 16m
This service account is bound to a role too:
$ kubectl get clusterrolebindings kubeadm:node-proxier -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
creationTimestamp: 2017-12-07T12:52:54Z
name: kubeadm:node-proxier
resourceVersion: "181"
selfLink: /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/kubeadm%3Anode-proxier
uid: 8a9638df-db4d-11e7-8d7e-0e580b140468
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:node-proxier
subjects:
- kind: ServiceAccount
name: kube-proxy
namespace: kube-system
And the role exist and is looking good:
$ kubectl get clusterrole system:node-proxier -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
creationTimestamp: 2017-12-07T12:52:51Z
labels:
kubernetes.io/bootstrapping: rbac-defaults
name: system:node-proxier
resourceVersion: "63"
selfLink: /apis/rbac.authorization.k8s.io/v1/clusterroles/system%3Anode-proxier
uid: 88dfc662-db4d-11e7-8d7e-0e580b140468
rules:
- apiGroups:
- ""
resources:
- endpoints
- services
verbs:
- list
- watch
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- update
So I'm not sure what is going on. From how I understand everything, this should work, but the apiserver keeps logging: E1207 13:18:20.697707 1 authentication.go:64] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, crypto/rsa: verification error]]
Okay, so it looks like the token is only accepted by one instance of my apiservers, probably on the master where kubeadm init last ran. I thought the service account tokens get stored in etcd?
Mystery solved, thanks to gintas and foxie in #kubernetes-users: we also need to pre-generate the sa keys and distribute them along with the CA.
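To sum up that finding, a sketch of the files to distribute might look like this (the hostnames and the use of root over ssh/scp are placeholder assumptions):

```shell
# Copy BOTH the cluster CA and the service-account signing keypair from
# the first master to each additional master BEFORE running kubeadm there.
SHARED="ca.crt ca.key sa.key sa.pub"
for host in master2 master3; do      # placeholder hostnames
  ssh "root@$host" mkdir -p /etc/kubernetes/pki
  for f in $SHARED; do
    scp "/etc/kubernetes/pki/$f" "root@$host:/etc/kubernetes/pki/$f"
  done
done
```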
I followed @jamiehannaford's HA guide fairly closely and eventually reached a working HA cluster (set up in a Vagrant setting with a HAProxy load-balancer fronting three master nodes), but I hit a few obstacles along the way and thought I'd share them here since they are probably relevant irrespective of approach:
It is important that the etcd version is compatible with the Kubernetes version you're running. From what I can gather the guide targets k8s 1.9 and therefore uses etcd v3.1.10. For a k8s 1.8 installation (which I was targeting), you should use v3.0.17 (using v3.1.17 caused kubeadm to choke, failing to extract the etcd version).
I had to run etcd using systemd, since running it as a static pod under /etc/kubernetes/manifests would cause kubeadm preflight checks to fail (it expects that directory to be empty).
Before running kubeadm init on master1 and master2, you need to wait for master0 to generate certificates and, in addition to /etc/kubernetes/pki/ca.{crt,key}, copy the /etc/kubernetes/pki/sa.key and /etc/kubernetes/pki/sa.pub files to master1 and master2 (as hinted by @discordianfish). Otherwise, master1 and master2 will generate service account token signing certificates of their own, which in my case caused kube-proxy on those hosts to fail to authenticate against the apiserver.
There are also the files front-proxy-ca.{crt,key} and front-proxy-client.{crt,key}, which I did not copy. I'm unsure whether they should have been copied from master0 as well, but things appear to be working anyway.
The "regular" kubeadm installation guide encourages you to configure Docker to use the systemd cgroup driver. For me, that also required passing --cgroup-driver=systemd to the kubelet via KUBELET_EXTRA_ARGS.
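For example, on an RPM-based system this could be a one-line drop-in (the file path is an assumption and varies by distro; /etc/default/kubelet is the Debian-style counterpart):

```shell
# Pass --cgroup-driver=systemd to the kubelet via KUBELET_EXTRA_ARGS.
cat <<'EOF' | sudo tee /etc/sysconfig/kubelet
KUBELET_EXTRA_ARGS=--cgroup-driver=systemd
EOF

# Restart the kubelet so the new flag takes effect.
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```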
@petergardfjall Ha, it's funny to see how you ran into exactly the same issues. So yeah, as of yesterday my multi-master HA cluster also works. I ran into https://github.com/kubernetes/kubeadm/issues/590 though, did you find a nice solution for that? I didn't have to use a special etcd version; I think I'm just using the defaults in CoreOS' stable etcd-wrapper. Regarding the front-proxy stuff, I frankly have no idea what it is.
@discordianfish: I did not run into #590. I used a kubeadm config file with

api:
  advertiseAddress: <apiserver-loadbalancer-ip>

and it appears to have been picked up by the kube-proxy config map.
> kubectl get cm -n kube-system kube-proxy -o yaml
apiVersion: v1
data:
  kubeconfig.conf: |
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        server: https://<apiserver-loadbalancer-ip>:6443
      name: default
Ah okay. Right, it works with a load balancer IP, but you don't get a stable IP when running on AWS and using an ELB, so you need to use a name.
@discordianfish I see, that may actually become a problem since I'm planning on running it in AWS later on. How did you work around that?
@jamiehannaford in the HA guide you make references to using cloud-native loadbalancers. Did you experiment with that? Did you manage to get around #590?
No, haven't found a solution yet. Right now it's just a note in my docs to edit this config map manually.
And I just shot myself in the foot with this: kubeadm init on a new master will overwrite the configmap, and https://github.com/kubernetes/kubernetes/issues/57109 makes it even harder to realize this.
So from what I can tell, there is no way to use kubeadm right now in a multi-master setup without falling back to executing alpha phases manually.
@jamiehannaford's HA guide misses this in general. A cluster created like this will have the IP of a single master hardcoded, and breaks once this goes away.
Hello
I just experimented a bit with this and I think I have a working setup now. So here is what I did:
The experiment was performed on DigitalOcean with 4 x $20 droplets (3 masters + 1 worker).
First I created 3 droplets (CoreOS stable):
master1: 188.166.76.108
master2: 188.166.29.53
master3: 188.166.76.133
I then ran the following script on every node to configure the pieces needed to use kubeadm with CoreOS:
#!/bin/bash
set -o nounset -o errexit
RELEASE="$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)"
CNI_VERSION="v0.6.0"
mkdir -p /opt/bin
cd /opt/bin
curl -L --remote-name-all https://storage.googleapis.com/kubernetes-release/release/${RELEASE}/bin/linux/amd64/{kubeadm,kubelet,kubectl}
chmod +x {kubeadm,kubelet,kubectl}
mkdir -p /opt/cni/bin
curl -L "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-amd64-${CNI_VERSION}.tgz" | tar -C /opt/cni/bin -xz
BRANCH="release-$(cut -f1-2 -d .<<< "${RELEASE##v}")"
cd "/etc/systemd/system/"
curl -L "https://raw.githubusercontent.com/kubernetes/kubernetes/${BRANCH}/build/debs/kubelet.service" | sed 's:/usr/bin:/opt/bin:g' > kubelet.service
mkdir -p "/etc/systemd/system/kubelet.service.d"
cd "/etc/systemd/system/kubelet.service.d"
curl -L "https://raw.githubusercontent.com/kubernetes/kubernetes/${BRANCH}/build/debs/10-kubeadm.conf" | sed 's:/usr/bin:/opt/bin:g' > 10-kubeadm.conf
Create the initial master:
core@master-01 ~ $ sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-cert-extra-sans="127.0.0.1,188.166.76.108,188.166.29.53,188.166.76.133"
[...]
kubeadm join --token b11224.fada30ef8a7cbd38 188.166.76.108:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a
[...]
core@master-01 ~ $ sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf apply -f https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml
core@master-01 ~ $ sudo systemctl enable kubelet docker
Next we need to create an etcd cluster, so change the etcd manifest so that etcd listens for peers on all interfaces (WARNING: this isn't safe; in production you should at least use TLS for peer authentication/communication):
core@master-01 ~ $ sudo vi /etc/kubernetes/manifests/etcd.yaml
# add --listen-peer-urls=http://0.0.0.0:2380 as a command arg
core@master-01 ~ $ sudo systemctl restart kubelet # for some reason, kubelet does not pick up the change
Change the default etcd member's peer-url to its public IPv4 address:
core@master-01 ~ $ ETCDCTL_API=3 etcdctl member list
8e9e05c52164694d, started, default, http://localhost:2380, http://127.0.0.1:2379
core@master-01 ~ $ ETCDCTL_API=3 etcdctl member update 8e9e05c52164694d --peer-urls="http://188.166.76.108:2380"
Now copy all the kubernetes files (manifests/pki) to the other master nodes:
$ eval $(ssh-agent)
$ ssh-add <path to ssh key>
$ ssh -A core@188.166.29.53 # master-02
core@master-02 ~ $ sudo -E rsync -aP --rsync-path="sudo rsync" core@188.166.76.108:/etc/kubernetes/ /etc/kubernetes
$ ssh -A core@188.166.76.133 # master-03
core@master-03 ~ $ sudo -E rsync -aP --rsync-path="sudo rsync" core@188.166.76.108:/etc/kubernetes/ /etc/kubernetes
Add master-02 to the etcd cluster:
core@master-01 ~ $ ETCDCTL_API=3 etcdctl member add member-02 --peer-urls="http://188.166.29.53:2380"
Member b52af82cbbc8f30 added to cluster cdf818194e3a8c32
ETCD_NAME="member-02"
ETCD_INITIAL_CLUSTER="member-02=http://188.166.29.53:2380,default=http://188.166.76.108:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
$ ssh core@188.166.29.53 # master-02
core@master-02 ~ $ sudo vi /etc/kubernetes/manifests/etcd.yaml
# Add the following as args:
--name=member-02
--initial-cluster=member-02=http://188.166.29.53:2380,default=http://188.166.76.108:2380
--initial-cluster-state=existing
core@master-02 ~ $ sudo systemctl restart kubelet
Add master-03 to the etcd cluster:
core@master-01 ~ $ ETCDCTL_API=3 etcdctl member add master-03 --peer-urls="http://188.166.76.133:2380"
Member 874cba873a1f1e81 added to cluster cdf818194e3a8c32
ETCD_NAME="master-03"
ETCD_INITIAL_CLUSTER="member-02=http://188.166.29.53:2380,master-03=http://188.166.76.133:2380,default=http://188.166.76.108:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
$ ssh core@188.166.76.133 # master-03
core@master-03 ~ $ sudo vi /etc/kubernetes/manifests/etcd.yaml
# Add the following as args:
--name=master-03
--initial-cluster=member-02=http://188.166.29.53:2380,master-03=http://188.166.76.133:2380,default=http://188.166.76.108:2380
--initial-cluster-state=existing
core@master-03 ~ $ sudo systemctl start kubelet
So now we should have a 3-node etcd cluster.
Now let's have master-02 and master-03 join the k8s cluster:
$ ssh core@188.166.29.53 # master-02
core@master-02 ~ $ sudo rm /etc/kubernetes/pki/ca.crt /etc/kubernetes/kubelet.conf
core@master-02 ~ $ sudo kubeadm join --token b11224.fada30ef8a7cbd38 188.166.76.108:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a
$ ssh core@188.166.76.133 # master-03
core@master-03 ~ $ sudo rm /etc/kubernetes/pki/ca.crt /etc/kubernetes/kubelet.conf
core@master-03 ~ $ sudo kubeadm join --token b11224.fada30ef8a7cbd38 188.166.76.108:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a
Mark them as masters:
core@master-01 ~ $ sudo kubeadm alpha phase mark-master --node-name master-02
core@master-01 ~ $ sudo kubeadm alpha phase mark-master --node-name master-03
Change kubelet,kube-scheduler and kube-controller-manager to use the local apiserver instead of master-01 apiserver:
core@master-01 ~ $ sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{scheduler.conf,kubelet.conf,controller-manager.conf}
core@master-02 ~ $ sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{scheduler.conf,kubelet.conf,controller-manager.conf}
core@master-03 ~ $ sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{scheduler.conf,kubelet.conf,controller-manager.conf}
Change the kube-apiserver yaml file to advertise the correct IP and use the correct health-check IP:
core@master-02 ~ $ sudo sed 's/188.166.76.108/188.166.29.53/g' -i /etc/kubernetes/manifests/kube-apiserver.yaml
core@master-03 ~ $ sudo sed 's/188.166.76.108/188.166.76.133/g' -i /etc/kubernetes/manifests/kube-apiserver.yaml
Enable kubelet, docker and reboot:
core@master-01 ~ $ sudo systemctl enable kubelet docker; sudo reboot
core@master-02 ~ $ sudo systemctl enable kubelet docker; sudo reboot
core@master-03 ~ $ sudo systemctl enable kubelet docker; sudo reboot
Change kube-proxy to use the apiserver on localhost:
core@master-01 ~ $ sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf -n kube-system edit configmap kube-proxy
# Change server: https://<ip>:6443 to https://127.0.0.1:6443
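The same edit can be done non-interactively, e.g. (a sketch assuming the server line looks as shown in the configmap dumps above):

```shell
# Rewrite the kube-proxy configmap so kube-proxy talks to the apiserver on
# localhost instead of a single master's IP, then recreate the pods so they
# pick up the change.
kubectl --kubeconfig=/etc/kubernetes/admin.conf -n kube-system \
  get configmap kube-proxy -o yaml \
  | sed 's#server: https://.*:6443#server: https://127.0.0.1:6443#' \
  | kubectl --kubeconfig=/etc/kubernetes/admin.conf apply -f -

kubectl --kubeconfig=/etc/kubernetes/admin.conf -n kube-system \
  delete pod -l k8s-app=kube-proxy
```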
Now let's try adding a worker node (run the script at the top): worker-01: 178.62.216.244
$ ssh core@178.62.216.244
core@worker-01 ~ $ sudo iptables -t nat -I OUTPUT -p tcp -o lo --dport 6443 -j DNAT --to 188.166.76.108
core@worker-01 ~ $ sudo iptables -t nat -I POSTROUTING -o eth0 -j SNAT --to-source $(curl -s ipinfo.io | jq -r .ip)
core@worker-01 ~ $ sudo sysctl net.ipv4.conf.eth0.route_localnet=1
core@worker-01 ~ $ sudo kubeadm join --token b11224.fada30ef8a7cbd38 127.0.0.1:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a
core@worker-01 ~ $ sudo systemctl enable kubelet docker
Now we just need to add a local loadbalancer to the worker node, and everything is done.
Save the following as /etc/nginx/nginx.conf on the worker-01 node:
error_log stderr notice;
worker_processes auto;

events {
  use epoll;
  worker_connections 1024;
}

stream {
  upstream kube_apiserver {
    least_conn;
    server 188.166.76.108:6443 max_fails=3 fail_timeout=30s;
    server 188.166.29.53:6443 max_fails=3 fail_timeout=30s;
    server 188.166.76.133:6443 max_fails=3 fail_timeout=30s;
  }

  server {
    listen 127.0.0.1:6443 reuseport;
    proxy_pass kube_apiserver;
    proxy_timeout 10m;
    proxy_connect_timeout 1s;
  }
}
Create /etc/kubernetes/manifests:
core@worker-01 ~ $ sudo mkdir /etc/kubernetes/manifests
Add a static nginx-proxy manifest as /etc/kubernetes/manifests/nginx-proxy.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-proxy
  namespace: kube-system
  labels:
    k8s-app: kube-nginx
spec:
  hostNetwork: true
  containers:
  - name: nginx-proxy
    image: nginx:1.13-alpine
    imagePullPolicy: Always
    resources:
      limits:
        cpu: 200m
        memory: 128M
      requests:
        cpu: 50m
        memory: 32M
    volumeMounts:
    - mountPath: /etc/nginx
      name: etc-nginx
      readOnly: true
  volumes:
  - name: etc-nginx
    hostPath:
      path: /etc/nginx
Reboot the node: the temporary iptables rules should then be gone, and everything should work as expected.
A long post, but it shows that it is doable :)
Edit: Forgot to change the API server for the worker node: sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{bootstrap-kubelet.conf,kubelet.conf}
Edit2: The cluster-info configmap should also be changed: kubectl --kubeconfig=admin.conf -n kube-public get configmap cluster-info
@klausenbusk Great :tada:! If you want to carry/improve https://github.com/kubernetes/website/pull/6458, feel free to send a PR with more details on what you did to help @jamiehannaford which is on vacation at the moment.
@klausenbusk , on the master-02 and master-03 , I don't understand how you were able to join? Since the /etc/kubernetes directory is not empty. Can you please clarify if there is a step missing? Thanks.
@klausenbusk , on the master-02 and master-03 , I don't understand how you were able to join? Since the /etc/kubernetes directory is not empty. Can you please clarify if there is a step missing?
I did run sudo rm /etc/kubernetes/pki/ca.crt /etc/kubernetes/kubelet.conf as documented; removing the whole directory wasn't needed.
To @discordianfish and others wanting to run a HA setup on AWS.
I did manage to get a HA setup to work with Amazon's ELB (despite it not having a single static IP address).
To get it to work, the following steps (in addition to @jamiehannaford's HA guide) need to be taken:
Since the ELB does not have a static IP address, we cannot use that as the apiserver advertise address. Instead, we let each master advertise its own private IP address.
The down-side of this approach seems to be that the apiservers will "fight" over the endpoint record, rewriting it every now and then (as can be seen via kubectl get endpoints), which, in turn, has consequences for kube-proxy, which will rewrite its iptables rules whenever a change is detected.
This doesn't appear to harm the correctness of Kubernetes, but I guess it can lead to some performance degradation in large clusters. Any thoughts?
The issue is discussed in greater detail here.
All worker kubelets and kube-proxies need to access the API servers via the load-balancer's FQDN. Since kubeadm doesn't allow us to specify different servers for kube-proxy and worker kubelets (they will simply use the IP address of the apiserver that they happened to connect to at kubeadm join), we need to take care of this ourselves.
The kube-proxy configuration is stored as a configmap, which gets overwritten every time kubeadm init is run (once for every master node). Therefore, for each kubeadm init we need to patch the configmap as follows:
kubectl get configmap -n kube-system kube-proxy -o yaml > kube-proxy.cm
sudo sed -i 's#server:.*#server: https://
kubectl delete pod -n kube-system -l k8s-app=kube-proxy
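For illustration only, filled in with a hypothetical load-balancer FQDN and an assumed apply step (the truncated sed commands above are left as-is), the whole patch might read:

```shell
LB_FQDN="my-elb.example.com"   # hypothetical load-balancer name

# Dump, rewrite and re-apply the kube-proxy configmap so it points at the LB,
# then recreate the kube-proxy pods.
kubectl get configmap -n kube-system kube-proxy -o yaml > kube-proxy.cm
sed -i "s#server:.*#server: https://${LB_FQDN}:6443#" kube-proxy.cm
kubectl apply -f kube-proxy.cm
kubectl delete pod -n kube-system -l k8s-app=kube-proxy
```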
On each worker we need to patch the kubelet configuration after join, so that the kubelet connects via the load-balancer:
sudo kubeadm join --config=kubeadm-config.yaml
wait_for 60 [ -f /etc/kubernetes/kubelet.conf ]
sudo sed -i 's#server:.*#server: https://
With this approach I seem to have a working cluster where one master at a time can go down without (apiserver) service disruption.
This doesn't appear to harm the correctness of Kubernetes, but I guess it can lead to some performance degradation in large clusters. Any thoughts?
You can switch to the new lease reconciler in 1.9, it should fix the "fighting" over the endpoint issue.
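For reference, the lease reconciler is selected via a kube-apiserver flag (alpha in 1.9); in a static-pod setup this would mean adding it to /etc/kubernetes/manifests/kube-apiserver.yaml, roughly (a sketch of the relevant fragment only, not a full manifest):

spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # switch endpoint reconciling from the default master-count
    # reconciler to the lease-based one
    - --endpoint-reconciler-type=lease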
Excellent advice @klausenbusk. It worked like a charm.
@petergardfjall

"they will simply use the IP address of the apiserver that they happened to connect to at kubeadm join"

What happens if you do kubeadm join with the LB's IP?
In terms of the kubelet, I think that's a necessary manual edit. Need to add to HA guide.
The planned HA features in kubeadm are not going to make it into v1.9 (see #261). So what can be done to make a cluster setup by kubeadm sufficiently HA?
This is what it looks like now:
Hence an active/active or active/passive master setup needs to be created (i.e. mimicking what kubeadm would supposedly be doing in the future):
This seems achievable if converting the existing master instance to a cluster of masters (2) can be done (the Kubernetes guide for building HA clusters seems to indicate so). Active/active would be not more expensive than active/passive.
I am currently working on this. If I succeed I shall share what I find out here.