Closed mbert closed 6 years ago
See also https://github.com/cookeem/kubeadm-ha - this seems to cover what I want to achieve here.
@mbert we started implementing the HA features and chopped wood on the underlying dependency stack now in v1.9, but it's a short cycle for a big task, so the work will continue in v1.10 as you pointed out.
For v1.9, we will document what you're describing here in the official docs though; how to achieve HA with external deps like setting up a LB
Excellent. I am digging through all this right now. I am currently stuck at bootstrapping masters 2 and 3, in particular how to configure kubelet and the apiserver (how much can I reuse from master 1?) and etcd (I am thinking of using a bootstrap etcd on a separate machine for discovery). The guide from the docs is a bit terse when it comes to this.
@mbert I have been following your comments here and I just want to let you know I followed the guide in docs and was able to stand up a working HA k8s cluster using kubeadm (v1.8.x).
If you are following this setup and you need to bootstrap master 2 and 3, you can reuse almost everything from the first master. You then need to fix up the following configuration files on master 2 and 3 to reflect the current host: /etc/kubernetes/manifests/kube-apiserver.yaml, /etc/kubernetes/kubelet.conf, /etc/kubernetes/admin.conf, and /etc/kubernetes/controller-manager.conf
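For illustration, fixing up those files mostly amounts to swapping master 1's address for the current host's address. A rough sketch (the IP values are placeholders, not from this thread):

```shell
# Placeholder addresses -- substitute your actual master 1 IP and this host's IP.
MASTER1_IP="10.0.0.11"
THIS_IP="10.0.0.12"

# Rewrite master 1's address in each of the copied configuration files.
for f in /etc/kubernetes/manifests/kube-apiserver.yaml \
         /etc/kubernetes/kubelet.conf \
         /etc/kubernetes/admin.conf \
         /etc/kubernetes/controller-manager.conf; do
  sudo sed -i "s/${MASTER1_IP}/${THIS_IP}/g" "$f"
done
```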
Regarding etcd, if you follow this guide in the docs you should stand up an external 3-node etcd cluster that spans the 3 k8s master nodes.
There is also one 'gotcha' item that has NOT yet been covered in the guide. You can see this issue for details: https://github.com/cookeem/kubeadm-ha/issues/6
I also asked a few questions related to kubeadm HA in this post: https://github.com/cookeem/kubeadm-ha/issues/7
I really hope you can give me some thoughts on these.
Thank you in advance for your time.
This is great - definitely needed, as I am sure 99% of kubeadm users have a nagging paranoia in the back of their heads about the HA of their master(s).
@kcao3 thank you. I will look into this all on coming Monday. So I understand that it is OK to use identical certificates on all three masters?
If yes, I assume the next thing I'll try will be bringing up kubelet and the apiserver on masters 2 and 3 using the configuration from master 1 (with modified IPs and host names, of course) and then bootstrapping the etcd cluster by putting a modified etcd.yaml into /etc/kubernetes/manifests.
Today I ran into problems because the running etcd on master 1 already had cluster information in its data dir, which I had to remove first, but I was still running into problems. I guess a few good nights of sleep will help.
Once I've got this running I shall document the whole process and publish it.
@srflaxu40 yep, and in particular if you have an application that indirectly requires apiserver at runtime (legacy application and service discovery in my case) you cannot afford to lose the only master at any time.
I have been able to replace the single etcd instance with a cluster in a fresh K8s cluster. The steps are roughly these:
Step 5 is somewhat awkward, and I have found that if I miss the right time here or need too much time to get the other two masters to join (step 6) my cluster gets into a state from which it can hardly recover. When this happened, the simplest solution I found was to shut down kubelet on master 2 and 3, run kubeadm reset on all masters and minions, clear the /var/lib/etcd directories on all masters and set up a new cluster using kubeadm init.
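The tear-down sequence described above could be sketched like this (a rough outline of the steps named in the comment; run each part on the hosts indicated):

```shell
# On masters 2 and 3: stop kubelet so they no longer interfere.
sudo systemctl stop kubelet

# On ALL masters and minions: wipe the kubeadm-generated state.
sudo kubeadm reset

# On ALL masters: clear the etcd data left over from the broken cluster.
sudo rm -rf /var/lib/etcd/*

# On master 1 only: bootstrap a fresh cluster.
sudo kubeadm init
```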
While this works, I'd be interested in possible improvements: Is there any alternative, more elegant and robust approach to this (provided that I still want to follow the approach of running etcd in containers on the masters)?
This comment aims to collect feedback and hints at an early stage. I will post updates on the next steps in a similar way before finally documenting this as a followable guide.
@mbert Why don't you use an independent etcd cluster instead of creating one inside k8s?
@KeithTt Thank you for your feedback. I was thinking about these here:
If an independent etcd cluster's advantages outweigh the above list, I shall be happy to be convinced otherwise.
@mbert Please make sure you sync with @jamiehannaford on this effort, he's also working on this / committed to making these docs a thing in v1.9
@mbert are you available to join our SIG meeting today 9PT or the kubeadm implementation PR tomorrow 9PT? I'd love to discuss this with you in a call :+1:
@luxas actually it was @jamiehannaford who asked me to open this issue. Once I have got things running and documented I hope to get lots of feedback from him. 9PT, that's in an hour, right? That would be fine. Just let me know how to connect with you.
Following guides here and there I managed to do it. Here are my final steps:
/cc @craigtracey
@mbert
Created - not converted - a 3-master-node cluster using kubeadm, with a 3-node etcd cluster deployed on Kubernetes
Here's what I needed to do:
Problems:
The way I did it was using kubeadm alpha phase steps; a short list follows:
on all master nodes:
on masternode1:
This is a really short list of what I did, and it can be automated and reproduced in 5 minutes. Also, for me the greatest bonus was that I was able to set a non-standard pod-network CIDR, as I had the restriction of not being able to spare a class B IP address range.
If you're interested in more detailed version, please let me know and I'll try and create some docs on how this was done.
@dimitrijezivkovic thank you for your comment. I think it would make sense to put all the relevant information together so that one piece of documentation comes out.
I plan to set up a Google Docs document and start documenting what I did (which is pretty bare-bones). I would then invite others to join and contribute extensions, corrections and comments.
I have now "documented" a very simple setup in form of a small ansible project: https://github.com/mbert/kubeadm2ha
It is of course still a work in progress, but it already allows setting up a multi-master cluster without any bells and whistles. I have tried to keep it as simple as possible, so that by reading it one should be able to find out pretty easily what needs to be done in which order.
Tomorrow I will start writing this up as a simple cooking recipe in a google docs document and invite others to collaborate.
Just to call it out explicitly, there's a bunch of orthogonal issues mashed together in the above conversation/suggestions. It might be useful to break these out separately, and perhaps prioritise some above others:
kubeadm upgrade
support for multi-apiserver/cm-scheduler (varies depending on self-hosted vs non-self-hosted)

Imo the bare minimum we need is etcd durability (or perhaps availability), and the rest can wait. That removes the "fear" factor, while still requiring some manual intervention to recover from a primary master failure (i.e. an active/passive setup of sorts).
I think the details of the rest depend hugely on self-hosted vs "legacy", so I feel like it would simplify greatly if we just decided now to assume self-hosted (or not?) - or we clearly fork the workarounds/docs into those two buckets so we don't confuse readers by chopping and changing.
Aside: One of the challenges here is that just about everything to do with install+upgrade changes if you assume a self-hosted+HA setup (it mostly simplifies everything because you can use rolling upgrades, and in-built k8s machinery). I feel that by continually postponing this setup we've actually made it harder for ourselves to reach that eventual goal, and I worry that we're just going to keep pushing the "real" setup back further while we work on perfecting irrelevant single-master upgrades :( I would rather we addressed the HA setup first, and then worked backwards to try to produce a single-host approximation if required (perhaps by packing duplicate jobs temporarily onto the single host), rather than trying to solve single-host and then somehow think that experience will help us with multi-host.
@mbert I have achieved the HA proposal by generating the certs manually for each node, and without deleting NodeRestriction. I use haproxy+keepalived as the loadbalancer now; maybe lvs+keepalived would be better. I will document the details this weekend and hope to share them with you.
FYI all, @mbert has started working on a great WIP guide for kubeadm HA manually that we'll add to the v1.9 kubeadm docs eventually: https://docs.google.com/document/d/1rEMFuHo3rBJfFapKBInjCqm2d7xGkXzh0FpFO0cRuqg/edit
Please take a look at the doc everyone, and provide your comments. We'll soon-ish convert this into markdown and send as a PR to kubernetes/website.
Thank you @mbert and all the others that are active in thread, this will be a great collaboration!
@mbert / @luxas: that doc doesn't allow comments (for me at least :cry:)
Done, I had the wrong setting in the doc.
@mbert I have a question for you. Following your approach, assuming I have a functioning HA k8s cluster: do you know how to add new k8s masters to this existing cluster? The issue I am facing now is that the certs were generated based on the FIXED set of k8s master hosts at the time the cluster was bootstrapped. This now prevents any new master from joining the cluster. In the kubelet's log on the new master, you would see something like this: "... x509: certificate is valid for 192.168.1.x, 192.168.1.y, 192.168.1.z, not 192.168.1.n" (where .x, .y, .z are the IP addresses of the current masters, and .n is the address of the new master). Do you know how to resolve this issue? Do the master nodes have to use the same certificates in this case?
@kcao3 I am not very familiar with this particular aspect. Maybe @jamiehannaford can tell you more about this?
@kcao3 Each master join will generate TLS assets using the specific IPv4 for that server. The config also accepts additional SANs, which should include the LB IPv4 which sits in front of the masters. I have a HA guide in review, so check that out if you have time.
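As a quick diagnostic (an assumption on my part, not from the guide), you can inspect which SANs a kubeadm-generated apiserver cert actually contains with plain openssl:

```shell
# List the Subject Alternative Names baked into the apiserver certificate;
# the new master's IP (or the LB IP) must appear here, otherwise the kubelet
# will log the x509 "certificate is valid for ..." error quoted above.
# Path assumes the default kubeadm layout.
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text \
  | grep -A1 'Subject Alternative Name'
```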
I have just pushed a new commit to https://github.com/mbert/kubeadm2ha
@mbert I just read the HA guide from @jamiehannaford: https://github.com/jamiehannaford/kubernetes.github.io/blob/3663090ea9b9a29a00c79dd2916e11737ccf1802/docs/setup/independent/high-availability.md. Is it possible that on each master node we have kubeadm generate and sign separate certificates using the same ca.crt and ca.key?
So the only things that need to be copied from the primary master to the secondary masters are the ca.crt and ca.key. With this approach, on each master (primary and secondary) we will run 'kubeadm init' using a kubeadm configuration file generated from a template like the following:
apiVersion: kubeadm.k8s.io/v1alpha1
kind: MasterConfiguration
kubernetesVersion: v{{ KUBERNETES_VERSION }}
networking:
  podSubnet: {{ POD_NETWORK_CIDR }}
api:
  advertiseAddress: {{ MASTER_VIP }}
apiServerCertSANs:
- {{ MASTER_VIP }}
etcd:
  endpoints:
{% for host in groups['masters'] %}
  - http://{{ hostvars[host]['ansible_default_ipv4']['address'] }}:2379
{% endfor %}
If this approach works, it will allow k8s admins to add any new master to their existing multi-masters cluster down the road.
Any thoughts?
@kcao3 That's what I'm trying to do. I figured out I also need to pre-generate proxy CA cert+keys which are different.
But now when I run kubeadm init on my masters, all components come up properly, but kube-proxy still fails due to authentication issues, even though front-proxy-client.crt is now signed by the same CA on all nodes.
@discordianfish I also ran into auth issues but when deploying Flannel. Wonder if it's related to what you're seeing.
In the meantime I figured out that the 'proxy CA' (front-proxy-*) isn't related to kube-proxy. Still trying to figure out what is going on; it looks though like there is no system:node-proxier role, but I don't know what is supposed to create it.
Since the frontend-proxy stuff was a red herring, I'm starting over with a clean slate now. But it would be great if someone could confirm that it should work to create the CA credentials and just run init on all masters, given the right advertiseAddress, SANs and etcd endpoints of course? Because I'm most worried that kubeadm still somehow generates local secrets that other masters don't know about.
When my masters come up, kube-proxy works at first, but kube-proxy on the last master fails. When I recreated the pods, all of them failed. So running kubeadm init multiple times from different hosts against the same etcd somehow breaks the authentication.
The service account looks correct and has a secret:
$ kubectl -n kube-system get ds kube-proxy -o yaml|grep serviceAccount
serviceAccount: kube-proxy
serviceAccountName: kube-proxy
$ kubectl -n kube-system get sa kube-proxy -o yaml|grep -A1 secrets
secrets:
- name: kube-proxy-token-5ll9k
$ kubectl -n kube-system get secret kube-proxy-token-5ll9k
NAME TYPE DATA AGE
kube-proxy-token-5ll9k kubernetes.io/service-account-token 3 16m
This service account is bound to a role too:
$ kubectl get clusterrolebindings kubeadm:node-proxier -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
creationTimestamp: 2017-12-07T12:52:54Z
name: kubeadm:node-proxier
resourceVersion: "181"
selfLink: /apis/rbac.authorization.k8s.io/v1/clusterrolebindings/kubeadm%3Anode-proxier
uid: 8a9638df-db4d-11e7-8d7e-0e580b140468
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: system:node-proxier
subjects:
- kind: ServiceAccount
name: kube-proxy
namespace: kube-system
And the role exist and is looking good:
$ kubectl get clusterrole system:node-proxier -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
annotations:
rbac.authorization.kubernetes.io/autoupdate: "true"
creationTimestamp: 2017-12-07T12:52:51Z
labels:
kubernetes.io/bootstrapping: rbac-defaults
name: system:node-proxier
resourceVersion: "63"
selfLink: /apis/rbac.authorization.k8s.io/v1/clusterroles/system%3Anode-proxier
uid: 88dfc662-db4d-11e7-8d7e-0e580b140468
rules:
- apiGroups:
- ""
resources:
- endpoints
- services
verbs:
- list
- watch
- apiGroups:
- ""
resources:
- nodes
verbs:
- get
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- update
So I'm not sure what is going on. From how I understand everything, this should work, but the apiserver keeps logging: E1207 13:18:20.697707 1 authentication.go:64] Unable to authenticate the request due to an error: [invalid bearer token, [invalid bearer token, crypto/rsa: verification error]]
Okay, so it looks like the token is only accepted by one instance of my apiservers, probably on the master where kubeadm init last ran. I thought the service account tokens get stored in etcd?
Mystery solved, thanks to gintas and foxie in #kubernetes-users: we also need to pre-generate the sa keys and distribute them along with the CA.
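To sum up that finding, a sketch of the files to distribute might look like this (the hostnames and the use of root over ssh/scp are placeholder assumptions):

```shell
# Copy BOTH the cluster CA and the service-account signing keypair from
# the first master to each additional master BEFORE running kubeadm there.
SHARED="ca.crt ca.key sa.key sa.pub"
for host in master2 master3; do      # placeholder hostnames
  ssh "root@$host" mkdir -p /etc/kubernetes/pki
  for f in $SHARED; do
    scp "/etc/kubernetes/pki/$f" "root@$host:/etc/kubernetes/pki/$f"
  done
done
```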
I followed @jamiehannaford's HA guide fairly closely and eventually reached a working HA cluster (set up in a Vagrant setting with a HAProxy load-balancer fronting three master nodes), but I hit a few obstacles along the way and thought I'd share them here since they are probably relevant irrespective of approach:
It is important that the etcd version is compatible with the Kubernetes version you're running. From what I can gather the guide targets k8s 1.9 and therefore uses etcd v3.1.10. For a k8s 1.8 installation (which I was targeting), you should use v3.0.17 (using v3.1.17 caused kubeadm to choke, failing to extract the etcd version).
I had to run etcd using systemd, since running it as a static pod under /etc/kubernetes/manifests would cause kubeadm preflight checks to fail (it expects that directory to be empty).
Before running kubeadm init on master1 and master2, you need to wait for master0 to generate certificates and, in addition to /etc/kubernetes/pki/ca.{crt,key}, copy the /etc/kubernetes/pki/sa.key and /etc/kubernetes/pki/sa.pub files to master1 and master2 (as hinted by @discordianfish). Otherwise, master1 and master2 will generate service account token signing certificates of their own, which in my case caused kube-proxy on those hosts to fail to authenticate against the apiserver.
There are also the files front-proxy-ca.{crt,key} and front-proxy-client.{crt,key}, which I did not copy. I'm unsure whether they should have been copied from master0 as well, but things appear to be working anyway.
The "regular" kubeadm installation guide encourages you to configure Docker to use the systemd cgroup driver. For me, that also required passing --cgroup-driver=systemd to the kubelet via KUBELET_EXTRA_ARGS.
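For example, on an RPM-based system this could be a one-line drop-in (the file path is an assumption and varies by distro; /etc/default/kubelet is the Debian-style counterpart):

```shell
# Pass --cgroup-driver=systemd to the kubelet via KUBELET_EXTRA_ARGS.
cat <<'EOF' | sudo tee /etc/sysconfig/kubelet
KUBELET_EXTRA_ARGS=--cgroup-driver=systemd
EOF

# Restart the kubelet so the new flag takes effect.
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```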
@petergardfjall Ha, it's funny to see how you ran into exactly the same issues. So yeah, as of yesterday my multi-master HA cluster also works. I ran into https://github.com/kubernetes/kubeadm/issues/590 though, did you find a nice solution for that? I didn't have to use a special etcd version; I think I'm just using the defaults in CoreOS' stable etcd-wrapper. Regarding the front-proxy stuff, I frankly have no idea what it is.
@discordianfish: I did not run into #590. I used a kubeadm config file with

api:
  advertiseAddress: <apiserver-loadbalancer-ip>

and it appears to have been picked up by the kube-proxy config map.
> kubectl get cm -n kube-system kube-proxy -o yaml
apiVersion: v1
data:
  kubeconfig.conf: |
    apiVersion: v1
    kind: Config
    clusters:
    - cluster:
        certificate-authority: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        server: https://<apiserver-loadbalancer-ip>:6443
      name: default
Ah okay. Right, it works with a load balancer IP, but you don't get a stable IP when running on AWS and using an ELB, so you need to use a name.
@discordianfish I see, that may actually become a problem since I'm planning on running it in AWS later on. How did you work around that?
@jamiehannaford in the HA guide you make references to using cloud-native loadbalancers. Did you experiment with that? Did you manage to get around #590?
No, haven't found a solution yet. Right now it's just a note in my docs to edit this config map manually.
And I just shot myself in the foot with this: kubeadm init on a new master will overwrite the configmap, and https://github.com/kubernetes/kubernetes/issues/57109 makes it even harder to realize this.
So from what I can tell, there is no way to use kubeadm right now in a multi-master setup without falling back to executing alpha phases manually.
@jamiehannaford's HA guide misses this in general. A cluster created like this will have the IP of a single master hardcoded, and breaks once this goes away.
Hello
I just experimented a bit with this and I think I have a working setup now. So here is what I did:
The experiment was performed on DigitalOcean with 4 x $20 droplets (3 masters + 1 worker).
First I created 3 droplets (CoreOS stable):
master1: 188.166.76.108
master2: 188.166.29.53
master3: 188.166.76.133
I then ran the following script on every node to configure the pieces needed to use kubeadm with CoreOS:
#!/bin/bash
set -o nounset -o errexit
RELEASE="$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)"
CNI_VERSION="v0.6.0"
mkdir -p /opt/bin
cd /opt/bin
curl -L --remote-name-all https://storage.googleapis.com/kubernetes-release/release/${RELEASE}/bin/linux/amd64/{kubeadm,kubelet,kubectl}
chmod +x {kubeadm,kubelet,kubectl}
mkdir -p /opt/cni/bin
curl -L "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-amd64-${CNI_VERSION}.tgz" | tar -C /opt/cni/bin -xz
BRANCH="release-$(cut -f1-2 -d .<<< "${RELEASE##v}")"
cd "/etc/systemd/system/"
curl -L "https://raw.githubusercontent.com/kubernetes/kubernetes/${BRANCH}/build/debs/kubelet.service" | sed 's:/usr/bin:/opt/bin:g' > kubelet.service
mkdir -p "/etc/systemd/system/kubelet.service.d"
cd "/etc/systemd/system/kubelet.service.d"
curl -L "https://raw.githubusercontent.com/kubernetes/kubernetes/${BRANCH}/build/debs/10-kubeadm.conf" | sed 's:/usr/bin:/opt/bin:g' > 10-kubeadm.conf
Create the initial master:
core@master-01 ~ $ sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-cert-extra-sans="127.0.0.1,188.166.76.108,188.166.29.53,188.166.76.133"
[...]
kubeadm join --token b11224.fada30ef8a7cbd38 188.166.76.108:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a
[...]
core@master-01 ~ $ sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf apply -f https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml
core@master-01 ~ $ sudo systemctl enable kubelet docker
Next we need to create an etcd cluster, so change the etcd manifest so that etcd listens for peers on all interfaces (WARNING: this isn't safe; in production you should at least use TLS for peer authentication/communication):
core@master-01 ~ $ sudo vi /etc/kubernetes/manifests/etcd.yaml
# add --listen-peer-urls=http://0.0.0.0:2380 as a command arg
core@master-01 ~ $ sudo systemctl restart kubelet # for some reason, kubelet does not pick up the change
Change the default etcd member's peer-url to its public IPv4 address:
core@master-01 ~ $ ETCDCTL_API=3 etcdctl member list
8e9e05c52164694d, started, default, http://localhost:2380, http://127.0.0.1:2379
core@master-01 ~ $ ETCDCTL_API=3 etcdctl member update 8e9e05c52164694d --peer-urls="http://188.166.76.108:2380"
Now copy all the kubernetes files (manifests/pki) to the other master nodes:
$ eval $(ssh-agent)
$ ssh-add <path to ssh key>
$ ssh -A core@188.166.29.53 # master-02
core@master-02 ~ $ sudo -E rsync -aP --rsync-path="sudo rsync" core@188.166.76.108:/etc/kubernetes/ /etc/kubernetes
$ ssh -A core@188.166.76.133 # master-03
core@master-03 ~ $ sudo -E rsync -aP --rsync-path="sudo rsync" core@188.166.76.108:/etc/kubernetes/ /etc/kubernetes
Add master-02 to the etcd cluster:
core@master-01 ~ $ ETCDCTL_API=3 etcdctl member add member-02 --peer-urls="http://188.166.29.53:2380"
Member b52af82cbbc8f30 added to cluster cdf818194e3a8c32
ETCD_NAME="member-02"
ETCD_INITIAL_CLUSTER="member-02=http://188.166.29.53:2380,default=http://188.166.76.108:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
$ ssh core@188.166.29.53 # master-02
core@master-02 ~ $ sudo vi /etc/kubernetes/manifests/etcd.yaml
# Add the following as args:
--name=member-02
--initial-cluster=member-02=http://188.166.29.53:2380,default=http://188.166.76.108:2380
--initial-cluster-state=existing
core@master-02 ~ $ sudo systemctl restart kubelet
Add master-03 to the etcd cluster:
core@master-01 ~ $ ETCDCTL_API=3 etcdctl member add master-03 --peer-urls="http://188.166.76.133:2380"
Member 874cba873a1f1e81 added to cluster cdf818194e3a8c32
ETCD_NAME="master-03"
ETCD_INITIAL_CLUSTER="member-02=http://188.166.29.53:2380,master-03=http://188.166.76.133:2380,default=http://188.166.76.108:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
$ ssh core@188.166.76.133 # master-03
core@master-03 ~ $ sudo vi /etc/kubernetes/manifests/etcd.yaml
# Add the following as args:
--name=master-03
--initial-cluster=member-02=http://188.166.29.53:2380,master-03=http://188.166.76.133:2380,default=http://188.166.76.108:2380
--initial-cluster-state=existing
core@master-03 ~ $ sudo systemctl start kubelet
So now we should have a 3-node etcd cluster.
Now let's have master-02 and master-03 join the k8s cluster:
$ ssh core@188.166.29.53 # master-02
core@master-02 ~ $ sudo rm /etc/kubernetes/pki/ca.crt /etc/kubernetes/kubelet.conf
core@master-02 ~ $ sudo kubeadm join --token b11224.fada30ef8a7cbd38 188.166.76.108:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a
$ ssh core@188.166.76.133 # master-03
core@master-03 ~ $ sudo rm /etc/kubernetes/pki/ca.crt /etc/kubernetes/kubelet.conf
core@master-03 ~ $ sudo kubeadm join --token b11224.fada30ef8a7cbd38 188.166.76.108:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a
Mark them as masters:
core@master-01 ~ $ sudo kubeadm alpha phase mark-master --node-name master-02
core@master-01 ~ $ sudo kubeadm alpha phase mark-master --node-name master-03
Change kubelet,kube-scheduler and kube-controller-manager to use the local apiserver instead of master-01 apiserver:
core@master-01 ~ $ sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{scheduler.conf,kubelet.conf,controller-manager.conf}
core@master-02 ~ $ sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{scheduler.conf,kubelet.conf,controller-manager.conf}
core@master-03 ~ $ sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{scheduler.conf,kubelet.conf,controller-manager.conf}
Change the kube-apiserver yaml file to advertise the correct IP and use the correct health-check IP:
core@master-02 ~ $ sudo sed 's/188.166.76.108/188.166.29.53/g' -i /etc/kubernetes/manifests/kube-apiserver.yaml
core@master-03 ~ $ sudo sed 's/188.166.76.108/188.166.76.133/g' -i /etc/kubernetes/manifests/kube-apiserver.yaml
Enable kubelet, docker and reboot:
core@master-01 ~ $ sudo systemctl enable kubelet docker; sudo reboot
core@master-02 ~ $ sudo systemctl enable kubelet docker; sudo reboot
core@master-03 ~ $ sudo systemctl enable kubelet docker; sudo reboot
Change kube-proxy to use the apiserver on localhost:
core@master-01 ~ $ sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf -n kube-system edit configmap kube-proxy
# Change server: https://<ip>:6443 to https://127.0.0.1:6443
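The same edit can be done non-interactively, e.g. (a sketch assuming the server line looks as shown in the configmap dumps above):

```shell
# Rewrite the kube-proxy configmap so kube-proxy talks to the apiserver on
# localhost instead of a single master's IP, then recreate the pods so they
# pick up the change.
kubectl --kubeconfig=/etc/kubernetes/admin.conf -n kube-system \
  get configmap kube-proxy -o yaml \
  | sed 's#server: https://.*:6443#server: https://127.0.0.1:6443#' \
  | kubectl --kubeconfig=/etc/kubernetes/admin.conf apply -f -

kubectl --kubeconfig=/etc/kubernetes/admin.conf -n kube-system \
  delete pod -l k8s-app=kube-proxy
```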
Now let's try adding a worker node (run the script at the top): worker-01: 178.62.216.244
$ ssh core@178.62.216.244
core@worker-01 ~ $ sudo iptables -t nat -I OUTPUT -p tcp -o lo --dport 6443 -j DNAT --to 188.166.76.108
core@worker-01 ~ $ sudo iptables -t nat -I POSTROUTING -o eth0 -j SNAT --to-source $(curl -s ipinfo.io | jq -r .ip)
core@worker-01 ~ $ sudo sysctl net.ipv4.conf.eth0.route_localnet=1
core@worker-01 ~ $ sudo kubeadm join --token b11224.fada30ef8a7cbd38 127.0.0.1:6443 --discovery-token-ca-cert-hash sha256:19d34ff6e69203a799ab5984a212684b3dcd446ca5e9d6f6c1a8ae422583b62a
core@worker-01 ~ $ sudo systemctl enable kubelet docker
Now we just need to add a local loadbalancer to the worker node, and everything is done.
Save the following as /etc/nginx/nginx.conf on the worker-01 node:
error_log stderr notice;
worker_processes auto;

events {
  use epoll;
  worker_connections 1024;
}

stream {
  upstream kube_apiserver {
    least_conn;
    server 188.166.76.108:6443 max_fails=3 fail_timeout=30s;
    server 188.166.29.53:6443 max_fails=3 fail_timeout=30s;
    server 188.166.76.133:6443 max_fails=3 fail_timeout=30s;
  }

  server {
    listen 127.0.0.1:6443 reuseport;
    proxy_pass kube_apiserver;
    proxy_timeout 10m;
    proxy_connect_timeout 1s;
  }
}
Create /etc/kubernetes/manifests:
core@worker-01 ~ $ sudo mkdir /etc/kubernetes/manifests
Add a static nginx-proxy manifest as /etc/kubernetes/manifests/nginx-proxy.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-proxy
  namespace: kube-system
  labels:
    k8s-app: kube-nginx
spec:
  hostNetwork: true
  containers:
  - name: nginx-proxy
    image: nginx:1.13-alpine
    imagePullPolicy: Always
    resources:
      limits:
        cpu: 200m
        memory: 128M
      requests:
        cpu: 50m
        memory: 32M
    volumeMounts:
    - mountPath: /etc/nginx
      name: etc-nginx
      readOnly: true
  volumes:
  - name: etc-nginx
    hostPath:
      path: /etc/nginx
Reboot the node: the temporary iptables rules should then be gone, and everything should work as expected.
A long post, but it shows that it is doable :)
Edit: Forgot to change the API server for the worker node: sudo sed 's/188.166.76.108/127.0.0.1/g' -i /etc/kubernetes/{bootstrap-kubelet.conf,kubelet.conf}
Edit2: The cluster-info configmap should also be changed: kubectl --kubeconfig=admin.conf -n kube-public get configmap cluster-info
@klausenbusk Great :tada:! If you want to carry/improve https://github.com/kubernetes/website/pull/6458, feel free to send a PR with more details on what you did to help @jamiehannaford which is on vacation at the moment.
@klausenbusk , on the master-02 and master-03 , I don't understand how you were able to join? Since the /etc/kubernetes directory is not empty. Can you please clarify if there is a step missing? Thanks.
@klausenbusk , on the master-02 and master-03 , I don't understand how you were able to join? Since the /etc/kubernetes directory is not empty. Can you please clarify if there is a step missing?
I did run sudo rm /etc/kubernetes/pki/ca.crt /etc/kubernetes/kubelet.conf as documented; removing the whole directory wasn't needed.
To @discordianfish and others wanting to run a HA setup on AWS.
I did manage to get a HA setup to work with Amazon's ELB (despite it not having a single static IP address).
To get it to work, the following steps (in addition to @jamiehannaford's HA guide) need to be taken:
Since the ELB does not have a static IP address, we cannot use that as the apiserver advertise address. Instead, we let each master advertise its own private IP address.
The down-side of this approach seems to be that the apiservers will "fight" over the endpoint record, rewriting it every now and then (as can be seen via kubectl get endpoints), which, in turn, has consequences for kube-proxy, which will rewrite its iptables rules whenever a change is detected.
This doesn't appear to harm the correctness of Kubernetes, but I guess it can lead to some performance degradation in large clusters. Any thoughts?
The issue is discussed in greater detail here.
All worker kubelets and kube-proxies need to access the API servers via the load-balancer's FQDN. Since kubeadm doesn't allow us to specify different servers for kube-proxy and worker kubelets (they will simply use the IP address of the apiserver that they happened to connect to at kubeadm join), we need to take care of this ourselves.
The kube-proxy configuration is stored as a configmap, which gets overwritten every time kubeadm init is run (once for every master node). Therefore, for each kubeadm init we need to patch the configmap as follows:
kubectl get configmap -n kube-system kube-proxy -o yaml > kube-proxy.cm
sudo sed -i 's#server:.*#server: https://
kubectl delete pod -n kube-system -l k8s-app=kube-proxy
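For illustration only, filled in with a hypothetical load-balancer FQDN and an assumed apply step (the truncated sed commands above are left as-is), the whole patch might read:

```shell
LB_FQDN="my-elb.example.com"   # hypothetical load-balancer name

# Dump, rewrite and re-apply the kube-proxy configmap so it points at the LB,
# then recreate the kube-proxy pods.
kubectl get configmap -n kube-system kube-proxy -o yaml > kube-proxy.cm
sed -i "s#server:.*#server: https://${LB_FQDN}:6443#" kube-proxy.cm
kubectl apply -f kube-proxy.cm
kubectl delete pod -n kube-system -l k8s-app=kube-proxy
```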
On each worker we need to patch the kubelet configuration after join, so that the kubelet connects via the load-balancer:
sudo kubeadm join --config=kubeadm-config.yaml
wait_for 60 [ -f /etc/kubernetes/kubelet.conf ]
sudo sed -i 's#server:.*#server: https://
With this approach I seem to have a working cluster where one master at a time can go down without (apiserver) service disruption.
This doesn't appear to harm the correctness of Kubernetes, but I guess it can lead to some performance degradation in large clusters. Any thoughts?
You can switch to the new lease reconciler in 1.9, it should fix the "fighting" over the endpoint issue.
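For reference, the lease reconciler is selected via a kube-apiserver flag (alpha in 1.9); in a static-pod setup this would mean adding it to /etc/kubernetes/manifests/kube-apiserver.yaml, roughly (a sketch of the relevant fragment only, not a full manifest):

spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # switch endpoint reconciling from the default master-count
    # reconciler to the lease-based one
    - --endpoint-reconciler-type=lease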
Excellent advice @klausenbusk. It worked like a charm.
@petergardfjall

"they will simply use the IP address of the apiserver that they happened to connect to at kubeadm join"

What happens if you do kubeadm join with the LB's IP?
In terms of the kubelet, I think that's a necessary manual edit. Need to add to HA guide.
The planned HA features in kubeadm are not going to make it into v1.9 (see #261). So what can be done to make a cluster setup by kubeadm sufficiently HA?
This is what it looks like now:
Hence an active/active or active/passive master setup needs to be created (i.e. mimicking what kubeadm would supposedly be doing in the future):
This seems achievable if converting the existing master instance to a cluster of masters (2) can be done (the Kubernetes guide for building HA clusters seems to indicate so). Active/active would be not more expensive than active/passive.
I am currently working on this. If I succeed I shall share what I find out here.