kubernetes/kubeadm

Kubeadm update to 1.10 fails on HA k8s/etcd cluster #837

brokenmass closed this issue 6 years ago

brokenmass commented 6 years ago

BUG REPORT

Versions

kubeadm version: 1.10.2

Environment:

What happened?

A couple of months ago I created a Kubernetes 1.9.3 HA cluster using kubeadm 1.9.3, following the 'official' documentation (https://kubernetes.io/docs/setup/independent/high-availability/) and setting up the etcd HA cluster on the master nodes using static pods.

I wanted to upgrade my cluster to k8s 1.10.2 using the latest kubeadm. After updating kubeadm, running kubeadm upgrade plan produced the following error:

[root@shared-cob-01 tmp]# kubeadm upgrade plan
[preflight] Running pre-flight checks.
[upgrade] Making sure the cluster is healthy:
[upgrade/config] Making sure the configuration is correct:
[upgrade/config] Reading configuration from the cluster...
[upgrade/config] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[upgrade/plan] computing upgrade possibilities
[upgrade] Fetching available versions to upgrade to
[upgrade/versions] Cluster version: v1.9.3
[upgrade/versions] kubeadm version: v1.10.2
[upgrade/versions] Latest stable version: v1.10.2
[upgrade/versions] FATAL: context deadline exceeded

I investigated the issue and found two root causes:

1) kubeadm doesn't identify the etcd cluster as TLS enabled

The guide instructs you to use the following command in the etcd static pod:

etcd --name <name> \
  --data-dir /var/lib/etcd \
  --listen-client-urls http://localhost:2379 \
  --advertise-client-urls http://localhost:2379 \
  --listen-peer-urls http://localhost:2380 \
  --initial-advertise-peer-urls http://localhost:2380 \
  --cert-file=/certs/server.pem \
  --key-file=/certs/server-key.pem \
  --client-cert-auth \
  --trusted-ca-file=/certs/ca.pem \
  --peer-cert-file=/certs/peer.pem \
  --peer-key-file=/certs/peer-key.pem \
  --peer-client-cert-auth \
  --peer-trusted-ca-file=/certs/ca.pem \
  --initial-cluster etcd0=https://<etcd0-ip-address>:2380,etcd1=https://<etcd1-ip-address>:2380,etcd2=https://<etcd2-ip-address>:2380 \
  --initial-cluster-token my-etcd-token \
  --initial-cluster-state new

kubeadm >= 1.10 determines whether etcd has TLS enabled by checking for the presence of the following flags in the static pod command (see https://github.com/kubernetes/kubernetes/blob/release-1.10/cmd/kubeadm/app/util/etcd/etcd.go#L56):

"--cert-file=",
"--key-file=",
"--trusted-ca-file=",
"--client-cert-auth=",
"--peer-cert-file=",
"--peer-key-file=",
"--peer-trusted-ca-file=",
"--peer-client-cert-auth=",

But because the flags --client-cert-auth and --peer-client-cert-auth are used in the instructions without any value (they are booleans), kubeadm didn't recognise the etcd cluster as TLS enabled.
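
As a rough illustration, the check boils down to a substring match that every flag, including the booleans, must pass with a trailing '='. A shell approximation (not kubeadm's actual code; /etc/kubernetes/manifests/etcd.yaml is the default manifest path and may differ on your setup):

  # Each TLS flag must appear followed by '='; a bare boolean flag fails.
  for flag in --cert-file --key-file --trusted-ca-file --client-cert-auth \
              --peer-cert-file --peer-key-file --peer-trusted-ca-file \
              --peer-client-cert-auth; do
    grep -q -- "${flag}=" /etc/kubernetes/manifests/etcd.yaml \
      || echo "${flag}= not found -> etcd treated as non-TLS"
  done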

PERSONAL FIX: I updated my etcd static pod command to use --client-cert-auth=true and --peer-client-cert-auth=true.
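
In the static pod manifest's command list, the two changed entries look like this:

  - --client-cert-auth=true
  - --peer-client-cert-auth=true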

GENERAL FIX: Update the instructions to use --client-cert-auth=true and --peer-client-cert-auth=true, and relax the kubeadm checks to match "--peer-cert-file" and "--peer-key-file" (without the equals sign).

2) kubeadm didn't use the correct certificates

After fixing point 1, the problem still persisted because kubeadm was not using the right certificates. Following the kubeadm HA guide, the certificates created are ca.pem, ca-key.pem, peer.pem, peer-key.pem, client.pem and client-key.pem, but the latest kubeadm expects ca.crt, ca.key, peer.crt, peer.key, healthcheck-client.crt and healthcheck-client.key instead. The kubeadm-config MasterConfiguration keys etcd.caFile, etcd.certFile and etcd.keyFile are ignored.

PERSONAL FIX: I renamed the .pem certificates to their .crt and .key equivalents and updated the etcd static pod configuration to use them.
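
Something like the following (a sketch only: the directory and the client.pem -> healthcheck-client.crt pairing are my assumptions; adjust to wherever your etcd certificates actually live on the host):

  cd /etc/kubernetes/pki/etcd
  cp ca.pem ca.crt
  cp ca-key.pem ca.key
  cp peer.pem peer.crt
  cp peer-key.pem peer.key
  cp client.pem healthcheck-client.crt        # assumed pairing
  cp client-key.pem healthcheck-client.key    # assumed pairing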

GENERAL FIX: Use the caFile, certFile and keyFile values from the kubeadm-config, infer the right certificates from the etcd static pod definition (pod path + volume hostPaths), and/or create a new temporary client certificate to use during the upgrade.

What you expected to happen?

The upgrade plan should have executed correctly.

How to reproduce it (as minimally and precisely as possible)?

Create a k8s HA cluster using kubeadm 1.9.3 following https://kubernetes.io/docs/setup/independent/high-availability/ and try to upgrade it to k8s >= 1.10 using the latest kubeadm.

brokenmass commented 6 years ago

This issue seems to be fixed in kubeadm 1.10.3, even though it will not automatically update the static etcd pod, as it recognises it as 'external'.

FloMedja commented 6 years ago

I am using kubeadm 1.10.3 and have the same issues. My cluster is 1.10.2 with an external, secured etcd.

FloMedja commented 6 years ago

@brokenmass Do the values for your personal fix to the second root cause look like this:

  caFile: /etc/kubernetes/pki/etcd/ca.crt
  certFile: /etc/kubernetes/pki/etcd/healthcheck-client.crt
  keyFile: /etc/kubernetes/pki/etcd/healthcheck-client.key

FloMedja commented 6 years ago

@detiber can you help please ?

brokenmass commented 6 years ago

@FloMedja in my case the values look like:

  caFile: /etc/kubernetes/pki/etcd/ca.pem
  certFile: /etc/kubernetes/pki/etcd/client.pem
  keyFile: /etc/kubernetes/pki/etcd/client-key.pem

and 1.10.3 is working correctly.

FloMedja commented 6 years ago

@brokenmass So with kubeadm 1.10.3 everything works, without any need for your personal fixes? In that case I am a little confused: I have kubeadm 1.10.3 but get the same error message that you mention in this bug report. I will double-check my config; maybe I made a mistake elsewhere.

brokenmass commented 6 years ago

Add here (or join the Kubernetes Slack and send me a direct message) your kubeadm-config, your etcd static pod YAML, and the full output of kubeadm upgrade plan.

detiber commented 6 years ago

My apologies, I'm just now seeing this. @chuckha did the original work for the static-pod HA etcd docs; I'll work with him over the next couple of days to see if we can help straighten out the HA upgrades.

FloMedja commented 6 years ago

@detiber thank you. The upgrade plan finally works, but I face some race condition issues when trying to upgrade the cluster. Sometimes it works, sometimes I get the same error as kubernetes/kubeadm/issues/850: kubeadm runs into a race condition when trying to restart a pod on one node.

detiber commented 6 years ago

I ran into some snags getting a test env set up for this today and I'm running out of time before my weekend starts. I'll pick back up on this early next week.

timothysc commented 6 years ago

/assign @chuckha @detiber

luxas commented 6 years ago

@chuckha @detiber @stealthybox any update on this?

timothysc commented 6 years ago

So the 1.9->1.10 HA upgrade was not a supported or vetted path.

We are currently updating our docs for 1.11->1.12, which we do plan to maintain going forward.