kubernetes / kubeadm

Aggregator for issues filed against kubeadm
Apache License 2.0

kubeadm upgrade diff loses configuration options #1015

Closed ieugen closed 5 years ago

ieugen commented 6 years ago

What keywords did you search in kubeadm issues before filing this one?

diff, upgrade

Is this a BUG REPORT or FEATURE REQUEST?

Choose one: BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version 
kubeadm version: &version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:50:16Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

kubeadm init \
  --pod-network-cidr=192.168.0.0/16 \
  --apiserver-advertise-address=10.20.0.100 \
  --apiserver-cert-extra-sans=XXXXX,XXXX

What happened?

I'm planning the upgrade from 1.11.0 to 1.11.1. I upgraded the deb packages on all nodes in the cluster and then ran kubeadm upgrade diff to see the differences. I noticed that some configuration options change in a way that will break the cluster, and there are some changes I don't know about:

What you expected to happen?

Upgrade to be performed with minimal/no configuration changes.

How to reproduce it (as minimally and precisely as possible)?

Create a 1.11 cluster with OIDC values and a custom advertise IP, then try to upgrade.
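For illustration, a sketch of such a reproduction setup. All values below are hypothetical placeholders, and the config follows the v1alpha2 MasterConfiguration format that kubeadm v1.11 uses; adjust to your environment:

# repro-config.yaml -- hypothetical example values
apiVersion: kubeadm.k8s.io/v1alpha2
kind: MasterConfiguration
kubernetesVersion: v1.11.0
api:
  advertiseAddress: 10.20.0.100
networking:
  podSubnet: 192.168.0.0/16
apiServerExtraArgs:
  oidc-issuer-url: https://auth.example.com/auth/realms/example
  oidc-client-id: kubernetes
  oidc-groups-claim: groups

kubeadm init --config repro-config.yaml
# after upgrading the kubeadm package to 1.11.1:
kubeadm upgrade diff v1.11.1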

Anything else we need to know?

You are awesome! :)

kubeadm upgrade diff 
--- /etc/kubernetes/manifests/kube-scheduler.yaml
+++ new manifest
@@ -16,7 +16,7 @@
     - --address=127.0.0.1
     - --kubeconfig=/etc/kubernetes/scheduler.conf
     - --leader-elect=true
-    image: k8s.gcr.io/kube-scheduler-amd64:v1.11.0
+    image: k8s.gcr.io/kube-scheduler-amd64:v1.11.1
     imagePullPolicy: IfNotPresent
     livenessProbe:
       failureThreshold: 8
--- /etc/kubernetes/manifests/kube-apiserver.yaml
+++ new manifest
@@ -14,7 +14,7 @@
   - command:
     - kube-apiserver
     - --authorization-mode=Node,RBAC
-    - --advertise-address=10.20.0.100
+    - --advertise-address=REDACTED
     - --allow-privileged=true
     - --client-ca-file=/etc/kubernetes/pki/ca.crt
     - --disable-admission-plugins=PersistentVolumeLabel
@@ -40,15 +40,12 @@
     - --service-cluster-ip-range=10.96.0.0/12
     - --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
     - --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
-    - --oidc-issuer-url=https://auth.REDACTED/auth/realms/gpi-infra
-    - --oidc-client-id=kubernetes
-    - --oidc-groups-claim=groups
-    image: k8s.gcr.io/kube-apiserver-amd64:v1.11.0
+    image: k8s.gcr.io/kube-apiserver-amd64:v1.11.1
     imagePullPolicy: IfNotPresent
     livenessProbe:
       failureThreshold: 8
       httpGet:
-        host: 10.20.0.100
+        host: REDACTED
         path: /healthz
         port: 6443
         scheme: HTTPS
--- /etc/kubernetes/manifests/kube-controller-manager.yaml
+++ new manifest
@@ -14,18 +14,15 @@
   - command:
     - kube-controller-manager
     - --address=127.0.0.1
-    - --allocate-node-cidrs=true
-    - --cluster-cidr=192.168.0.0/16
     - --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
     - --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
     - --controllers=*,bootstrapsigner,tokencleaner
     - --kubeconfig=/etc/kubernetes/controller-manager.conf
     - --leader-elect=true
-    - --node-cidr-mask-size=24
     - --root-ca-file=/etc/kubernetes/pki/ca.crt
     - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
     - --use-service-account-credentials=true
-    image: k8s.gcr.io/kube-controller-manager-amd64:v1.11.0
+    image: k8s.gcr.io/kube-controller-manager-amd64:v1.11.1
     imagePullPolicy: IfNotPresent
     livenessProbe:
       failureThreshold: 8
@@ -41,6 +38,15 @@
       requests:
         cpu: 200m
     volumeMounts:
+    - mountPath: /usr/local/share/ca-certificates
+      name: usr-local-share-ca-certificates
+      readOnly: true
+    - mountPath: /etc/ca-certificates
+      name: etc-ca-certificates
+      readOnly: true
+    - mountPath: /etc/kubernetes/pki
+      name: k8s-certs
+      readOnly: true
     - mountPath: /etc/ssl/certs
       name: ca-certs
       readOnly: true
@@ -52,22 +58,9 @@
     - mountPath: /usr/share/ca-certificates
       name: usr-share-ca-certificates
       readOnly: true
-    - mountPath: /usr/local/share/ca-certificates
-      name: usr-local-share-ca-certificates
-      readOnly: true
-    - mountPath: /etc/ca-certificates
-      name: etc-ca-certificates
-      readOnly: true
-    - mountPath: /etc/kubernetes/pki
-      name: k8s-certs
-      readOnly: true
   hostNetwork: true
   priorityClassName: system-cluster-critical
   volumes:
-  - hostPath:
-      path: /usr/local/share/ca-certificates
-      type: DirectoryOrCreate
-    name: usr-local-share-ca-certificates
   - hostPath:
       path: /etc/ca-certificates
       type: DirectoryOrCreate
@@ -92,5 +85,9 @@
       path: /usr/share/ca-certificates
       type: DirectoryOrCreate
     name: usr-share-ca-certificates
+  - hostPath:
+      path: /usr/local/share/ca-certificates
+      type: DirectoryOrCreate
+    name: usr-local-share-ca-certificates
 status: {}
luxas commented 6 years ago

/assign @liztio @timothysc

timothysc commented 6 years ago

@ieugen I'd recommend using the configuration migrate utility prior to attempting the upgrade. The configuration file format changed significantly from v1.10 to v1.11, but folks have done a good job of testing that migration.

ieugen commented 6 years ago

@timothysc I've installed 1.11 and I am upgrading to 1.11.1, so there should not be much to migrate. I did use the utility and got these results:

kubeadm config view > kubeadm-old.yaml
kubeadm config migrate --old-config kubeadm-old.yaml > kubeadm-new.yaml
diff kubeadm-old.yaml kubeadm-new.yaml 

10d9
<   oidc-issuer-url: https://REDACTED
12a12
>   oidc-issuer-url: https://REDACTED
17a18,25
> bootstrapTokens:
> - groups:
>   - system:bootstrappers:kubeadm:default-node-token
>   token: REDACTED
>   ttl: 24h0m0s
>   usages:
>   - signing
>   - authentication
137c145,150
< nodeRegistration: {}
---
> nodeRegistration:
>   criSocket: /var/run/dockershim.sock
>   name: m01
>   taints:
>   - effect: NoSchedule
>     key: node-role.kubernetes.io/master
wizard580 commented 6 years ago

Confirming. In my case (1.11.0 -> 1.11.1) it loses apiServerExtraArgs such as etcd-cafile, feature-gates, etc., and replaces them with some defaults.

I can find the expected values inside the configmap (key: MasterConfiguration), e.g. with kubectl get configmap -n kube-system kubeadm-config -oyaml
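For reference, the stored config can also be pulled out on its own for comparison; a minimal sketch, assuming the default kubeadm-config ConfigMap name and the MasterConfiguration data key that 1.11 uses:

# print only the kubeadm configuration stored in the cluster
kubectl -n kube-system get configmap kubeadm-config \
  -o jsonpath='{.data.MasterConfiguration}' > stored-config.yaml
# the extra args (oidc-*, etcd-*, feature-gates) should still be listed under apiServerExtraArgs
grep -A5 apiServerExtraArgs stored-config.yaml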

ieugen commented 6 years ago

I've made the upgrade and it went smoothly, so I am a bit confused about this. I also rebooted the cluster (one node at a time, starting with the master) to see if there were any issues, and I did not see any.

I don't remember having to change anything after the upgrade and I did not document it :(.

Regards,

mkretzer commented 6 years ago

We lost networking to the pods after the upgrade from 1.10.6 to 1.11.1. It looks like --cluster-cidr is no longer applied, as all our pods came up with IPs from 172.17.x.x instead of the 10.244.x.x range configured for flannel. How can we resolve this situation?
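One way to check whether the flag was actually dropped on the control-plane node (rather than just misreported by the diff) is to inspect the regenerated static manifest and the live process; a sketch, assuming the default kubeadm paths:

# is --cluster-cidr still present in the regenerated manifest?
grep -- --cluster-cidr /etc/kubernetes/manifests/kube-controller-manager.yaml
# and on the running controller-manager process?
ps aux | grep [k]ube-controller-manager | grep -o -- '--cluster-cidr=[^ ]*'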

wizard580 commented 6 years ago

Even better: ATM I'm at v1.11.0, and kubeadm upgrade diff v1.11.0 gives me the same broken result.

--- /etc/kubernetes/manifests/kube-apiserver.yaml
+++ new manifest
@@ -14,17 +14,16 @@
   - command:
     - kube-apiserver
     - --authorization-mode=Node,RBAC
-    - --etcd-cafile=/opt/etcd/ca.pem
-    - --etcd-certfile=/opt/etcd/staging-cluster2node.pem
-    - --etcd-keyfile=/opt/etcd/staging-cluster2node-key.pem
-    - --feature-gates=PodPriority=false
     - --advertise-address=192.168.6.161
     - --allow-privileged=true
     - --client-ca-file=/etc/kubernetes/pki/ca.crt
     - --disable-admission-plugins=PersistentVolumeLabel
     - --enable-admission-plugins=NodeRestriction
     - --enable-bootstrap-token-auth=true
-    - --etcd-servers=https://192.168.6.161:2379,https://192.168.6.162:2379,https://192.168.6.163:2379
+    - --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
+    - --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
+    - --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
+    - --etcd-servers=https://127.0.0.1:2379
     - --insecure-port=0
     - --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt
....
ieugen commented 6 years ago

@wizard580 In my case the upgrade went OK. No issues with the cluster (and I'm also running on top of a WireGuard VPN).

wizard580 commented 6 years ago

I'll try tomorrow after a backup. But anyway, a broken diff is a bug; from my perspective, a major one.

mkretzer commented 6 years ago

For us this is not only a broken diff, as the node CIDR really seems to get lost.

wizard580 commented 6 years ago

Upgrading with kubeadm upgrade apply v1.11.1 worked fine; the configs are not broken as far as I can see. It generated unneeded etcd certs, but they are ignored by our configs/setup.

mkretzer commented 6 years ago

For us the upgrade also did not seem broken at first, but after uncordoning the upgraded nodes and draining the old ones, our application went down right away because the pods all used the wrong IPs.

wizard580 commented 6 years ago

Confirming. We found similar issues: in our case IPVS was stuck with old service-to-pod mappings. Check the kube-proxy logs and you'll probably find a lot of errors about ipset. Rebooting the nodes helped us. Observing...
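For anyone chasing the same symptom, something along these lines should surface it (assuming kube-proxy runs as the stock DaemonSet with the k8s-app=kube-proxy label, and that ipvsadm/ipset are installed on the node):

# scan kube-proxy logs for ipset/ipvs errors
kubectl -n kube-system logs -l k8s-app=kube-proxy --tail=200 | grep -iE 'ipset|ipvs'
# inspect the current IPVS service-to-backend mappings on a node
ipvsadm -Ln
# list the ipset sets kube-proxy manages
ipset list | head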

liztio commented 6 years ago

Can still reproduce this in the latest v1.12.0 alpha. Gonna see if I can sort this out before the code freeze.

timothysc commented 6 years ago

ETOOCOMPLICATED, punting to 1.13

ieugen commented 6 years ago

Some updates,

I've done the upgrades to 1.11.2 and 1.11.3 without any issue. Every time I performed an upgrade, the diff showed it was dropping the information; however, that does not actually seem to happen. At this point I believe it is just bad reporting.

mkretzer commented 6 years ago

@ieugen Minor upgrades were also not affected here, but every major upgrade (1.10.x -> 1.11.x) was!

Brightside56 commented 6 years ago

We lost networking to the pods after the upgrade from 1.10.6 to 1.11.1. It looks like --cluster-cidr is no longer applied, as all our pods came up with IPs from 172.17.x.x instead of the 10.244.x.x range configured for flannel. How can we resolve this situation?

@mkretzer In my case the worker nodes' kubelet loses its network parameters during upgrades; my personal fix is:

echo "KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni" > /var/lib/kubelet/kubeadm-flags.env
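If you apply that workaround, the kubelet presumably needs a restart to pick up the rewritten flags; a minimal follow-up sketch, assuming a systemd-managed kubelet:

# confirm the file now carries the expected flags
cat /var/lib/kubelet/kubeadm-flags.env
# restart the kubelet so the new KUBELET_KUBEADM_ARGS take effect
systemctl restart kubelet
systemctl status kubelet --no-pager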

neolit123 commented 6 years ago

since 1.11, /var/lib/kubelet/kubeadm-flags.env is a file that kubeadm init and join generate automatically at runtime each time: https://kubernetes.io/docs/setup/independent/kubelet-integration/#the-kubelet-drop-in-file-for-systemd

if you write it:

* before `init/join`, kubeadm will overwrite it and discard its contents.

* after `init/join`, kubeadm or the kubelet will not use it.
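For context, on a stock kubeadm package install the env file is wired into the kubelet through its systemd drop-in, so whatever ends up in KUBELET_KUBEADM_ARGS only takes effect when the kubelet (re)starts. Roughly (an excerpt; exact paths and variable names may differ per distro and package version):

# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf (excerpt)
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
EnvironmentFile=-/etc/default/kubelet
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS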

Brightside56 commented 6 years ago

since 1.11, /var/lib/kubelet/kubeadm-flags.env is a file that kubeadm init and join generate automatically at runtime each time: https://kubernetes.io/docs/setup/independent/kubelet-integration/#the-kubelet-drop-in-file-for-systemd

if you write it:

* before `init/join`, kubeadm will overwrite it and discard its contents.

* after `init/join`, kubeadm or the kubelet will not use it.

It's great, but kubeadm init/join wasn't run during the cluster upgrade, and the cgroup/CNI args were lost on the worker nodes; that's why the pods had 172.0.0.x IPs.

neolit123 commented 6 years ago

It's great, but kubeadm init/join wasn't run during the cluster upgrade, and the cgroup/CNI args were lost on the worker nodes; that's why the pods had 172.0.0.x IPs.

that makes the issue valid.

rdodev commented 5 years ago

On it.

mkretzer commented 5 years ago

@mkretzer In my case the worker nodes' kubelet loses its network parameters during upgrades; my personal fix is echo "KUBELET_KUBEADM_ARGS=--cgroup-driver=cgroupfs --network-plugin=cni" > /var/lib/kubelet/kubeadm-flags.env

That helped, thank you very much! For all our clusters: it's upgrade time! :-)

adoerler commented 5 years ago

@neolit123

that makes the issue valid.

I've added my notes about this issue (upgrading the cluster) over here: https://github.com/kubernetes/kubeadm/issues/1347#issuecomment-456739287

neolit123 commented 5 years ago

@adoerler it seems like the unit file issue you outlined here is a separate one: https://github.com/kubernetes/kubeadm/issues/1347#issuecomment-456739287

but you are right, we do recommend using package managers in recent versions, and by using a package manager the unit file will be updated as well. I guess that was a problem in the ->1.12 upgrade doc.

neolit123 commented 5 years ago

as far as this issue goes, we are pushing a fix for a certain bug in our library for DIFF: https://github.com/kubernetes/kubernetes/pull/73941

but this will only land in 1.14 and cannot be backported to older releases.

i'm going to have to close this issue, but if anyone finds a problem related to DIFF in 1.13 -> 1.14 upgrades, please feel free to open a new ticket.