kubernetes / kubeadm

Aggregator for issues filed against kubeadm

kubeadm upgrade on arm from 1.8.5 -> 1.9.0 fails #599

Closed: brendandburns closed this issue 6 years ago

brendandburns commented 6 years ago

What keywords did you search in kubeadm issues before filing this one?

upgrade, TLS

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): 1.9.0

Environment:

What happened?

Tried kubeadm upgrade, which timed out.

Manually copied in the kube-apiserver.yaml that kubeadm generated.
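
Roughly what that manual step looks like (a sketch; the temporary directory name is taken from the kubeadm output further down in this thread and changes on every run):

# copy the manifest kubeadm staged into the live static pod directory;
# the kubelet picks up changes to /etc/kubernetes/manifests on its own
cp /etc/kubernetes/tmp/kubeadm-upgraded-manifests*/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml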

What you expected to happen?

Upgrade to 1.9.0 should work.

How to reproduce it (as minimally and precisely as possible)?

Install a 1.8.5 cluster, then upgrade to 1.9.0 using kubeadm.
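
Roughly the commands involved on the master (a sketch; installing the 1.8.5 cluster and upgrading the kubeadm binary itself to 1.9.0 are assumed to have happened already):

# check what kubeadm would do, then apply the upgrade
kubeadm upgrade plan
kubeadm upgrade apply v1.9.0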

Anything else we need to know?

Apiserver logs look like:

E1218 04:40:42.704397       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.Role: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/roles?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.705841       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.ResourceQuota: Get https://127.0.0.1:6443/api/v1/resourcequotas?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.707026       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.ClusterRole: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.708110       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.RoleBinding: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.709105       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.ServiceAccount: Get https://127.0.0.1:6443/api/v1/serviceaccounts?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.710080       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Pod: Get https://127.0.0.1:6443/api/v1/pods?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.711157       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.PersistentVolume: Get https://127.0.0.1:6443/api/v1/persistentvolumes?limit=500&resourceVersion=0: net/http: TLS handshake timeout
E1218 04:40:42.712340       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *rbac.ClusterRoleBinding: Get https://127.0.0.1:6443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings?limit=500&resourceVersion=0: net/http: TLS handshake timeout
I1218 04:40:42.717755       1 logs.go:41] http: TLS handshake error from 10.0.0.3:44016: EOF
I1218 04:40:42.746483       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39322: EOF
I1218 04:40:42.792235       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39326: EOF
I1218 04:40:42.873760       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36825: EOF
I1218 04:40:42.887385       1 logs.go:41] http: TLS handshake error from 10.0.0.3:44010: EOF
I1218 04:40:42.906466       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59682: EOF
I1218 04:40:42.961715       1 logs.go:41] http: TLS handshake error from 10.0.0.2:46824: EOF
I1218 04:40:42.983181       1 logs.go:41] http: TLS handshake error from 10.0.0.4:42166: EOF
I1218 04:40:43.035847       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36844: EOF
I1218 04:40:43.073853       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39706: EOF
I1218 04:40:43.101099       1 logs.go:41] http: TLS handshake error from 10.0.0.3:43986: EOF
I1218 04:40:43.106547       1 logs.go:41] http: TLS handshake error from 10.0.0.2:46846: EOF
I1218 04:40:43.124883       1 logs.go:41] http: TLS handshake error from 10.0.0.2:59200: EOF
I1218 04:40:43.135636       1 logs.go:41] http: TLS handshake error from 10.0.0.2:38988: EOF
I1218 04:40:43.139734       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39344: EOF
I1218 04:40:43.276876       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59676: read tcp 127.0.0.1:6443->127.0.0.1:59676: read: connection reset by peer
I1218 04:40:43.295881       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36894: EOF
I1218 04:40:43.328730       1 logs.go:41] http: TLS handshake error from 10.0.0.2:39052: EOF
I1218 04:40:43.437586       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59668: EOF
I1218 04:40:43.457870       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59684: read tcp 127.0.0.1:6443->127.0.0.1:59684: read: connection reset by peer
I1218 04:40:43.463332       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39698: EOF
I1218 04:40:43.482961       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40512: EOF
I1218 04:40:43.543943       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39312: EOF
I1218 04:40:43.598015       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39330: EOF
I1218 04:40:43.638007       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36856: EOF
I1218 04:40:43.661470       1 logs.go:41] http: TLS handshake error from 10.0.0.3:58758: EOF
I1218 04:40:43.685554       1 logs.go:41] http: TLS handshake error from 10.0.0.3:44012: EOF
I1218 04:40:43.710389       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39711: EOF
I1218 04:40:43.714225       1 logs.go:41] http: TLS handshake error from 10.0.0.2:46822: EOF
I1218 04:40:43.720630       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39400: EOF
I1218 04:40:43.741250       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59654: EOF
I1218 04:40:43.947767       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39404: EOF
E1218 04:40:43.949289       1 client_ca_hook.go:78] Post https://127.0.0.1:6443/api/v1/namespaces: net/http: TLS handshake timeout
F1218 04:40:43.950279       1 controller.go:133] Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://127.0.0.1:6443/api/v1/services: net/http: TLS handshake timeout
I1218 04:40:44.639152       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40712: EOF
I1218 04:40:46.267009       1 logs.go:41] http: TLS handshake error from 10.0.0.4:42148: EOF
I1218 04:40:46.267803       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39664: EOF
I1218 04:40:46.268393       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40482: EOF
I1218 04:40:46.268963       1 logs.go:41] http: TLS handshake error from 10.0.0.1:39350: EOF
I1218 04:40:46.269512       1 logs.go:41] http: TLS handshake error from 10.0.0.4:36906: EOF
I1218 04:40:46.269994       1 logs.go:41] http: TLS handshake error from 10.0.0.2:40474: EOF
I1218 04:40:46.270533       1 logs.go:41] http: TLS handshake error from 127.0.0.1:59686: EOF
luxas commented 6 years ago

Is etcd still working? Can you paste the output of kubeadm upgrade? I could try to reproduce this as well on an ARM machine. We have automated upgrade tests running for the normal case, so I guess this might be something arm32-specific?

brendandburns commented 6 years ago

etcd is still working (though I had to manually upgrade etcd to 3.1.10, because the kubeadm upgrade timed out before etcd came back up when it tried to upgrade it).
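
For reference, a sketch of that kind of manual etcd bump. The image name and old tag here are assumptions (kubeadm 1.8 shipped etcd 3.0.17 by default, and arm clusters typically used the etcd-arm image); adjust them to whatever your manifest actually references:

# switch the etcd static pod to a 3.1.10 image; the kubelet restarts it automatically
sed -i 's/etcd-arm:3\.0\.17/etcd-arm:3.1.10/' /etc/kubernetes/manifests/etcd.yaml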

When I revert back to the old 1.8.5 apiserver, the whole cluster snaps back into correct operation.

I'll try the upgrade again this evening and I'll send in more detailed logs.

brendandburns commented 6 years ago

Here's the output from kubeadm

[upgrade/version] You have chosen to change the cluster version to "v1.9.0"
[upgrade/versions] Cluster version: v1.8.5
[upgrade/versions] kubeadm version: v1.9.0
[upgrade/confirm] Are you sure you want to proceed with the upgrade? [y/N]: y
[upgrade/prepull] Will prepull images for components [kube-apiserver kube-controller-manager kube-scheduler]
[upgrade/apply] Upgrading your Static Pod-hosted control plane to version "v1.9.0"...
[upgrade/staticpods] Writing new Static Pod manifests to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/tmp/kubeadm-upgraded-manifests105458021/kube-scheduler.yaml"
[upgrade/staticpods] Moved new manifest to "/etc/kubernetes/manifests/kube-apiserver.yaml" and backed up old manifest to "/etc/kubernetes/tmp/kubeadm-backup-manifests586955648/kube-apiserver.yaml"
[upgrade/staticpods] Waiting for the kubelet to restart the component
[upgrade/apply] FATAL: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: [timed out waiting for the condition]
brendandburns commented 6 years ago

So I dug into this a little more. I think there are two underlying issues:

1) The default "time-to-healthy" for the apiserver is too short (at least on my rpis...). It is set to 15 seconds, but it takes longer than that for the apiserver to come up on my node. Changing it to 300 fixed things; this should probably be configurable in kubeadm (see the sketch after this list).

2) Kubernetes 1.9.0 appears to be right on the edge in terms of memory use for what an rpi stack can handle. At steady state, my master node has ~60MB of RAM free, and when an apiserver is just coming up and under heavy load from various components, it drops even lower than that.
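
A minimal sketch of the change described in 1), assuming the "time-to-healthy" here is the liveness probe's initialDelaySeconds in the kubeadm-generated static pod manifest (the 15-second default matches the value mentioned above; exact field values may differ by version):

# raise the apiserver liveness probe's initial delay from 15s to 300s;
# the kubelet reloads static pod manifests automatically
sed -i 's/initialDelaySeconds: 15/initialDelaySeconds: 300/' /etc/kubernetes/manifests/kube-apiserver.yaml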

Not too much can be done here. I pulled a profile, and though there are some improvements that could help things, there's no low-hanging fruit...

The "right" answer would be to move etcd or some other component to a different node to relieve some of the memory pressure.

0xmichalis commented 6 years ago

Kubernetes 1.9.0 appears to be right on the edge in terms of memory use for what an rpi stack can handle. At steady state, my master node has ~60MB of RAM free, and when an apiserver is just coming up and under heavy load from various components, it drops even lower than that.

Our docs already suggest using machines with at least 2GB of RAM. This is unfortunate for rpis, but there are other ARM options, like the odroid c2, that cover the requirements and are known to run k8s (and to outperform rpis). I am waiting for two rock64 machines with 4GB of RAM each, hoping to get them working, too.

Closing in favor of https://github.com/kubernetes/kubeadm/issues/644

/close

0xmichalis commented 6 years ago

Also, this may be an issue with the OS you are running. I am also using raspberry pis, running the stock raspbian lite image, and have performed every upgrade since 1.7 successfully (up to the latest, 1.9.1). There are even more lightweight alternatives like dietpi, which I can confirm works like a dream on an rpi as a k8s node.