rancher / k3os

Purpose-built OS for Kubernetes, fully managed by Kubernetes.
https://k3os.io
Apache License 2.0

After k3os-upgrade-rootfs to v0.6.0, master fails to come up #268

Closed · sgielen closed this issue 5 years ago

sgielen commented 5 years ago

I ran k3os-upgrade-rootfs on a v0.5.0 k3os node (karen) and its master (kathleen) simultaneously:

karen [~]$ sudo /usr/sbin/k3os-upgrade-rootfs 
Upgrading k3os to v0.6.0
Upgrade complete! Please reboot.
kathleen [~]$ sudo k3os-upgrade-rootfs 
Upgrading k3os to v0.6.0
Upgrade complete! Please reboot.

However, after rebooting both nodes, the non-master node fails to come up with the following in the logs:

time="2019-11-09T15:46:17.731051020Z" level=info msg="Starting k3s agent v0.10.2 (8833bfd9)"
time="2019-11-09T15:46:17.732214895Z" level=info msg="Running load balancer 127.0.0.1:33887 -> [192.168.178.138:6443 kathleen:6443]"
time="2019-11-09T15:46:17.742921020Z" level=error msg="failed to get CA certs at https://127.0.0.1:33887/cacerts: Get https://127.0.0.1:33887/cacerts: read tcp 127.0.0.1:33322->127.0.0.1:33887: read: connection reset by peer"
time="2019-11-09T15:46:19.755213021Z" level=error msg="failed to get CA certs at https://127.0.0.1:33887/cacerts: Get https://127.0.0.1:33887/cacerts: read tcp 127.0.0.1:33334->127.0.0.1:33887: read: connection reset by peer"
time="2019-11-09T15:46:21.767703688Z" level=error msg="failed to get CA certs at https://127.0.0.1:33887/cacerts: Get https://127.0.0.1:33887/cacerts: read tcp 127.0.0.1:33346->127.0.0.1:33887: read: connection reset by peer"
(...and so on...)

This seems to be because the master is not accepting TCP connections on port 6443 at all. The master seems to be in a restart loop, continuously reporting:

time="2019-11-09T15:46:23.863938606Z" level=info msg="Starting k3s v0.10.2 (8833bfd9)"
time="2019-11-09T15:46:23.887062772Z" level=info msg="Kine listening on unix://kine.sock"
time="2019-11-09T15:46:23.888571272Z" level=info msg="Fetching bootstrap data from etcd"
time="2019-11-09T15:46:23.997808804Z" level=info msg="Running kube-apiserver --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=unknown --authorization-mode=Node,RBAC --basic-auth-file=/var/lib/rancher/k3s/server/cred/passwd --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --enable-admission-plugins=NodeRestriction --etcd-servers=unix://kine.sock --insecure-port=0 --kubelet-certificate-authority=/var/lib/rancher/k3s/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=k3s --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"
Flag --basic-auth-file has been deprecated, Basic authentication mode is deprecated and will be removed in a future release. It is not recommended for production environments.
I1109 15:46:24.000622    2507 server.go:650] external host was not specified, using 192.168.178.138
I1109 15:46:24.001849    2507 server.go:162] Version: v1.16.2-k3s.1
I1109 15:46:24.048867    2507 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I1109 15:46:24.049298    2507 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
I1109 15:46:24.054522    2507 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I1109 15:46:24.054861    2507 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
I1109 15:46:24.179977    2507 master.go:259] Using reconciler: lease
I1109 15:46:24.337468    2507 rest.go:115] the default service ipfamily for this cluster is: IPv4
W1109 15:46:26.165310    2507 genericapiserver.go:404] Skipping API batch/v2alpha1 because it has no resources.
W1109 15:46:26.279300    2507 genericapiserver.go:404] Skipping API node.k8s.io/v1alpha1 because it has no resources.
W1109 15:46:26.393629    2507 genericapiserver.go:404] Skipping API rbac.authorization.k8s.io/v1alpha1 because it has no resources.
W1109 15:46:26.415237    2507 genericapiserver.go:404] Skipping API scheduling.k8s.io/v1alpha1 because it has no resources.
W1109 15:46:26.489484    2507 genericapiserver.go:404] Skipping API storage.k8s.io/v1alpha1 because it has no resources.
W1109 15:46:26.626234    2507 genericapiserver.go:404] Skipping API apps/v1beta2 because it has no resources.
W1109 15:46:26.626414    2507 genericapiserver.go:404] Skipping API apps/v1beta1 because it has no resources.
I1109 15:46:26.694473    2507 plugins.go:158] Loaded 11 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,MutatingAdmissionWebhook,RuntimeClass.
I1109 15:46:26.694659    2507 plugins.go:161] Loaded 7 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,RuntimeClass,ResourceQuota.
time="2019-11-09T15:46:26.730227045Z" level=info msg="Running kube-scheduler --bind-address=127.0.0.1 --kubeconfig=/var/lib/rancher/k3s/server/cred/scheduler.kubeconfig --leader-elect=false --port=10251 --secure-port=0"
time="2019-11-09T15:46:26.744568545Z" level=info msg="Running kube-controller-manager --allocate-node-cidrs=true --bind-address=127.0.0.1 --cluster-cidr=10.42.0.0/16 --cluster-signing-cert-file=/var/lib/rancher/k3s/server/tls/server-ca.crt --cluster-signing-key-file=/var/lib/rancher/k3s/server/tls/server-ca.key --kubeconfig=/var/lib/rancher/k3s/server/cred/controller.kubeconfig --leader-elect=false --port=10252 --root-ca-file=/var/lib/rancher/k3s/server/tls/server-ca.crt --secure-port=0 --service-account-private-key-file=/var/lib/rancher/k3s/server/tls/service.key --use-service-account-credentials=true"
I1109 15:46:26.774206    2507 controllermanager.go:161] Version: v1.16.2-k3s.1
I1109 15:46:26.776853    2507 deprecated_insecure_serving.go:53] Serving insecurely on [::]:10252
I1109 15:46:26.779083    2507 server.go:143] Version: v1.16.2-k3s.1
I1109 15:46:26.779488    2507 defaults.go:91] TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory
W1109 15:46:26.785818    2507 authorization.go:47] Authorization is disabled
W1109 15:46:26.786581    2507 authentication.go:79] Authentication is disabled
I1109 15:46:26.786952    2507 deprecated_insecure_serving.go:51] Serving healthz insecurely on [::]:10251
time="2019-11-09T15:46:36.762328816Z" level=info msg="Waiting for cloudcontroller rbac role to be created"
time="2019-11-09T15:46:36.763608520Z" level=fatal msg="starting tls server: Get https://127.0.0.1:6444/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions: net/http: TLS handshake timeout"
sgielen commented 5 years ago

I'm seeing the same messages from a node newly installed as v0.6.0, so the problem seems to be related either to 0.6.0 itself or to the master having been upgraded from 0.5.0 to 0.6.0.

I could attempt a fresh install of the master to 0.6.0, but I'm keeping things as they are for now in case you want me to try anything on the master. (It's OK to try destructive things; there's nothing important on it.)

sgielen commented 5 years ago

I do see connections being established to localhost:6444 on the master node:

kathleen [~]$ sudo netstat -plantu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp       14      0 127.0.0.1:6444          0.0.0.0:*               LISTEN      2193/k3s
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      1988/sshd
tcp      209      0 127.0.0.1:6444          127.0.0.1:53538         ESTABLISHED -
tcp      209      0 127.0.0.1:6444          127.0.0.1:53536         ESTABLISHED -
tcp      209      0 127.0.0.1:6444          127.0.0.1:53542         ESTABLISHED -
tcp        0      0 127.0.0.1:53548         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp        0      0 127.0.0.1:53536         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp        0      0 127.0.0.1:53546         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp        0      0 127.0.0.1:53550         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp      209      0 127.0.0.1:6444          127.0.0.1:53544         ESTABLISHED -
tcp      209      0 127.0.0.1:6444          127.0.0.1:53526         ESTABLISHED -
tcp      209      0 127.0.0.1:6444          127.0.0.1:53546         ESTABLISHED -
tcp        0      0 127.0.0.1:53526         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp        0   1596 192.168.178.138:22      192.168.178.54:64451    ESTABLISHED 2058/sshd: rancher 
tcp        0      0 127.0.0.1:53538         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp      209      0 127.0.0.1:6444          127.0.0.1:53534         ESTABLISHED -
tcp      209      0 127.0.0.1:6444          127.0.0.1:53528         ESTABLISHED -
tcp        0      0 127.0.0.1:53540         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp        0      0 127.0.0.1:53534         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp      209      0 127.0.0.1:6444          127.0.0.1:53532         ESTABLISHED -
tcp      209      0 127.0.0.1:6444          127.0.0.1:53530         ESTABLISHED -
tcp      209      0 127.0.0.1:6444          127.0.0.1:53548         ESTABLISHED -
tcp        0      0 127.0.0.1:53544         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp        0      0 127.0.0.1:53532         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp        0      0 127.0.0.1:53542         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp      209      0 127.0.0.1:6444          127.0.0.1:53524         ESTABLISHED -
tcp        0      0 127.0.0.1:53530         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp      209      0 127.0.0.1:6444          127.0.0.1:53540         ESTABLISHED -
tcp      209      0 127.0.0.1:6444          127.0.0.1:53550         ESTABLISHED -
tcp        0      0 127.0.0.1:53524         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp        0      0 127.0.0.1:53528         127.0.0.1:6444          ESTABLISHED 2193/k3s
tcp        0      0 :::10251                :::*                    LISTEN      2193/k3s
tcp        0      0 :::10252                :::*                    LISTEN      2193/k3s
tcp        0      0 :::22                   :::*                    LISTEN      1988/sshd
tcp        0      1 2001:985:d85e:1:dea6:32ff:fe51:8cd4:50790 2001:8d8:8b4:c861:5826:fa5f:6690:0:80 SYN_SENT    1241/connmand
udp        0      0 0.0.0.0:52835           0.0.0.0:*                           1241/connmand
udp        0      0 192.168.178.138:55333   192.168.178.1:53        ESTABLISHED 1241/connmand
udp        0      0 192.168.178.138:45472   192.168.178.1:53        ESTABLISHED 1241/connmand
udp        0      0 192.168.178.138:58163   192.168.178.1:53        ESTABLISHED 1241/connmand
udp        0      0 2001:985:d85e:1:dea6:32ff:fe51:8cd4:46578 fd00::ca0e:14ff:fe09:1f57:53 ESTABLISHED 1241/connmand
udp        0      0 2001:985:d85e:1:dea6:32ff:fe51:8cd4:48633 fd00::ca0e:14ff:fe09:1f57:53 ESTABLISHED 1241/connmand
udp        0      0 2001:985:d85e:1:dea6:32ff:fe51:8cd4:39075 fd00::ca0e:14ff:fe09:1f57:53 ESTABLISHED 1241/connmand
udp        0      0 2001:985:d85e:1:dea6:32ff:fe51:8cd4:49894 fd00::ca0e:14ff:fe09:1f57:53 ESTABLISHED 1241/connmand
zimme commented 5 years ago

This is an issue with k3s that has been resolved with rancher/k3s/pull/1007. It's included in the 0.11.0 pre-releases.

This seems to happen mostly on low-spec devices/VMs.

sgielen commented 5 years ago

I downloaded the v0.11.0-alpha2 prerelease of k3s-arm64, placed it at /k3os/system/k3s/v0.11.0-alpha2/k3s (after mount -o remount,rw /k3os/system), updated the symlink at /k3os/system/k3s/current, and rebooted; the steps are sketched at the end of this comment. After the reboot, k3s does wait for the apiserver to come up, as this message is printed after 10 seconds of waiting:

time="2019-11-10T12:13:13.312852148Z" level=info msg="waiting for apiserver to become available"

Then, five seconds later, the apiserver is up. The nodes fail to join it, however, because of this error:

time="2019-11-10T12:14:05.859510419Z" level=error msg="password file '/var/lib/rancher/k3s/server/cred/node-passwd' must have at least 3 columns (password, user name, user uid), found 2"

Removing just the lines for the node attempting to join doesn't help; the file must be cleared completely, after which the node seems to join normally (a sketch of that work-around follows below).
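
For reference, clearing the file amounts to roughly the following (the path is taken from the error message above; back up the file first, since clearing it forces nodes to re-register their passwords; this is a sketch, not a tested procedure):

kathleen [~]$ sudo cp /var/lib/rancher/k3s/server/cred/node-passwd /var/lib/rancher/k3s/server/cred/node-passwd.bak
kathleen [~]$ sudo sh -c ': > /var/lib/rancher/k3s/server/cred/node-passwd'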

There are still various errors in the logs, but the original problem in this issue does appear to be resolved by updating to the 0.11.0 pre-releases.
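
For completeness, the manual out-of-band binary swap described at the top of this comment was roughly the following. The paths are as given above; the download URL and the exact wget/ln invocations are assumptions about how to reproduce the steps, not verbatim commands:

kathleen [~]$ sudo mount -o remount,rw /k3os/system
kathleen [~]$ sudo mkdir -p /k3os/system/k3s/v0.11.0-alpha2
kathleen [~]$ sudo wget -O /k3os/system/k3s/v0.11.0-alpha2/k3s https://github.com/rancher/k3s/releases/download/v0.11.0-alpha2/k3s-arm64
kathleen [~]$ sudo chmod +x /k3os/system/k3s/v0.11.0-alpha2/k3s
kathleen [~]$ sudo ln -sfn v0.11.0-alpha2 /k3os/system/k3s/current
kathleen [~]$ sudo reboot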

zimme commented 5 years ago

I just had a quick look, and it seems @ibuildthecloud refactored the node password functionality, which changed the file format from 2 columns to 3 columns.

I don't know whether the change was intended to be backwards compatible, but since k3s and k3os are both pre-1.0, I don't think so; you can expect breaking changes when upgrading.

dweomer commented 5 years ago

Until things settle down with k3s (KubeCon is coming ...), issues like this keep me leery of in-place upgrades, whether bundled as per the norm or done manually out-of-band as was done to address this here. In other words, @sgielen hit this because of an apparent lack of an upgrade migration (which is to be expected from 0.X projects).

dweomer commented 5 years ago

@sgielen seems to have a viable work-around. Please re-open if the work-around is unsatisfactory.