I'm seeing the same messages from a node newly installed as v0.6.0, so the problem is either 0.6.0-related or related to the master having been upgraded from 0.5.0 to 0.6.0.
I could attempt a fresh install of the master to 0.6.0, but I'm keeping things as they are for now in case you want me to try anything on the master. (It's OK to try destructive things; there's nothing important on it.)
I do see connections being established to localhost:6444 on the master node:
```
kathleen [~]$ sudo netstat -plantu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 14 0 127.0.0.1:6444 0.0.0.0:* LISTEN 2193/k3s
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1988/sshd
tcp 209 0 127.0.0.1:6444 127.0.0.1:53538 ESTABLISHED -
tcp 209 0 127.0.0.1:6444 127.0.0.1:53536 ESTABLISHED -
tcp 209 0 127.0.0.1:6444 127.0.0.1:53542 ESTABLISHED -
tcp 0 0 127.0.0.1:53548 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 0 0 127.0.0.1:53536 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 0 0 127.0.0.1:53546 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 0 0 127.0.0.1:53550 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 209 0 127.0.0.1:6444 127.0.0.1:53544 ESTABLISHED -
tcp 209 0 127.0.0.1:6444 127.0.0.1:53526 ESTABLISHED -
tcp 209 0 127.0.0.1:6444 127.0.0.1:53546 ESTABLISHED -
tcp 0 0 127.0.0.1:53526 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 0 1596 192.168.178.138:22 192.168.178.54:64451 ESTABLISHED 2058/sshd: rancher
tcp 0 0 127.0.0.1:53538 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 209 0 127.0.0.1:6444 127.0.0.1:53534 ESTABLISHED -
tcp 209 0 127.0.0.1:6444 127.0.0.1:53528 ESTABLISHED -
tcp 0 0 127.0.0.1:53540 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 0 0 127.0.0.1:53534 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 209 0 127.0.0.1:6444 127.0.0.1:53532 ESTABLISHED -
tcp 209 0 127.0.0.1:6444 127.0.0.1:53530 ESTABLISHED -
tcp 209 0 127.0.0.1:6444 127.0.0.1:53548 ESTABLISHED -
tcp 0 0 127.0.0.1:53544 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 0 0 127.0.0.1:53532 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 0 0 127.0.0.1:53542 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 209 0 127.0.0.1:6444 127.0.0.1:53524 ESTABLISHED -
tcp 0 0 127.0.0.1:53530 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 209 0 127.0.0.1:6444 127.0.0.1:53540 ESTABLISHED -
tcp 209 0 127.0.0.1:6444 127.0.0.1:53550 ESTABLISHED -
tcp 0 0 127.0.0.1:53524 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 0 0 127.0.0.1:53528 127.0.0.1:6444 ESTABLISHED 2193/k3s
tcp 0 0 :::10251 :::* LISTEN 2193/k3s
tcp 0 0 :::10252 :::* LISTEN 2193/k3s
tcp 0 0 :::22 :::* LISTEN 1988/sshd
tcp 0 1 2001:985:d85e:1:dea6:32ff:fe51:8cd4:50790 2001:8d8:8b4:c861:5826:fa5f:6690:0:80 SYN_SENT 1241/connmand
udp 0 0 0.0.0.0:52835 0.0.0.0:* 1241/connmand
udp 0 0 192.168.178.138:55333 192.168.178.1:53 ESTABLISHED 1241/connmand
udp 0 0 192.168.178.138:45472 192.168.178.1:53 ESTABLISHED 1241/connmand
udp 0 0 192.168.178.138:58163 192.168.178.1:53 ESTABLISHED 1241/connmand
udp 0 0 2001:985:d85e:1:dea6:32ff:fe51:8cd4:46578 fd00::ca0e:14ff:fe09:1f57:53 ESTABLISHED 1241/connmand
udp 0 0 2001:985:d85e:1:dea6:32ff:fe51:8cd4:48633 fd00::ca0e:14ff:fe09:1f57:53 ESTABLISHED 1241/connmand
udp 0 0 2001:985:d85e:1:dea6:32ff:fe51:8cd4:39075 fd00::ca0e:14ff:fe09:1f57:53 ESTABLISHED 1241/connmand
udp 0 0 2001:985:d85e:1:dea6:32ff:fe51:8cd4:49894 fd00::ca0e:14ff:fe09:1f57:53 ESTABLISHED 1241/connmand
```
Note the non-zero Recv-Q on the 6444 side of several of those connections: data appears to be queued for k3s but not being read.
This is an issue with k3s that has been resolved by rancher/k3s/pull/1007; the fix is included in the 0.11.0 pre-releases.
This seems to happen mostly on low-spec devices/VMs.
I downloaded the v0.11.0-alpha2 pre-release of k3s-arm64, placed it at /k3os/system/k3s/v0.11.0-alpha2/k3s (after a mount -o remount,rw /k3os/system), updated the symlink at /k3os/system/k3s/current, and rebooted.
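For reference, a rough sketch of those steps; the download URL and the symlink layout are assumptions based on the usual rancher/k3s release naming and the k3OS system layout:

```
# Make the normally read-only system partition writable
mount -o remount,rw /k3os/system

# Fetch the pre-release binary (URL assumed from the usual k3s release naming)
mkdir -p /k3os/system/k3s/v0.11.0-alpha2
wget -O /k3os/system/k3s/v0.11.0-alpha2/k3s \
  https://github.com/rancher/k3s/releases/download/v0.11.0-alpha2/k3s-arm64
chmod +x /k3os/system/k3s/v0.11.0-alpha2/k3s

# Point the "current" symlink at the new version, then reboot
ln -sfn v0.11.0-alpha2 /k3os/system/k3s/current
reboot
```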
After reboot, k3s does wait for the apiserver to come up, as this message is printed after 10 seconds of waiting:
time="2019-11-10T12:13:13.312852148Z" level=info msg="waiting for apiserver to become available"
Then, five seconds later, the apiserver is up. The nodes fail to join it because of this error:
time="2019-11-10T12:14:05.859510419Z" level=error msg="password file '/var/lib/rancher/k3s/server/cred/node-passwd' must have at least 3 columns (password, user name, user uid), found 2"
Removing only the lines for the node attempting to join doesn't help; the file has to be cleared completely, after which the node seems to join normally.
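In shell terms, the workaround amounts to something like this on the master (a sketch, assuming a POSIX shell):

```
# Clear all stale two-column entries; the node then joined normally
: > /var/lib/rancher/k3s/server/cred/node-passwd
```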
There are still various errors in the logs, but it seems the original problem in this issue is indeed resolved by updating to the 0.11.0 pre-releases.
I just had a quick look, and it seems @ibuildthecloud did a refactor of the node password functionality that changed the format from 2 columns to 3 columns.
I don't know whether the change was intended to be backwards compatible, but since k3s and k3OS are pre-1.0 I don't think so, and breaking changes when upgrading are to be expected.
Until things settle down with k3s (KubeCon is coming ...), issues like this keep me leery of in-place upgrades, whether bundled as per the norm or manual and out-of-band as was done to address this here. In other words, @sgielen hit this because of an apparent lack of an upgrade migration, which is to be expected from 0.x projects.
@sgielen seems to have a viable work-around. Please re-open if the work-around is unsatisfactory.
I ran k3os-upgrade-rootfs on a v0.5.0 k3s node (karen) and its master (kathleen) simultaneously. However, after rebooting both nodes, the non-master node fails to come up with the following in the logs:
This appears to be because the master is not accepting TCP connections on port 6443 at all; it is stuck in a restart loop, continuously reporting: