k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

After a complete power off: the server is currently unable to handle the request #2249

Closed PPCM closed 3 years ago

PPCM commented 3 years ago

Environmental Info: K3s Version: k3s version v1.19.1-rc1+k3s1 (041f18f6)

Node(s) CPU architecture, OS, and Version: Linux cluster01 5.4.0-1018-raspi #20-Ubuntu SMP Sun Sep 6 05:11:16 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux

Cluster Configuration: 3 masters

Describe the bug: After a fresh install, all nodes came up correctly. After a complete power off of all nodes, the nodes are not able to reconnect when the devices come back up.

Steps To Reproduce: Installation of K3s on the first master node:

curl -sfL https://get.k3s.io | K3S_CLUSTER_INIT=true INSTALL_K3S_CHANNEL=testing sh -

On the 2 other masters, installation of K3s:

curl -sfL https://get.k3s.io | K3S_URL=${k3s_url} K3S_TOKEN=${k3s_token} INSTALL_K3S_CHANNEL=testing bash -s server

with the correct first node URL and token
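For reference, a hedged sketch of how those two values can be obtained (the node-token path is the standard k3s location; cluster01 is the first node's hostname in this setup):

# On the first master, read the join token that k3s generated at install time
k3s_token=$(sudo cat /var/lib/rancher/k3s/server/node-token)
# The first node's supervisor/API URL
k3s_url="https://cluster01:6443"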

After the power off, here is the error on the first master node:

Error from server (ServiceUnavailable): the server is currently unable to handle the request

And on the other master nodes:

The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?

Expected behavior: After a shutdown, all nodes should reconnect. At the very least, one of them should start correctly.

Actual behavior: Connection errors to the nodes

Additional context / logs: On the first master, these messages repeat in a loop:

Sep 15 00:03:16 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.14:49416: remote error: tls: bad certificate
Sep 15 00:03:16 cluster01 k3s[1706]: raft2020/09/15 00:03:16 INFO: 6571fb7574e87dba is starting a new election at term 1287
Sep 15 00:03:16 cluster01 k3s[1706]: raft2020/09/15 00:03:16 INFO: 6571fb7574e87dba became candidate at term 1288
Sep 15 00:03:16 cluster01 k3s[1706]: raft2020/09/15 00:03:16 INFO: 6571fb7574e87dba received MsgVoteResp from 6571fb7574e87dba at term 1288
Sep 15 00:03:16 cluster01 k3s[1706]: raft2020/09/15 00:03:16 INFO: 6571fb7574e87dba [logterm: 3, index: 4142] sent MsgVote request to 5b9648b5cb0fc41a at term 1288
Sep 15 00:03:16 cluster01 k3s[1706]: raft2020/09/15 00:03:16 INFO: 6571fb7574e87dba [logterm: 3, index: 4142] sent MsgVote request to 7e3172f2dd9d22d5 at term 1288
Sep 15 00:03:18 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.13:39492: remote error: tls: bad certificate
Sep 15 00:03:18 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.14:49424: remote error: tls: bad certificate
Sep 15 00:03:19 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.12:55948: remote error: tls: bad certificate
Sep 15 00:03:19 cluster01 k3s[1706]: health check for peer 5b9648b5cb0fc41a could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 15 00:03:19 cluster01 k3s[1706]: health check for peer 7e3172f2dd9d22d5 could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 15 00:03:19 cluster01 k3s[1706]: health check for peer 5b9648b5cb0fc41a could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 15 00:03:19 cluster01 k3s[1706]: health check for peer 7e3172f2dd9d22d5 could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 15 00:03:20 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.14:49432: remote error: tls: bad certificate
Sep 15 00:03:22 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.14:49440: remote error: tls: bad certificate
Sep 15 00:03:24 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.13:39496: remote error: tls: bad certificate
Sep 15 00:03:24 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.14:49448: remote error: tls: bad certificate
Sep 15 00:03:24 cluster01 k3s[1706]: {"level":"warn","ts":"2020-09-15T00:03:24.661Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Sep 15 00:03:24 cluster01 k3s[1706]: time="2020-09-15T00:03:24.661907139Z" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 15 00:03:24 cluster01 k3s[1706]: WARNING: 2020/09/15 00:03:24 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context canceled". Reconnecting...
Sep 15 00:03:24 cluster01 k3s[1706]: WARNING: 2020/09/15 00:03:24 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting...
Sep 15 00:03:24 cluster01 k3s[1706]: health check for peer 7e3172f2dd9d22d5 could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 15 00:03:24 cluster01 k3s[1706]: health check for peer 5b9648b5cb0fc41a could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 15 00:03:24 cluster01 k3s[1706]: health check for peer 5b9648b5cb0fc41a could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 15 00:03:24 cluster01 k3s[1706]: health check for peer 7e3172f2dd9d22d5 could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 15 00:03:25 cluster01 k3s[1706]: raft2020/09/15 00:03:25 INFO: 6571fb7574e87dba is starting a new election at term 1288
Sep 15 00:03:25 cluster01 k3s[1706]: raft2020/09/15 00:03:25 INFO: 6571fb7574e87dba became candidate at term 1289
Sep 15 00:03:25 cluster01 k3s[1706]: raft2020/09/15 00:03:25 INFO: 6571fb7574e87dba received MsgVoteResp from 6571fb7574e87dba at term 1289
Sep 15 00:03:25 cluster01 k3s[1706]: raft2020/09/15 00:03:25 INFO: 6571fb7574e87dba [logterm: 3, index: 4142] sent MsgVote request to 5b9648b5cb0fc41a at term 1289
Sep 15 00:03:25 cluster01 k3s[1706]: raft2020/09/15 00:03:25 INFO: 6571fb7574e87dba [logterm: 3, index: 4142] sent MsgVote request to 7e3172f2dd9d22d5 at term 1289
Sep 15 00:03:25 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.12:55952: remote error: tls: bad certificate
Sep 15 00:03:26 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.14:49456: remote error: tls: bad certificate
Sep 15 00:03:28 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.14:49464: remote error: tls: bad certificate
Sep 15 00:03:29 cluster01 k3s[1706]: publish error: etcdserver: request timed out
Sep 15 00:03:29 cluster01 k3s[1706]: health check for peer 5b9648b5cb0fc41a could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 15 00:03:29 cluster01 k3s[1706]: health check for peer 7e3172f2dd9d22d5 could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 15 00:03:29 cluster01 k3s[1706]: health check for peer 7e3172f2dd9d22d5 could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 15 00:03:29 cluster01 k3s[1706]: health check for peer 5b9648b5cb0fc41a could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 15 00:03:30 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.13:39500: remote error: tls: bad certificate
Sep 15 00:03:30 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.14:49472: remote error: tls: bad certificate
Sep 15 00:03:31 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.12:55956: remote error: tls: bad certificate
Sep 15 00:03:32 cluster01 k3s[1706]: http: TLS handshake error from 192.168.10.14:49480: remote error: tls: bad certificate

On the other masters, these messages repeat:

Sep 15 00:05:19 cluster02 k3s[41498]: time="2020-09-15T00:05:19.655182251Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs at https://cluster01:6443/cacerts: https://cluster01:6443/cacerts: 503 Service Unavailable"
Sep 15 00:05:19 cluster02 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Sep 15 00:05:19 cluster02 systemd[1]: k3s.service: Failed with result 'exit-code'.
Sep 15 00:05:19 cluster02 systemd[1]: Failed to start Lightweight Kubernetes.
Sep 15 00:05:24 cluster02 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 1579.
Sep 15 00:05:24 cluster02 systemd[1]: Stopped Lightweight Kubernetes.
Sep 15 00:05:24 cluster02 systemd[1]: Starting Lightweight Kubernetes...
Sep 15 00:05:25 cluster02 k3s[41527]: time="2020-09-15T00:05:25.507939367Z" level=info msg="Starting k3s v1.19.1-rc1+k3s1 (041f18f6)"
Sep 15 00:05:25 cluster02 k3s[41527]: time="2020-09-15T00:05:25.666961273Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs at https://cluster01:6443/cacerts: https://cluster01:6443/cacerts: 503 Service Unavailable"
Sep 15 00:05:25 cluster02 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Sep 15 00:05:25 cluster02 systemd[1]: k3s.service: Failed with result 'exit-code'.
Sep 15 00:05:25 cluster02 systemd[1]: Failed to start Lightweight Kubernetes.
Sep 15 00:05:30 cluster02 systemd[1]: k3s.service: Scheduled restart job, restart counter is at 1580.
Sep 15 00:05:30 cluster02 systemd[1]: Stopped Lightweight Kubernetes.
Sep 15 00:05:30 cluster02 systemd[1]: Starting Lightweight Kubernetes...
Sep 15 00:05:31 cluster02 k3s[41549]: time="2020-09-15T00:05:31.512641858Z" level=info msg="Starting k3s v1.19.1-rc1+k3s1 (041f18f6)"
Sep 15 00:05:31 cluster02 k3s[41549]: time="2020-09-15T00:05:31.671164367Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs at https://cluster01:6443/cacerts: https://cluster01:6443/cacerts: 503 Service Unavailable"
Sep 15 00:05:31 cluster02 systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Sep 15 00:05:31 cluster02 systemd[1]: k3s.service: Failed with result 'exit-code'.
Sep 15 00:05:31 cluster02 systemd[1]: Failed to start Lightweight Kubernetes.

PPCM commented 3 years ago

Nobody else has this issue? Did I miss something?

onedr0p commented 3 years ago

Looks like we're the beta testers @PPCM ;)

I have this same issue with v1.19.1+k3s1

k get nodes -o wide
NAME           STATUS   ROLES         AGE   VERSION        INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k8s-master-a   Ready    etcd,master   51s   v1.19.1+k3s1   192.168.42.100   <none>        Ubuntu 20.04.1 LTS   5.4.0-47-generic   containerd://1.4.0-k3s1
k8s-master-b   Ready    etcd,master   36s   v1.19.1+k3s1   192.168.42.101   <none>        Ubuntu 20.04.1 LTS   5.4.0-47-generic   containerd://1.4.0-k3s1
k8s-master-c   Ready    etcd,master   25s   v1.19.1+k3s1   192.168.42.102   <none>        Ubuntu 20.04.1 LTS   5.4.0-47-generic   containerd://1.4.0-k3s1

After reboot and even waiting 15+ minutes:

k get nodes -o wide
Error from server (ServiceUnavailable): the server is currently unable to handle the request

Output of journalctl --no-pager -u k3s.service -b

master-a logs: k3s-a.log
master-b logs: k3s-b.log
master-c logs: k3s-c.log

brandond commented 3 years ago

Thanks for the feedback y'all! We'll investigate this as soon as possible. We expected there might be a few corner cases with the embedded etcd cluster which is why it's still experimental, but it should stabilize soon.

onedr0p commented 3 years ago

I redeployed the master nodes, this time with --debug turned on. The logs below are from after a reboot:

k3s-a.log k3s-b.log k3s-c.log

192.168.42.99 is my keepalived VIP which has the following configuration:

global_defs {
    router_id LVS_DEVEL
}

vrrp_script chk_k3s_server {
    script "killall -0 k3s-server"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eno1
    virtual_router_id 1
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass kubernetes
    }
    virtual_ipaddress {
        192.168.42.99/24
    }
    track_script {
        chk_k3s_server
    }
}
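For reference, a quick way to check which node currently holds that VIP (assuming the eno1 interface from the configuration above):

# Prints a message only on the node that currently owns the keepalived VIP
ip -4 addr show dev eno1 | grep -q '192.168.42.99' && echo "this node holds the VIP"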

I also noticed that when I reboot only the first master node (a), once it comes back up and k3s-server has started I get the same issue: Error from server (ServiceUnavailable): the server is currently unable to handle the request. It took running systemctl restart k3s on that node to bring it back up. A full reboot still borks it all, though.

brandond commented 3 years ago

Does this work without the keepalived VIP?

Also, Pis have some odd issues with systemd starting things before the network is actually up. Do you see any change in behavior if you modify your k3s systemd unit as described here: https://github.com/rancher/k3s/pull/2210/files
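As I recall, that PR boils down to making the unit wait for network-online rather than network. A hedged sketch of applying the same idea with a systemd drop-in (the drop-in path is plain systemd convention, not something the k3s installer creates for you):

# Make k3s wait until the network is actually online before starting
sudo mkdir -p /etc/systemd/system/k3s.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/k3s.service.d/override.conf
[Unit]
Wants=network-online.target
After=network-online.target
EOF
sudo systemctl daemon-reload
sudo systemctl restart k3s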

onedr0p commented 3 years ago

I will report back after removing keepalived to see if I experience the same issue. No RasPis here; I am using Intel NUCs, and my systemd file already had After=network.target in it.

brandond commented 3 years ago

@onedr0p if you're not on the same platform as this issue, can you open a new issue and fill out the template? It's hard to track affected system configurations when people me-too on other issues without providing all the info.

onedr0p commented 3 years ago

Removing the keepalived VIP didn't work, meaning the master nodes still do not come back online after a reboot.

@brandond Apologies for not including my hardware specs initially, but these issues look practically identical, no? I would expect a fix for one architecture to be a fix for the other.

brandond commented 3 years ago

That depends on what the root cause is. A different OS distribution and different hardware merit a new issue.

PPCM commented 3 years ago

I ran a test with the new RC:

cluster01   Ready    etcd,master   83m   v1.19.2-rc2+k3s1
cluster02   Ready    etcd,master   82m   v1.19.2-rc2+k3s1
cluster03   Ready    etcd,master   81m   v1.19.2-rc2+k3s1

The result is the same: the cluster is still down. But the logs are different:

Sep 19 01:31:42 cluster01 k3s[2164]: time="2020-09-19T01:31:42.947291646+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:42 http: TLS handshake error from 192.168.10.13:49048: remote error: tls: bad certificate"
Sep 19 01:31:43 cluster01 k3s[2164]: time="2020-09-19T01:31:43.000036202+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:42 http: TLS handshake error from 192.168.10.12:54066: remote error: tls: bad certificate"
Sep 19 01:31:43 cluster01 k3s[2164]: time="2020-09-19T01:31:43.418649172+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:43 http: TLS handshake error from 192.168.10.14:47548: remote error: tls: bad certificate"
Sep 19 01:31:45 cluster01 k3s[2164]: time="2020-09-19T01:31:45.438618592+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:45 http: TLS handshake error from 192.168.10.14:47556: remote error: tls: bad certificate"
Sep 19 01:31:46 cluster01 k3s[2164]: {"level":"warn","ts":"2020-09-19T01:31:46.046+0200","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Sep 19 01:31:46 cluster01 k3s[2164]: time="2020-09-19T01:31:46.047078765+02:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 19 01:31:46 cluster01 k3s[2164]: WARNING: 2020/09/19 01:31:46 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context canceled". Reconnecting...
Sep 19 01:31:46 cluster01 k3s[2164]: health check for peer a1da96e589c651ed could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 19 01:31:46 cluster01 k3s[2164]: health check for peer 3928f76ae7bbea7b could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 19 01:31:46 cluster01 k3s[2164]: health check for peer a1da96e589c651ed could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 19 01:31:46 cluster01 k3s[2164]: health check for peer 3928f76ae7bbea7b could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 19 01:31:47 cluster01 k3s[2164]: raft2020/09/19 01:31:47 INFO: 6571fb7574e87dba is starting a new election at term 60
Sep 19 01:31:47 cluster01 k3s[2164]: raft2020/09/19 01:31:47 INFO: 6571fb7574e87dba became candidate at term 61
Sep 19 01:31:47 cluster01 k3s[2164]: raft2020/09/19 01:31:47 INFO: 6571fb7574e87dba received MsgVoteResp from 6571fb7574e87dba at term 61
Sep 19 01:31:47 cluster01 k3s[2164]: raft2020/09/19 01:31:47 INFO: 6571fb7574e87dba [logterm: 3, index: 26854] sent MsgVote request to 3928f76ae7bbea7b at term 61
Sep 19 01:31:47 cluster01 k3s[2164]: raft2020/09/19 01:31:47 INFO: 6571fb7574e87dba [logterm: 3, index: 26854] sent MsgVote request to a1da96e589c651ed at term 61
Sep 19 01:31:47 cluster01 k3s[2164]: time="2020-09-19T01:31:47.456692788+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:47 http: TLS handshake error from 192.168.10.14:47564: remote error: tls: bad certificate"
Sep 19 01:31:48 cluster01 k3s[2164]: time="2020-09-19T01:31:48.954812267+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:48 http: TLS handshake error from 192.168.10.13:49052: remote error: tls: bad certificate"
Sep 19 01:31:49 cluster01 k3s[2164]: time="2020-09-19T01:31:49.011361853+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:49 http: TLS handshake error from 192.168.10.12:54070: remote error: tls: bad certificate"
Sep 19 01:31:49 cluster01 k3s[2164]: time="2020-09-19T01:31:49.476812173+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:49 http: TLS handshake error from 192.168.10.14:47572: remote error: tls: bad certificate"
Sep 19 01:31:51 cluster01 k3s[2164]: publish error: etcdserver: request timed out
Sep 19 01:31:51 cluster01 k3s[2164]: health check for peer a1da96e589c651ed could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 19 01:31:51 cluster01 k3s[2164]: health check for peer a1da96e589c651ed could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 19 01:31:51 cluster01 k3s[2164]: health check for peer 3928f76ae7bbea7b could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 19 01:31:51 cluster01 k3s[2164]: health check for peer 3928f76ae7bbea7b could not connect: dial tcp 192.168.10.13:2380: connect: connection refused

brandond commented 3 years ago

Yes, it appears to be deadlocked. Node 1 is waiting for 2 and 3 to come up before etcd will start, as a single node does not have quorum. Nodes 2 and 3 are waiting for node 1 to come up so that they can bootstrap the certs, and can't proceed past that to start etcd. I'm not sure why nodes 2 and 3 are trying to load certs again though, as this should have been completed when they came up the first time.
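If anyone wants to poke at the etcd side of this directly, here is a hedged sketch (etcdctl is not bundled with k3s, so this assumes a separately installed etcdctl binary; the certificate paths are the ones visible in the kube-apiserver arguments later in this thread):

# Query the embedded etcd member with the k3s-managed client certs
ETCDCTL_API=3 etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/k3s/server/tls/etcd/client.crt \
  --key /var/lib/rancher/k3s/server/tls/etcd/client.key \
  endpoint status --write-out=table
# With only one of three members reachable this will error out or show no
# leader, which is the quorum deadlock described above.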

cjellick commented 3 years ago

This is definitely a blocker for GAing etcd support. We’ll figure it out

cjellick commented 3 years ago

Cc @davidnuzik

brandond commented 3 years ago

On the primary server, this happens because we block starting the http listener until the datastore finishes starting. Obviously the datastore can't come up until quorum is reached, which requires additional servers.

On the secondary servers this happens because we try to validate the client token here: https://github.com/rancher/k3s/blob/master/pkg/cluster/bootstrap.go#L43 before checking to see if we've already bootstrapped here: https://github.com/rancher/k3s/blob/master/pkg/cluster/bootstrap.go#L56, all of which has to complete before we start the datastore (etcd).

The workaround is to remove K3S_TOKEN and K3S_URL from /etc/systemd/system/k3s.service.env on the secondary servers after the first startup so it doesn't try to bootstrap again.
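For example, a minimal sketch of that workaround (assuming the install script wrote the variables one per line into that env file):

# On each secondary server, after it has joined the cluster once:
sudo sed -i -e '/^K3S_TOKEN=/d' -e '/^K3S_URL=/d' /etc/systemd/system/k3s.service.env
sudo systemctl restart k3s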

The real fix is to either correct the check order, or start up the http listener earlier so that it can respond to bootstrap requests before etcd is up. The first option is probably simpler, assuming we can handle any side effects of switching the checks.

PPCM commented 3 years ago

I tried the workaround and it works fine. Thanks! Do you think the real fix will be done in the next release?

onedr0p commented 3 years ago

We're all so eager to get this to work :D

@brandond I am using a different install method (https://github.com/PyratLabs/ansible-role-k3s), which includes the flags in the systemd files. I am still unable to make the cluster come back after a reboot. Maybe you can see why by looking at my systemd files?

ansible all -i inventory/custom/hosts.yml -b -m shell -a "cat /etc/systemd/system/k3s.service"

k8s-master-a | CHANGED | rc=0 >>

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target

[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --cluster-init --tls-san 192.168.42.19 --node-ip 192.168.42.20 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

k8s-master-c | CHANGED | rc=0 >>

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target

[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --server https://192.168.42.19:6443 --token K10a4e55644c307802de8b9b60d40a902e4e72dd3204994feeb481e63e5823ed4ea::server:12006d0e29e047265baf3c1fc06c6ec9 --tls-san 192.168.42.19 --node-ip 192.168.42.22 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

k8s-master-b | CHANGED | rc=0 >>

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target

[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --server https://192.168.42.19:6443 --token K10a4e55644c307802de8b9b60d40a902e4e72dd3204994feeb481e63e5823ed4ea::server:12006d0e29e047265baf3c1fc06c6ec9 --tls-san 192.168.42.19 --node-ip 192.168.42.21 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

Also, random question: if we label a worker node as etcd, will it also be added to the quorum? Or is that specifically for master nodes?

brandond commented 3 years ago

@onedr0p in that case you will need to remove the --server and --token flags from the server command line.

Only servers (nodes started with k3s server) run the apiserver and datastore (etcd). Adding labels to agents won't make them a server.

onedr0p commented 3 years ago

@brandond I removed --server and --token from one of the secondary masters, then issued sudo systemctl daemon-reload and sudo systemctl restart k3s. The service came back healthy and the node was still working. However, on reboot this node did not want to join back.

Systemd service:

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target

[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --tls-san 192.168.42.19 --node-ip 192.168.42.22 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

Here is the logs for that node after reboot

Sep 21 08:16:38 k8s-master-c systemd[1]: Starting Lightweight Kubernetes...
Sep 21 08:16:38 k8s-master-c k3s[521]: time="2020-09-21T08:16:38.429345552-04:00" level=info msg="Starting k3s v1.19.2-rc2+k3s1 (d38505b1)"
Sep 21 08:16:38 k8s-master-c k3s[521]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Sep 21 08:16:38 k8s-master-c k3s[521]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Sep 21 08:16:38 k8s-master-c k3s[521]: peerTLS: cert = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt, key = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key, trusted-ca = /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file =
Sep 21 08:16:38 k8s-master-c k3s[521]: time="2020-09-21T08:16:38.447441699-04:00" level=info msg="Running kube-apiserver --advertise-address=192.168.42.22 --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=unknown --authorization-mode=Node,RBAC --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --enable-admission-plugins=NodeRestriction --etcd-cafile=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/k3s/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/k3s/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --insecure-port=0 --kubelet-certificate-authority=/var/lib/rancher/k3s/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --profiling=false --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=k3s --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"
Sep 21 08:16:38 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:38 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:39 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:39 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:39 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:39 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:40 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:40 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:41 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:41 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:43 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:43 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:44 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:44 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:47 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:47 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:48 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:48 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:48 k8s-master-c k3s[521]: {"level":"warn","ts":"2020-09-21T08:16:48.447-0400","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Sep 21 08:16:48 k8s-master-c k3s[521]: time="2020-09-21T08:16:48.447933340-04:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 21 08:16:53 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:53 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:53 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:53 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:54 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:54 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:55 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:55 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:58 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:58 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:17:02 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:17:02 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:17:03 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:17:03 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...

At least it's now not complaining about certificates :)

Edit: I may try to do a manual install to see if this can be replicated. However, reviewing the ansible role (PyratLabs/ansible-role-k3s), I do not see anything that stands out as to why it would not work.

brandond commented 3 years ago

@onedr0p on k8s-master-c, can you change your command from k3s server to k3s --debug server? It looks like etcd is not finishing startup for some reason. Did you do anything to the files on disk?

onedr0p commented 3 years ago

Sure, here you go. The systemd file is the same but with --debug added; here is the log after a reboot.

-- Logs begin at Mon 2020-09-21 10:32:02 EDT, end at Mon 2020-09-21 13:52:32 EDT. --
Sep 21 13:51:47 k8s-master-c systemd[1]: Starting Lightweight Kubernetes...
Sep 21 13:51:47 k8s-master-c k3s[526]: time="2020-09-21T13:51:47-04:00" level=debug msg="Asset dir /var/lib/rancher/k3s/data/07bf5246a6ab2127234428fbf3023ce85b0376a7b946bbdaee2726b7d9a6fad8"
Sep 21 13:51:47 k8s-master-c k3s[526]: time="2020-09-21T13:51:47-04:00" level=debug msg="Running /var/lib/rancher/k3s/data/07bf5246a6ab2127234428fbf3023ce85b0376a7b946bbdaee2726b7d9a6fad8/bin/k3s-server [/usr/local/bin/k3s --debug server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --cluster-dns 10.43.0.10 --tls-san 192.168.42.19 --node-ip 192.168.42.22 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true]"
Sep 21 13:51:47 k8s-master-c k3s[526]: time="2020-09-21T13:51:47.400486793-04:00" level=info msg="Starting k3s v1.19.2+k3s1 (d38505b1)"
Sep 21 13:51:47 k8s-master-c k3s[526]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Sep 21 13:51:47 k8s-master-c k3s[526]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Sep 21 13:51:47 k8s-master-c k3s[526]: peerTLS: cert = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt, key = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key, trusted-ca = /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file =
Sep 21 13:51:47 k8s-master-c k3s[526]: time="2020-09-21T13:51:47.419057703-04:00" level=info msg="Running kube-apiserver --advertise-address=192.168.42.22 --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=unknown --authorization-mode=Node,RBAC --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --enable-admission-plugins=NodeRestriction --etcd-cafile=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/k3s/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/k3s/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --insecure-port=0 --kubelet-certificate-authority=/var/lib/rancher/k3s/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --profiling=false --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=k3s --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"
Sep 21 13:51:47 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:47 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:47 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:47 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:48 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:48 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:48 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:48 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:49 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:49 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:50 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:50 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:51 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:51 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:52 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:52 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:56 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:56 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:57 k8s-master-c k3s[526]: {"level":"warn","ts":"2020-09-21T13:51:57.419-0400","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Sep 21 13:51:57 k8s-master-c k3s[526]: time="2020-09-21T13:51:57.420272065-04:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 21 13:51:57 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:57 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:02 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:02 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:03 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:03 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:03 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:03 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:04 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:04 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:07 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:07 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:11 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:11 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:12 k8s-master-c k3s[526]: {"level":"warn","ts":"2020-09-21T13:52:12.420-0400","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Sep 21 13:52:12 k8s-master-c k3s[526]: time="2020-09-21T13:52:12.420999478-04:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 21 13:52:12 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:12 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:17 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:17 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:18 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:18 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:20 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:20 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:22 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:22 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:26 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:26 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:27 k8s-master-c k3s[526]: {"level":"warn","ts":"2020-09-21T13:52:27.421-0400","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Sep 21 13:52:27 k8s-master-c k3s[526]: time="2020-09-21T13:52:27.421641003-04:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 21 13:52:29 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:29 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:32 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:32 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...

Edit 1: it appears that after I run sudo systemctl restart k3s the service becomes active. Maybe there is a race condition or something else happening on boot.

Edit 2: it appears that in my case the systemd file had After=network.target and NOT After=network-online.target

Sorry for missing that 🤦 The node now comes back healthy after removing --token and --server flags and a reboot. 🥳 Thanks for the help!

brandond commented 3 years ago

If you want to live dangerously, you can try out the fix via INSTALL_K3S_COMMIT=48ed47c4a3e420fa71c18b2ec97f13dc0659778b

The kubernetes API will still not come up until quorum is reached - this is just how etcd works. All k3s services should start successfully, and wait for a quorum of etcd members to become available, at which point the apiserver will come online.
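For anyone following along, the commit can be consumed through the normal install script; a hedged example, re-run on an existing server node together with whatever other flags and variables you already use:

curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=48ed47c4a3e420fa71c18b2ec97f13dc0659778b sh -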

cubed-it commented 3 years ago

@brandond with that commit the node reboots successfully and everything looks good. The log is full of messages I do not understand; if anyone would like to check, I attached it: node1.txt

I noticed one thing: the first node takes ~10 minutes to respond to any SSH requests after a reboot. It is reproducible behaviour. SSH becomes available as soon as k3s is available. Any suggestions? :/

onedr0p commented 3 years ago

@brandond I have installed that commit and the masters are coming back online after a reboot. I don't see the 10-minute delay that @cubed-it mentioned.

brandond commented 3 years ago

@cubed-it that doesn't sound like a k3s issue. It sounds like something is timing out on boot and delaying network startup. Have you checked your system logs?

cubed-it commented 3 years ago

@brandond just in the last few minutes I have been analyzing the problem further and it seems to be the operating system. When I issue a "reboot" command, everything runs fine until "Stopping Restore / save the current clock...". Whatever that means...

So everything is good with k3s ;) Even more than good, I have to say! The 1.19 etcd performance compared to 1.18 sqlite is amazing: I observe only half of the base load on the CPU.

Postscript: There were kernel, bootloader and EEPROM updates pending. After I installed them all, the problem seems to have disappeared as suddenly as it appeared.

ShylajaDevadiga commented 3 years ago

Reproduced the issue using v1.19.1-rc1+k3s1 and validated the fix using commit ID 6b11d860374f938623479b0d06aeb3c1ac87ce86.

Created a cluster with 3 server nodes. Rebooted all nodes. All nodes were able to reconnect.

Commands used on Server 1:

curl -sfL https://get.k3s.io | K3S_CLUSTER_INIT=true INSTALL_K3S_COMMIT=6b11d860374f938623479b0d06aeb3c1ac87ce86 sh -

Servers 2 and 3:

curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=6b11d860374f938623479b0d06aeb3c1ac87ce86 sh -s - server --token=TOKEN --server=https://FIRSTNODEIP:6443

After reboot:

kubectl get nodes 
NAME               STATUS   ROLES         AGE   VERSION
ip-172-31-10-253   Ready    etcd,master   13m   v1.19.2+k3s-6b11d860
ip-172-31-11-247   Ready    etcd,master   13m   v1.19.2+k3s-6b11d860
ip-172-31-5-59     Ready    etcd,master   16m   v1.19.2+k3s-6b11d860

mamiu commented 3 years ago

@ShylajaDevadiga @brandond

Is there a reason why this fix (the commit ID from @ShylajaDevadiga's comment, which does indeed solve this bug) is not included in the latest v1.19.3+k3s1 release? This issue has the milestone v1.19.3+k3s1 and is listed as closed, but the fix is not in the latest release channel (currently v1.19.3+k3s1), so an installation with INSTALL_K3S_CHANNEL=latest still has this issue.

brandond commented 3 years ago

Not everything makes the planned milestone. The upstream release had several CVEs, so we wanted to turn it around quickly. We've got another release planned soon (independent of upstream patches) that will include backported fixes from the master branch.

dharmendrakariya commented 3 years ago

> Not everything makes the planned milestone. The upstream release had several CVEs, so we wanted to turn it around quickly. We've got another release planned soon (independent of upstream patches) that will include backported fixes from the master branch.

Hi @brandond, I am using this setup: Exercise I Performed. PS: I am using k3OS on Node2 and Node3, and I am facing this issue. After a reboot the setup does not come up. Can you suggest anything in particular for this case?

brandond commented 3 years ago

@dharmendrakariya this issue was fixed in the PR linked above. Please confirm that you are running a release that includes the fix.

dharmendrakariya commented 3 years ago

@brandond okay, great, but I think k3OS doesn't have this patch, because I hit this issue just today.

brandond commented 3 years ago

Ah, in that case you should probably open an issue in the k3os repo so that it can be tracked there.

dharmendrakariya commented 3 years ago

Oops my bad, will do that. You are right!