PPCM closed this issue 3 years ago.
Nobody else has this issue? Did I miss something?
Looks like we're the beta testers @PPCM ;)
I have this same issue with 1.19.1-k3s1
k get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-master-a Ready etcd,master 51s v1.19.1+k3s1 192.168.42.100 <none> Ubuntu 20.04.1 LTS 5.4.0-47-generic containerd://1.4.0-k3s1
k8s-master-b Ready etcd,master 36s v1.19.1+k3s1 192.168.42.101 <none> Ubuntu 20.04.1 LTS 5.4.0-47-generic containerd://1.4.0-k3s1
k8s-master-c Ready etcd,master 25s v1.19.1+k3s1 192.168.42.102 <none> Ubuntu 20.04.1 LTS 5.4.0-47-generic containerd://1.4.0-k3s1
After reboot and even waiting 15+ minutes:
k get nodes -o wide
Error from server (ServiceUnavailable): the server is currently unable to handle the request
Output of journalctl --no-pager -u k3s.service -b
master-a logs: k3s-a.log
master-b logs: k3s-b.log
master-c logs: k3s-c.log
Thanks for the feedback y'all! We'll investigate this as soon as possible. We expected there might be a few corner cases with the embedded etcd cluster which is why it's still experimental, but it should stabilize soon.
I redeployed the master nodes, this time with --debug turned on; the logs below are from after a reboot.
192.168.42.99 is my keepalived VIP which has the following configuration:
global_defs {
    router_id LVS_DEVEL
}

vrrp_script chk_k3s_server {
    script "killall -0 k3s-server"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eno1
    virtual_router_id 1
    priority 150
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass kubernetes
    }
    virtual_ipaddress {
        192.168.42.99/24
    }
    track_script {
        chk_k3s_server
    }
}
I also noticed that when I reboot only the first master node (a), once it comes back up and k3s-server has started, I get the same issue: Error from server (ServiceUnavailable): the server is currently unable to handle the request. It took running systemctl restart k3s on that node to bring it back up. A full reboot still borks it all, though.
Does this work without the keepalived VIP?
Also, Pi's have some odd issues with systemd starting things before the network is actually up. Do you see any change in behavior if you modify your k3s systemd unit as described here: https://github.com/rancher/k3s/pull/2210/files
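The change that PR describes can be approximated with a systemd drop-in. A minimal sketch (the drop-in filename is my choice; the targets are standard systemd):

```ini
# Hypothetical drop-in: /etc/systemd/system/k3s.service.d/wait-for-network.conf
# Orders k3s after network-online.target so the service does not start
# before the network is actually up (network.target alone is a weaker
# guarantee and can be reached before interfaces have addresses).
[Unit]
Wants=network-online.target
After=network-online.target
```

After adding a drop-in like this, run systemctl daemon-reload for it to take effect.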
I will report back after removing keepalived to see if I experience the same issue. No RasPis here; I am using Intel NUCs, and my systemd file already had After=network.target in it.
@onedr0p if you're not on the same platform as this issue can you open a new issue and fill out the template? It's hard to track affected system configurations when people me-too on other issues without providing all the info.
Removing the keepalived VIP didn't work, meaning the master nodes still do not come back online after a reboot.
@brandond Apologies for not including my hardware specs initially, but these issues look identical, no? I would expect a fix for one architecture to be a fix for the other.
That depends on what the root cause is. Different OS distribution and different hardware merit a new issue.
I ran a test with the new RC:
cluster01 Ready etcd,master 83m v1.19.2-rc2+k3s1
cluster02 Ready etcd,master 82m v1.19.2-rc2+k3s1
cluster03 Ready etcd,master 81m v1.19.2-rc2+k3s1
The result is the same: the cluster is still down. But the logs are different:
Sep 19 01:31:42 cluster01 k3s[2164]: time="2020-09-19T01:31:42.947291646+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:42 http: TLS handshake error from 192.168.10.13:49048: remote error: tls: bad certificate"
Sep 19 01:31:43 cluster01 k3s[2164]: time="2020-09-19T01:31:43.000036202+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:42 http: TLS handshake error from 192.168.10.12:54066: remote error: tls: bad certificate"
Sep 19 01:31:43 cluster01 k3s[2164]: time="2020-09-19T01:31:43.418649172+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:43 http: TLS handshake error from 192.168.10.14:47548: remote error: tls: bad certificate"
Sep 19 01:31:45 cluster01 k3s[2164]: time="2020-09-19T01:31:45.438618592+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:45 http: TLS handshake error from 192.168.10.14:47556: remote error: tls: bad certificate"
Sep 19 01:31:46 cluster01 k3s[2164]: {"level":"warn","ts":"2020-09-19T01:31:46.046+0200","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Sep 19 01:31:46 cluster01 k3s[2164]: time="2020-09-19T01:31:46.047078765+02:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 19 01:31:46 cluster01 k3s[2164]: WARNING: 2020/09/19 01:31:46 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context canceled". Reconnecting...
Sep 19 01:31:46 cluster01 k3s[2164]: health check for peer a1da96e589c651ed could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 19 01:31:46 cluster01 k3s[2164]: health check for peer 3928f76ae7bbea7b could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 19 01:31:46 cluster01 k3s[2164]: health check for peer a1da96e589c651ed could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 19 01:31:46 cluster01 k3s[2164]: health check for peer 3928f76ae7bbea7b could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 19 01:31:47 cluster01 k3s[2164]: raft2020/09/19 01:31:47 INFO: 6571fb7574e87dba is starting a new election at term 60
Sep 19 01:31:47 cluster01 k3s[2164]: raft2020/09/19 01:31:47 INFO: 6571fb7574e87dba became candidate at term 61
Sep 19 01:31:47 cluster01 k3s[2164]: raft2020/09/19 01:31:47 INFO: 6571fb7574e87dba received MsgVoteResp from 6571fb7574e87dba at term 61
Sep 19 01:31:47 cluster01 k3s[2164]: raft2020/09/19 01:31:47 INFO: 6571fb7574e87dba [logterm: 3, index: 26854] sent MsgVote request to 3928f76ae7bbea7b at term 61
Sep 19 01:31:47 cluster01 k3s[2164]: raft2020/09/19 01:31:47 INFO: 6571fb7574e87dba [logterm: 3, index: 26854] sent MsgVote request to a1da96e589c651ed at term 61
Sep 19 01:31:47 cluster01 k3s[2164]: time="2020-09-19T01:31:47.456692788+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:47 http: TLS handshake error from 192.168.10.14:47564: remote error: tls: bad certificate"
Sep 19 01:31:48 cluster01 k3s[2164]: time="2020-09-19T01:31:48.954812267+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:48 http: TLS handshake error from 192.168.10.13:49052: remote error: tls: bad certificate"
Sep 19 01:31:49 cluster01 k3s[2164]: time="2020-09-19T01:31:49.011361853+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:49 http: TLS handshake error from 192.168.10.12:54070: remote error: tls: bad certificate"
Sep 19 01:31:49 cluster01 k3s[2164]: time="2020-09-19T01:31:49.476812173+02:00" level=info msg="Cluster-Http-Server 2020/09/19 01:31:49 http: TLS handshake error from 192.168.10.14:47572: remote error: tls: bad certificate"
Sep 19 01:31:51 cluster01 k3s[2164]: publish error: etcdserver: request timed out
Sep 19 01:31:51 cluster01 k3s[2164]: health check for peer a1da96e589c651ed could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 19 01:31:51 cluster01 k3s[2164]: health check for peer a1da96e589c651ed could not connect: dial tcp 192.168.10.12:2380: connect: connection refused
Sep 19 01:31:51 cluster01 k3s[2164]: health check for peer 3928f76ae7bbea7b could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Sep 19 01:31:51 cluster01 k3s[2164]: health check for peer 3928f76ae7bbea7b could not connect: dial tcp 192.168.10.13:2380: connect: connection refused
Yes, it appears to be deadlocked. Node 1 is waiting for 2 and 3 to come up before etcd will start, as a single node does not have quorum. Nodes 2 and 3 are waiting for node 1 to come up so that they can bootstrap the certs, and can't proceed past that to start etcd. I'm not sure why nodes 2 and 3 are trying to load certs again though, as this should have been completed when they came up the first time.
This is definitely a blocker for GAing etcd support. We'll figure it out.
Cc @davidnuzik
On the primary server, this happens because we block starting the http listener until the datastore finishes starting. Obviously the datastore can't come up until quorum is reached, which requires additional servers.
On the secondary servers this happens because we try to validate the client token here: https://github.com/rancher/k3s/blob/master/pkg/cluster/bootstrap.go#L43 before checking whether we've already bootstrapped here: https://github.com/rancher/k3s/blob/master/pkg/cluster/bootstrap.go#L56. All of this has to complete before we start the datastore (etcd).
The workaround is to remove K3S_TOKEN and K3S_URL from /etc/systemd/system/k3s.service.env on the secondary servers after the first startup so it doesn't try to bootstrap again.
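For anyone applying that workaround, a sketch of the steps (the env file path comes from the comment above; override ENV_FILE if your install writes it elsewhere). Run on each secondary server:

```shell
# Sketch of the workaround described above, assuming the env file path
# from the comment. Guarded so it is a no-op when the file is absent.
ENV_FILE="${ENV_FILE:-/etc/systemd/system/k3s.service.env}"
if [ -f "$ENV_FILE" ]; then
    # Delete the bootstrap variables so k3s does not try to bootstrap again.
    sed -i -e '/^K3S_TOKEN=/d' -e '/^K3S_URL=/d' "$ENV_FILE"
fi
# Then pick up the change:
#   systemctl daemon-reload && systemctl restart k3s
```

Note this only helps installs that pass the token and URL via the env file; installs that put --token and --server on the ExecStart line need the flags removed there instead, as discussed below in this thread.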
The real fix is to either correct the check order, or start up the http listener earlier so that it can respond to bootstrap requests before etcd is up. The first option is probably simpler, assuming we can handle any side effects of switching the checks.
I tried the workaround and it works fine. Thanks! Do you think the real fix will be done in the next release?
We're all so eager to get this to work :D
@brandond I am doing a different method of install (https://github.com/PyratLabs/ansible-role-k3s), which puts the flags in the systemd files. I am still unable to make the cluster come back after a reboot. Maybe you can see why by looking at my systemd files?
ansible all -i inventory/custom/hosts.yml -b -m shell -a "cat /etc/systemd/system/k3s.service"
k8s-master-a | CHANGED | rc=0 >>
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target
[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --cluster-init --tls-san 192.168.42.19 --node-ip 192.168.42.20 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
k8s-master-c | CHANGED | rc=0 >>
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target
[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --server https://192.168.42.19:6443 --token K10a4e55644c307802de8b9b60d40a902e4e72dd3204994feeb481e63e5823ed4ea::server:12006d0e29e047265baf3c1fc06c6ec9 --tls-san 192.168.42.19 --node-ip 192.168.42.22 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
k8s-master-b | CHANGED | rc=0 >>
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target
[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --server https://192.168.42.19:6443 --token K10a4e55644c307802de8b9b60d40a902e4e72dd3204994feeb481e63e5823ed4ea::server:12006d0e29e047265baf3c1fc06c6ec9 --tls-san 192.168.42.19 --node-ip 192.168.42.21 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
Also, random question: if we label a worker node as etcd, will it also be added to the quorum? Or is that specifically for master nodes?
@onedr0p in that case you will need to remove the --server and --token flags from the server command line.
Only servers (nodes started with k3s server) run the apiserver and datastore (etcd). Adding labels to agents won't make them a server.
@brandond I removed --server and --token from one of the secondary masters, then issued sudo systemctl daemon-reload followed by sudo systemctl restart k3s. The service came back healthy and the node was still working. However, on reboot of this node it did not want to join back.
Systemd service:
[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network.target
[Service]
Type=notify
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/k3s server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --tls-san 192.168.42.19 --node-ip 192.168.42.22 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true
KillMode=process
Delegate=yes
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
[Install]
WantedBy=multi-user.target
Here are the logs for that node after the reboot:
Sep 21 08:16:38 k8s-master-c systemd[1]: Starting Lightweight Kubernetes...
Sep 21 08:16:38 k8s-master-c k3s[521]: time="2020-09-21T08:16:38.429345552-04:00" level=info msg="Starting k3s v1.19.2-rc2+k3s1 (d38505b1)"
Sep 21 08:16:38 k8s-master-c k3s[521]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Sep 21 08:16:38 k8s-master-c k3s[521]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Sep 21 08:16:38 k8s-master-c k3s[521]: peerTLS: cert = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt, key = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key, trusted-ca = /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file =
Sep 21 08:16:38 k8s-master-c k3s[521]: time="2020-09-21T08:16:38.447441699-04:00" level=info msg="Running kube-apiserver --advertise-address=192.168.42.22 --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=unknown --authorization-mode=Node,RBAC --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --enable-admission-plugins=NodeRestriction --etcd-cafile=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/k3s/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/k3s/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --insecure-port=0 --kubelet-certificate-authority=/var/lib/rancher/k3s/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --profiling=false --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=k3s --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"
Sep 21 08:16:38 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:38 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:39 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:39 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:39 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:39 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:40 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:40 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:41 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:41 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:43 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:43 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:44 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:44 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:47 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:47 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:48 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:48 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:48 k8s-master-c k3s[521]: {"level":"warn","ts":"2020-09-21T08:16:48.447-0400","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Sep 21 08:16:48 k8s-master-c k3s[521]: time="2020-09-21T08:16:48.447933340-04:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 21 08:16:53 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:53 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:53 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:53 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:54 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:54 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:55 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:55 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:16:58 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:16:58 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:17:02 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:17:02 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 08:17:03 k8s-master-c k3s[521]: WARNING: 2020/09/21 08:17:03 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
At least it's now not complaining about certificates :)
Edit: I may try a manual install to see if this can be replicated. However, reviewing the ansible role (PyratLabs/ansible-role-k3s), I do not see anything that stands out as to why it would not work.
@onedr0p on k8s-master-c, can you change your command from k3s server to k3s --debug server? It looks like etcd is not finishing startup for some reason. Did you do anything to the files on disk?
Sure, here you go. The systemd file is the same but with --debug added; here is the log after the reboot.
-- Logs begin at Mon 2020-09-21 10:32:02 EDT, end at Mon 2020-09-21 13:52:32 EDT. --
Sep 21 13:51:47 k8s-master-c systemd[1]: Starting Lightweight Kubernetes...
Sep 21 13:51:47 k8s-master-c k3s[526]: time="2020-09-21T13:51:47-04:00" level=debug msg="Asset dir /var/lib/rancher/k3s/data/07bf5246a6ab2127234428fbf3023ce85b0376a7b946bbdaee2726b7d9a6fad8"
Sep 21 13:51:47 k8s-master-c k3s[526]: time="2020-09-21T13:51:47-04:00" level=debug msg="Running /var/lib/rancher/k3s/data/07bf5246a6ab2127234428fbf3023ce85b0376a7b946bbdaee2726b7d9a6fad8/bin/k3s-server [/usr/local/bin/k3s --debug server --flannel-backend none --cluster-cidr 10.42.0.0/16 --service-cidr 10.43.0.0/16 --disable servicelb --disable traefik --disable local-storage --disable metrics-server --cluster-dns 10.43.0.10 --tls-san 192.168.42.19 --node-ip 192.168.42.22 --kubelet-arg feature-gates=ExternalPolicyForExternalIP=true]"
Sep 21 13:51:47 k8s-master-c k3s[526]: time="2020-09-21T13:51:47.400486793-04:00" level=info msg="Starting k3s v1.19.2+k3s1 (d38505b1)"
Sep 21 13:51:47 k8s-master-c k3s[526]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Sep 21 13:51:47 k8s-master-c k3s[526]: [WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
Sep 21 13:51:47 k8s-master-c k3s[526]: peerTLS: cert = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.crt, key = /var/lib/rancher/k3s/server/tls/etcd/peer-server-client.key, trusted-ca = /var/lib/rancher/k3s/server/tls/etcd/peer-ca.crt, client-cert-auth = true, crl-file =
Sep 21 13:51:47 k8s-master-c k3s[526]: time="2020-09-21T13:51:47.419057703-04:00" level=info msg="Running kube-apiserver --advertise-address=192.168.42.22 --advertise-port=6443 --allow-privileged=true --anonymous-auth=false --api-audiences=unknown --authorization-mode=Node,RBAC --bind-address=127.0.0.1 --cert-dir=/var/lib/rancher/k3s/server/tls/temporary-certs --client-ca-file=/var/lib/rancher/k3s/server/tls/client-ca.crt --enable-admission-plugins=NodeRestriction --etcd-cafile=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt --etcd-certfile=/var/lib/rancher/k3s/server/tls/etcd/client.crt --etcd-keyfile=/var/lib/rancher/k3s/server/tls/etcd/client.key --etcd-servers=https://127.0.0.1:2379 --insecure-port=0 --kubelet-certificate-authority=/var/lib/rancher/k3s/server/tls/server-ca.crt --kubelet-client-certificate=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.crt --kubelet-client-key=/var/lib/rancher/k3s/server/tls/client-kube-apiserver.key --profiling=false --proxy-client-cert-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.crt --proxy-client-key-file=/var/lib/rancher/k3s/server/tls/client-auth-proxy.key --requestheader-allowed-names=system:auth-proxy --requestheader-client-ca-file=/var/lib/rancher/k3s/server/tls/request-header-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6444 --service-account-issuer=k3s --service-account-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-account-signing-key-file=/var/lib/rancher/k3s/server/tls/service.key --service-cluster-ip-range=10.43.0.0/16 --storage-backend=etcd3 --tls-cert-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.crt --tls-private-key-file=/var/lib/rancher/k3s/server/tls/serving-kube-apiserver.key"
Sep 21 13:51:47 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:47 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:47 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:47 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:48 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:48 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:48 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:48 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:49 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:49 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:50 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:50 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:51 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:51 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:52 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:52 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:56 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:56 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:51:57 k8s-master-c k3s[526]: {"level":"warn","ts":"2020-09-21T13:51:57.419-0400","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Sep 21 13:51:57 k8s-master-c k3s[526]: time="2020-09-21T13:51:57.420272065-04:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 21 13:51:57 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:51:57 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:02 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:02 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:03 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:03 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:03 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:03 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:04 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:04 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:07 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:07 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:11 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:11 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:12 k8s-master-c k3s[526]: {"level":"warn","ts":"2020-09-21T13:52:12.420-0400","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Sep 21 13:52:12 k8s-master-c k3s[526]: time="2020-09-21T13:52:12.420999478-04:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 21 13:52:12 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:12 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:17 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:17 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:18 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:18 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:20 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:20 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:22 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:22 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:26 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:26 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:27 k8s-master-c k3s[526]: {"level":"warn","ts":"2020-09-21T13:52:27.421-0400","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///https://127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
Sep 21 13:52:27 k8s-master-c k3s[526]: time="2020-09-21T13:52:27.421641003-04:00" level=info msg="Failed to test data store connection: context deadline exceeded"
Sep 21 13:52:29 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:29 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Sep 21 13:52:32 k8s-master-c k3s[526]: WARNING: 2020/09/21 13:52:32 grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
Edit 1: it appears that after I run sudo systemctl restart k3s, the service becomes active. Maybe there is a race condition or something happening on boot.
Edit 2: it appears that in my case my systemd unit file had After=network.service and NOT After=network-online.service. Sorry for missing that 🤦 The node now comes back healthy after removing the --token and --server flags and rebooting. 🥳 Thanks for the help!
If you want to live dangerously, you can try out the fix via INSTALL_K3S_COMMIT=48ed47c4a3e420fa71c18b2ec97f13dc0659778b
The Kubernetes API will still not come up until quorum is reached; this is just how etcd works. All k3s services should start successfully and wait for a quorum of etcd members to become available, at which point the apiserver will come online.
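To watch quorum come back after a reboot, you can query the embedded etcd directly. This is a sketch assuming the default k3s data directory and certificate paths under /var/lib/rancher/k3s; adjust if your install uses a custom data dir.

```shell
# Check embedded etcd cluster health using k3s's own client certs.
# Paths below are the k3s defaults; run on any server node.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key \
  endpoint status --cluster -w table
```

Once a majority of members report healthy in this output, the apiserver should come back shortly afterwards.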
@brandond the commit reboots successfully, everything looks good. The log is full of messages I do not understand. If anyone would like to check, I attached it. node1.txt
I noticed one thing: the first node takes ~10 minutes to respond to any ssh requests after a reboot. It is reproducible behaviour. ssh becomes available as soon as k3s is available. Any suggestions? :/
@brandond I have installed that commit and the masters are coming online after reboot. I don't see the 10 minute delay like @cubed-it mentioned.
@cubed-it that doesn't sound like a k3s issue. It sounds like something is timing out on boot and delaying network startup. Have you checked your system logs?
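As a general way to find what is stalling boot (not specific to k3s), systemd's timing tools can point at the slow unit. A quick sketch, run after the affected boot:

```shell
# Which units took longest to start on the last boot?
systemd-analyze blame | head -n 10

# The chain of units that gated reaching the default target
systemd-analyze critical-chain

# Warnings and errors from the current boot
journalctl -b -p warning --no-pager | tail -n 50
```

A unit sitting near the top of blame (often a wait-online or mount unit) is usually the culprit for multi-minute delays like the one described above.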
@brandond just in the last few minutes I have been analyzing the problem further and it seems to be the operating system. When I issue a "reboot" command, everything runs fine until "Stopping Restore / save the current clock...". Whatever that means...
So everything is good with k3s ;) Even more than good I have to say! The 1.19 etcd performance compared to 1.18 sqlite is amazing. I observe only half the base load on the CPU.
Postscript: There were kernel, bootloader and EEPROM updates pending. After I installed them all, the problem seems to have disappeared as suddenly as it appeared.
Reproduced the issue using v1.19.1-rc1+k3s1 and validated the fix using commit ID 6b11d860374f938623479b0d06aeb3c1ac87ce86.
Created a cluster with 3 server nodes. Rebooted all nodes. All nodes were able to reconnect. Commands used: Server 1:
curl -sfL https://get.k3s.io | K3S_CLUSTER_INIT=true INSTALL_K3S_COMMIT=6b11d860374f938623479b0d06aeb3c1ac87ce86 sh -
Server 2 and 3:
curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=6b11d860374f938623479b0d06aeb3c1ac87ce86 sh -s - server --token=TOKEN --server=https://FIRSTNODEIP:6443
After reboot:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-172-31-10-253 Ready etcd,master 13m v1.19.2+k3s-6b11d860
ip-172-31-11-247 Ready etcd,master 13m v1.19.2+k3s-6b11d860
ip-172-31-5-59 Ready etcd,master 16m v1.19.2+k3s-6b11d860
@ShylajaDevadiga @brandond
Is there a reason why this fix (the commit ID from @ShylajaDevadiga's comment, which does indeed solve this bug) is not included in the latest v1.19.3+k3s1 release?
This issue has the milestone v1.19.3+k3s1 and is listed as closed, but the fix is not included in the latest
release channel (which is currently v1.19.3+k3s1), so an installation with INSTALL_K3S_CHANNEL=latest
still has this issue.
Not everything makes the planned milestone. The upstream release had several CVEs, so we wanted to turn it around quickly. We've got another release planned soon (independent of upstream patches) that will include backported fixes from the master branch.
Hi @brandond, I am using this setup: Exercise I Performed. PS: I am using k3OS on Node2 and Node3, and I am facing this issue. After a reboot the setup does not come up. Can you suggest anything in particular for this case?
@dharmendrakariya this issue was fixed in the PR linked above. Please confirm that you are running a release that includes the fix.
@brandond okay great, but I think k3OS doesn't have this patch, because I hit this just today.
Ah, in that case you should probably open an issue in the k3os repo so that it can be tracked there.
Oops my bad, will do that. You are right!
Environmental Info: K3s Version: k3s version v1.19.1-rc1+k3s1 (041f18f6)
Node(s) CPU architecture, OS, and Version: Linux cluster01 5.4.0-1018-raspi #20-Ubuntu SMP Sun Sep 6 05:11:16 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
Cluster Configuration: 3 masters
Describe the bug: After a fresh install, all nodes were up correctly. After a complete power-off of all nodes, when the devices come back up, the nodes are not able to reconnect.
Steps To Reproduce: Installation of K3s on the first master node:
On the 2 other masters, installation of K3s with the correct first-node URL and token
After the power off, here is the error on the first created master node
And on the other master nodes
Expected behavior: After a shutdown, all nodes should reconnect. At the very least, one of them should start correctly.
Actual behavior: Connection error to nodes
Additional context / logs: On the first master, here are the cyclic messages
On the other masters, cyclic messages