I'm running into the same problem.
Two etcd servers do not have quorum; if you shut down one the other will exit as well due to quorum loss. See: https://etcd.io/docs/v3.3/faq/#:~:text=An%20etcd%20cluster%20needs%20a,of%20nodes%20necessary%20for%20quorum.
You should always have an odd number of servers when using etcd.
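Quorum for etcd is floor(n/2)+1 members, so a 2-member cluster needs both members healthy and losing either one stalls the datastore. If you want to confirm how many members your embedded etcd actually has, something like the following should work (a sketch; etcdctl is not shipped with k3s, and the TLS paths below are the usual k3s defaults, so adjust if yours differ):
# run on a server node that hosts embedded etcd
$ ETCDCTL_API=3 etcdctl member list -w table \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
    --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
    --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key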
Two etcd servers do not have quorum
@brandond in this case, my questions would be
1. Why is etcd used at all? According to the docs, --datastore-endpoint defaults to sqlite.
2. If etcd was used, why is it set up in an HA scenario by default? All I want is a 2nd regular node running pods, connecting to a single server / apiserver.
You said you have 2 servers and 2 agents, for a total of 4 nodes. You also said that you passed --cluster-init to the first server, to initialize an etcd cluster instead of using SQLite. Is that correct?
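(A quick way to check which datastore a server is actually using is to look at its data directory; the path below is the k3s default, so adjust it if you set --data-dir:)
$ ls /var/lib/rancher/k3s/server/db/
# an "etcd" directory here means embedded etcd is in use;
# a "state.db" file means the default sqlite backend is in use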
Might be a misunderstanding from my side. Actually what I have are two nodes in total running pods, where one of them is acting as API server (i.e. 1 k3s server and 1 k3s agent). I started the server with --cluster-init because I wasn't able to join a 2nd node to the (sqlite-based) server otherwise.
And this setup as described above, crashed (and still crashes) all the time.
But I'm using an etcd node and the problem still isn't solved.
I started the server with --cluster-init because I wasn't able to join a 2nd node to the (sqlite-based) server otherwise.
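For the record, joining a second node as a plain agent does not require --cluster-init on the server; the default sqlite-backed server accepts agents as-is. Roughly (default install script, token path, and port assumed):
# on the server: read the join token
$ sudo cat /var/lib/rancher/k3s/server/node-token
# on the second node: install k3s in agent mode, pointing at the server
$ curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<token> sh -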
The issues you're describing all suggest that you in fact have two etcd servers. Can you provide the output of kubectl get node -o wide on the server, as well as systemctl list-units k3s* from both hosts?
$ kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k3s01 Ready control-plane,etcd,master 12d v1.22.6+k3s1 46.4.XX.XXX 46.4.XX.XXX Ubuntu 20.04.4 LTS 5.4.0-100-generic containerd://1.5.9-k3s1
k3s02 NotReady <none> 12d v1.22.6+k3s1 144.76.XXX.XX <none> Ubuntu 20.04.4 LTS 5.4.0-100-generic containerd://1.5.9-k3s1
# master
$ systemctl list-units k3s* --all
UNIT LOAD ACTIVE SUB DESCRIPTION
k3s.service loaded active running Lightweight Kubernetes
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
To show all installed unit files use 'systemctl list-unit-files'.
# 2nd node
$ systemctl list-units k3s* --all
UNIT LOAD ACTIVE SUB DESCRIPTION
k3s-agent.service loaded active running Lightweight Kubernetes
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
To show all installed unit files use 'systemctl list-unit-files'.
Here's how it should work:
The errors on your server prior to the crash indicate that datastore latency was high. High datastore latencies will lead to a crash of the entire k3s server process if they exceed ~10 seconds. I see some in the logs that are as high as 3 seconds, but I suspect that they were higher at other times:
Trace[1147257116]: [3.304232898s] [3.304232898s] END
The most common cause of high datastore latency is insufficient disk throughput. Embedded etcd should be used with SSD storage or better, preferably not sharing the same block device as your workload if your workload is disk-intensive.
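If you want to measure it, an fdatasync-heavy fio run approximates etcd's write pattern; the usual guidance is that the 99th percentile fdatasync latency should stay well under 10ms. Something like this (a sketch; run it in a scratch directory on the same block device that holds /var/lib/rancher/k3s/server/db):
$ mkdir -p /var/lib/rancher/fio-test && cd /var/lib/rancher/fio-test
$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=. --size=22m --bs=2300 --name=etcd-disk-check
# then look at the fsync/fdatasync latency percentiles in the output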
Hey I am having this kind of issue myself with k3s v1.22.7.
What I am trying to do is to migrate from a k3os single master installation to a MicroOS single master installation.
The process I followed to achieve this was to add the new master to the cluster, power off the old master, run etcdctl member remove from the new master, and run kubectl delete node from the new master. The process worked correctly, but now I have the same panic: unreachable error.
Now I have some kind of clone of the running k3s env, so I can fiddle with that and not break my "PROD" env (this is a self-hosted homelab).
Storage is not an issue, as I was running fine with k3os, which uses k3s v1.22.2. I had some requests similar to "apply took too long", but nothing greater than 250 ms (it's running on a 2-mirror ZFS storage backend).
I could help by providing some logs or trying a new version. This seems to be some kind of infinite loop or something in k3s, because the time on those Traces begins to "accumulate".
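(To spell out the removal sequence I mean, roughly, with placeholders for the member ID and node name, and the same etcdctl TLS flags as shown earlier in the thread:)
# on the surviving master
$ etcdctl member list -w table        # note the hex ID of the powered-off master
$ etcdctl member remove <member-id>
$ kubectl delete node <old-master-name>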
You want to use --cluster-reset on the new node to reset etcd back to a single node cluster. Either that or take a snapshot on the existing node and then restore it on the new one.
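Roughly like this (the snapshot name and path are examples; the default snapshot directory is /var/lib/rancher/k3s/server/db/snapshots):
# option 1: reset embedded etcd on the new node back to a single-member cluster
$ systemctl stop k3s
$ k3s server --cluster-reset          # performs the reset and exits when done
$ systemctl start k3s
# option 2: snapshot on the old node, then restore on the new one
$ k3s etcd-snapshot save --name pre-migration
# (copy the snapshot file to the new node first)
$ k3s server --cluster-reset \
    --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/pre-migration-<node>-<timestamp>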
Omitted that 😅 ran that and rebooted the master VM after running k3s --cluster-init. Should that work?
I did not see any significant logs about quorum after that
Environmental Info: K3s Version:
Node(s) CPU architecture, OS, and Version:
Dedicated server (i.e. no VM) with Intel Xeon E3-1246V3 and 32 GB DDR3 RAM.
Cluster Configuration:
2 servers, 2 agents at first (no HA setup); for testing purposes I permanently shut down the 2nd node, with no effect on the issue.
Describe the bug:
k3s crashes constantly; the time it takes to crash is unrelated to workload (there are only some test pods deployed) and varies from a few minutes to 10+ hours. It all ends with
Seems related to #2059, but I am neither using an SD card, nor do I have any performance issues on that server. I even tried to put /etc/rancher on an SSD. No change.
Steps To Reproduce:
Installed k3s with server --cluster-init --disable traefik --disable servicelb. Everything else was left to default.
Expected behavior:
I used --cluster-init only to be able to join a 2nd node, not for any kind of HA.
Actual behavior:
Additional context / logs:
Backporting