I'm running into the same problem.
Two etcd servers do not have quorum; if you shut down one the other will exit as well due to quorum loss. See: https://etcd.io/docs/v3.3/faq/#:~:text=An%20etcd%20cluster%20needs%20a,of%20nodes%20necessary%20for%20quorum.
You should always have an odd number of servers when using etcd.
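Quorum for etcd is floor(n/2)+1 members, so a 2-member cluster needs both members healthy and losing either one stalls the datastore. If you want to confirm how many members your embedded etcd actually has, something like the following should work (a sketch; etcdctl is not shipped with k3s, and the TLS paths below are the usual k3s defaults, so adjust if yours differ):
# run on a server node that hosts embedded etcd
$ ETCDCTL_API=3 etcdctl member list -w table \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/var/lib/rancher/k3s/server/tls/etcd/server-ca.crt \
    --cert=/var/lib/rancher/k3s/server/tls/etcd/server-client.crt \
    --key=/var/lib/rancher/k3s/server/tls/etcd/server-client.key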
Two etcd servers do not have quorum
@brandond in this case, my questions would be
1. Why is etcd used at all? According to the docs, --datastore-endpoint defaults to sqlite.
2. If etcd was used, why is it set up in an HA scenario by default? All I want is a 2nd regular node running pods, connecting to a single server / apiserver.
You said you have 2 servers and 2 agents, for a total of 4 nodes. You also said that you passed --cluster-init to the first server, to initialize an etcd cluster instead of using SQLite. Is that correct?
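(A quick way to check which datastore a server is actually using is to look at its data directory; the path below is the k3s default, so adjust it if you set --data-dir:)
$ ls /var/lib/rancher/k3s/server/db/
# an "etcd" directory here means embedded etcd is in use;
# a "state.db" file means the default sqlite backend is in use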
Might be a misunderstanding from my side. Actually what I have are two nodes in total running pods, where one of them is acting as API server (i.e. 1 k3s server and 1 k3s agent). I started the server with --cluster-init because I wasn't able to join a 2nd node to the (sqlite-based) server otherwise.
And this setup as described above, crashed (and still crashes) all the time.
But I'm using an etcd node and the problem still isn't solved.
I started the server with --cluster-init because I wasn't able to join a 2nd node to the (sqlite-based) server otherwise.
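For the record, joining a second node as a plain agent does not require --cluster-init on the server; the default sqlite-backed server accepts agents as-is. Roughly (default install script, token path, and port assumed):
# on the server: read the join token
$ sudo cat /var/lib/rancher/k3s/server/node-token
# on the second node: install k3s in agent mode, pointing at the server
$ curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<token> sh -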
The issues you're describing all suggest that you in fact have two etcd servers. Can you provide the output of kubectl get node -o wide on the server, as well as systemctl list-units k3s* from both hosts?
$ kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k3s01 Ready control-plane,etcd,master 12d v1.22.6+k3s1 46.4.XX.XXX 46.4.XX.XXX Ubuntu 20.04.4 LTS 5.4.0-100-generic containerd://1.5.9-k3s1
k3s02 NotReady <none> 12d v1.22.6+k3s1 144.76.XXX.XX <none> Ubuntu 20.04.4 LTS 5.4.0-100-generic containerd://1.5.9-k3s1
# master
$ systemctl list-units k3s* --all
UNIT LOAD ACTIVE SUB DESCRIPTION
k3s.service loaded active running Lightweight Kubernetes
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
To show all installed unit files use 'systemctl list-unit-files'.
# 2nd node
$ systemctl list-units k3s* --all
UNIT LOAD ACTIVE SUB DESCRIPTION
k3s-agent.service loaded active running Lightweight Kubernetes
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
To show all installed unit files use 'systemctl list-unit-files'.
Here's how it should work:
The errors on your server prior to the crash indicate that datastore latency was high. High datastore latencies will lead to a crash of the entire k3s server process if they exceed ~10 seconds. I see some in the logs that are as high as 3 seconds, but I suspect that they were higher at other times:
Trace[1147257116]: [3.304232898s] [3.304232898s] END
The most common cause of high datastore latency is insufficient disk throughput. Embedded etcd should be used with SSD storage or better, preferably not sharing the same block device as your workload if your workload is disk-intensive.
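If you want to measure it, an fdatasync-heavy fio run approximates etcd's write pattern; the usual guidance is that the 99th percentile fdatasync latency should stay well under 10ms. Something like this (a sketch; run it in a scratch directory on the same block device that holds /var/lib/rancher/k3s/server/db):
$ mkdir -p /var/lib/rancher/fio-test && cd /var/lib/rancher/fio-test
$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=. --size=22m --bs=2300 --name=etcd-disk-check
# then look at the fsync/fdatasync latency percentiles in the output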
Hey I am having this kind of issue myself with k3s v1.22.7.
What I am trying to do is to migrate from a k3os single master installation to a MicroOS single master installation.
The process I followed to achieve this was to add the new master to the cluster, power off the old master, run etcdctl member remove from the new master, and run kubectl delete node from the new master. The process worked correctly, but now I have the same panic: unreachable error.
Now I have some kind of clone of the running k3s env, so I can fiddle with that and not break my "PROD" env (this is a self-hosted homelab).
Storage is not an issue, as I was running fine with k3os, which uses k3s v1.22.2. I had some requests similar to "apply took too long", but nothing greater than 250 ms (it's running on a 2-mirror ZFS storage backend).
I could help by providing some logs or trying a new version. This seems to be some kind of infinite loop or something in k3s, because the time on those Traces begins to "accumulate".
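(To spell out the removal sequence I mean, roughly, with placeholders for the member ID and node name, and the same etcdctl TLS flags as shown earlier in the thread:)
# on the surviving master
$ etcdctl member list -w table        # note the hex ID of the powered-off master
$ etcdctl member remove <member-id>
$ kubectl delete node <old-master-name>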
You want to use --cluster-reset on the new node to reset etcd back to a single node cluster. Either that or take a snapshot on the existing node and then restore it on the new one.
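Roughly like this (the snapshot name and path are examples; the default snapshot directory is /var/lib/rancher/k3s/server/db/snapshots):
# option 1: reset embedded etcd on the new node back to a single-member cluster
$ systemctl stop k3s
$ k3s server --cluster-reset          # performs the reset and exits when done
$ systemctl start k3s
# option 2: snapshot on the old node, then restore on the new one
$ k3s etcd-snapshot save --name pre-migration
# (copy the snapshot file to the new node first)
$ k3s server --cluster-reset \
    --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/pre-migration-<node>-<timestamp>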
Omitted that 😅 ran that and rebooted the master VM after running k3s --cluster-init. Should that work?
I did not see any significant logs about quorum after that
Environmental Info: K3s Version:
Node(s) CPU architecture, OS, and Version:
Dedicated server (i.e. no VM) with Intel Xeon E3-1246V3 and 32 GB DDR3 RAM.
Cluster Configuration:
2 servers, 2 agents at first (no HA setup); for testing purposes I permanently shut down the 2nd node, with no effect on the issue.
Describe the bug:
k3s crashes constantly; the time it takes to crash is unrelated to workload (there are only some test pods deployed) and varies from a few minutes to 10+ hours. It all ends with
Seems related to #2059, but I am neither using an SD card, nor do I have any performance issues on that server. I even tried to put /etc/rancher on an SSD. No change.
Steps To Reproduce:
Installed k3s with server --cluster-init --disable traefik --disable servicelb. Everything else was left to default.
Expected behavior:
I used --cluster-init only to be able to join a 2nd node, not for any kind of HA.
Actual behavior:
Additional context / logs:
Backporting