Open GeorgeJose7 opened 1 week ago
Have you checked the etcd pod logs under /var/log/pods
on the nodes to see why etcd is not available? The usual cause of this is insufficient disk performance causing etcd requests from the apiserver to time out.
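If it helps, here is a minimal sketch of how to pull those logs; the kube-system_etcd-* directory and container names are the kubelet defaults on rke2 and may differ on your nodes.

```sh
# List the etcd pod log directories (kubelet default layout:
# /var/log/pods/<namespace>_<pod>_<uid>/<container>/<restart>.log).
ls -d /var/log/pods/kube-system_etcd-*

# Scan the etcd container logs for slow-request and election messages.
grep -hE 'took too long|slow fdatasync|elect|leader' \
  /var/log/pods/kube-system_etcd-*/etcd/*.log | tail -n 50
```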
I have checked the logs on kube-server-1. It looks like etcd is taking longer than expected to serve reads. Here are the logs:
```
2024-11-06T18:55:51.985876751Z stderr F {"level":"info","ts":"2024-11-06T18:55:51.983864Z","caller":"traceutil/trace.go:171","msg":"trace[1844191136] range","detail":"{range_begin:/registry/health; range_end:; }","duration":"2.000409338s","start":"2024-11-06T18:55:49.983425Z","end":"2024-11-06T18:55:51.983835Z","steps":["trace[1844191136] 'agreement among raft nodes before linearized reading'  (duration: 2.000157497s)"],"step_count":1}
2024-11-06T18:55:51.985963371Z stderr F {"level":"warn","ts":"2024-11-06T18:55:51.984076Z","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2024-11-06T18:55:49.983395Z","time spent":"2.000649112s","remote":"127.0.0.1:34022","response type":"/etcdserverpb.KV/Range","request count":0,"request size":18,"response count":0,"response size":0,"request content":"key:\"/registry/health\" "}
2024-11-06T18:55:52.347115864Z stderr F {"level":"warn","ts":"2024-11-06T18:55:52.346369Z","caller":"etcdserver/v3_server.go:908","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":15287912030257974253,"retry-timeout":"500ms"}
2024-11-06T18:55:52.396209139Z stderr F {"level":"warn","ts":"2024-11-06T18:55:52.394808Z","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"1.999732478s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/health\" ","response":"","error":"context deadline exceeded"}
2024-11-06T18:55:52.397670559Z stderr F {"level":"info","ts":"2024-11-06T18:55:52.396752Z","caller":"traceutil/trace.go:171","msg":"trace[1249616425] range","detail":"{range_begin:/registry/health; range_end:; }","duration":"2.001882486s","start":"2024-11-06T18:55:50.394697Z","end":"2024-11-06T18:55:52.396582Z","steps":["trace[1249616425] 'agreement among raft nodes before linearized reading'  (duration: 1.999718624s)"],"step_count":1}
2024-11-06T18:55:52.399843562Z stderr F {"level":"warn","ts":"2024-11-06T18:55:52.399212Z","caller":"v3rpc/interceptor.go:197","msg":"request stats", ... ,"response type":"/etcdserverpb.KV/Range","request count":0,"request size": ... }
2024-11-06T18:55:52.848991702Z stderr F {"level":"warn","ts":"2024-11-06T18:55:52.847898Z","caller":"etcdserver/v3_server.go:908","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":15287912030257974253,"retry-timeout":"500ms"}
```
The error messages from the other servers are:
```
2024-11-06T18:51:40.493270255Z stderr F {"level":"warn", ... ,"msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_...","remote-peer-id":"fc2f1b8b3167a379", ... ,"error":"... connect: network is unreachable"}
2024-11-06T18:51:40.508224211Z stderr F {"level":"warn", ... ,"msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"cdda7e363b3bd429", ... ,"error":"... connect: network is unreachable"}
2024-11-06T18:51:41.730856574Z stderr F {"level":"info","ts":"2024-11-06T18:51:41.729627Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"5105870fd0c8d209 is starting a new election at term 4"}
2024-11-06T18:51:41.730933809Z stderr F {"level":"info","ts":"2024-11-06T18:51:41.729751Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"5105870fd0c8d209 became pre-candidate at term 4"}
2024-11-06T18:51:41.730948763Z stderr F {"level":"info","ts":"2024-11-06T18:51:41.729802Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"5105870fd0c8d209 received MsgPreVoteResp from 5105870fd0c8d209 at term 4"}
2024-11-06T18:51:41.730960939Z stderr F {"level":"info","ts":"2024-11-06T18:51:41.729838Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"5105870fd0c8d209 [logterm: 4, index: 620934] sent MsgPreVote request to cdda7e363b3bd429 at term 4"}
2024-11-06T18:51:41.730970505Z stderr F {"level":"info","ts":"2024-11-06T18:51:41.729859Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"5105870fd0c8d209 [logterm: 4, index: 620934] sent MsgPreVote request to fc2f1b8b3167a379 at term 4"}
```
Is this a disk performance issue, or am I still missing something?
Insufficient disk IO is the most frequent cause, yes. It is also possible that the network between nodes is unreliable, or the etcd process is CPU starved, but these are less likely.
What does iostat -cxd 5 show when this is going on?
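Something like the following is usually enough to tell whether the disk is the bottleneck; the etcd data-dir path and the local metrics port (2381) are assumptions based on a default rke2 install, so adjust as needed.

```sh
# Per-device latency and utilization, refreshed every 5 seconds. Watch w_await and %util
# for the device backing the etcd data dir (assumed /var/lib/rancher/rke2/server/db here).
iostat -cxd 5

# etcd exposes fsync/commit latency histograms; sustained values above a few tens of
# milliseconds usually mean the disk is too slow for etcd. The 127.0.0.1:2381 metrics
# endpoint is an assumption and may not be enabled on this cluster.
curl -s http://127.0.0.1:2381/metrics | \
  grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'
```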
I am trying to create an RKE2 HA cluster behind an NGINX reverse-proxy load balancer. The cluster spins up and I am able to install apps on it as well. After about 24 hours, etcd becomes unavailable and the cluster becomes unhealthy.
What am I missing here?
I am running the cluster on Hetzner Cloud.
Error Message
Below is my configuration:
NGINX load balancer
RKE2 config
/etc/rancher/rke2/config.yaml
kube-server-1
kube-server-2
kube-server-3
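For reference, these are the kinds of checks that can be run from a server node once etcd becomes unavailable, to see which member is actually down; the pod name (etcd-kube-server-1) and the certificate paths are assumptions based on a default rke2 install.

```sh
# Find the etcd static pods and see which member is unhealthy.
kubectl -n kube-system get pods -o wide | grep etcd

# Query etcd directly from inside one of its pods. The pod name and certificate
# paths below are assumptions based on a default rke2 install.
kubectl -n kube-system exec etcd-kube-server-1 -- etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  endpoint status --cluster -w table

# The rke2-server journal also shows whether the apiserver is losing its etcd connection.
journalctl -u rke2-server --since "1 hour ago" | grep -iE 'etcd|lease|timeout' | tail -n 30
```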