UPD: These errors during snapshot creation start appearing once the DB grows to about 500MB; with a 441MB database this does not happen and snapshot creation goes through without issues.
Also, creating snapshots of a 1.6G DB with the rke2 CLI works fine, without any issues:
root@ip-172-23-74-105:~# rke2 etcd-snapshot save
INFO[0032] Snapshot on-demand-ip-172-23-74-105-1721661041 saved.
INFO[0032] Snapshot on-demand-ip-172-23-74-105-1721661041 saved.
It really sounds like you're running out of disk space, and the database files on disk are getting corrupted. The errors you're showing are regarding the internal checkpoint snapshots within the database, NOT the snapshots that are taken for backups. I'm also confused why RKE2 is being restarted when you try to save a snapshot. All of these things point to rke2 crashing due to issues with the node, and then etcd crashing AGAIN on startup due to the files being corrupt.
Can you confirm how much free space you have on the node in question? What kind of storage are you using here?
Thanks for looking into this @brandond
It really sounds like you're running out of disk space, and the database files on disk are getting corrupted. The errors you're showing are regarding the internal checkpoint snapshots within the database, NOT the snapshots that are taken for backups. I'm also confused why RKE2 is being restarted when you try to save a snapshot. All of these things point to rke2 crashing due to issues with the node, and then etcd crashing AGAIN on startup due to the files being corrupt.
This indeed sounds about right. I had thought that an rke2-server restart was expected to happen during snapshot creation; if it is not, then I clearly have an issue going on here... By the way, I think rke2-server is not restarting when I create a snapshot with a database size <400MB, and everything works fine.
There is plenty of space available:
Filesystem Size Used Avail Use% Mounted on
/dev/root 146G 16G 131G 11% /
I have the default configuration for etcd, so WAL files are stored in: /var/lib/rancher/rke2/server/db/etcd/member/wal
What I have also noticed: in the rke2-server logs I saw that it runs defrag, which sometimes fails. I have tried running it with etcdctl and I see a pretty similar picture, where with a DB size <400MB defrag works fine:
root@ip-172-23-70-32:~# /opt/kubernetes-wise/etcd/bin/etcdctl-tw defrag --cluster true
Executing in container: 2b95eda8b86d3315625bd6a18347befbe2f9683920ae95dc17d4f699a4bdfed9
-----------------------
Finished defragmenting etcd member[https://172.23.72.77:2379]
Finished defragmenting etcd member[https://172.23.71.127:2379]
Finished defragmenting etcd member[https://172.23.70.32:2379]
Finished defragmenting etcd member[https://172.23.72.109:2379]
Finished defragmenting etcd member[https://172.23.70.159:2379]
Finished defragmenting etcd member[https://172.23.71.196:2379]
root@ip-172-23-70-32:~# /opt/kubernetes-wise/etcd/bin/etcdctl-tw endpoint-status
Executing in container: 2b95eda8b86d3315625bd6a18347befbe2f9683920ae95dc17d4f699a4bdfed9
-----------------------
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.23.72.77:2379 | 3090a09999c2130b | 3.5.13 | 387 MB | true | false | 2 | 213186 | 213186 | |
| https://172.23.71.127:2379 | 54ba414cb1ac9001 | 3.5.13 | 387 MB | false | false | 2 | 213187 | 213187 | |
| https://172.23.70.32:2379 | 68c823547be4c89c | 3.5.13 | 388 MB | false | false | 2 | 213187 | 213187 | |
| https://172.23.72.109:2379 | c64eed6dfe68e21d | 3.5.13 | 370 MB | false | false | 2 | 213188 | 213188 | |
| https://172.23.70.159:2379 | d8ed66ceb9e2f7af | 3.5.13 | 333 MB | false | false | 2 | 213188 | 213188 | |
| https://172.23.71.196:2379 | f0e7829c9749c886 | 3.5.13 | 333 MB | false | false | 2 | 213189 | 213189 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
But when the DB size grows further, it starts timing out:
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.23.72.77:2379 | 3090a09999c2130b | 3.5.13 | 610 MB | true | false | 2 | 381990 | 381990 | |
| https://172.23.71.127:2379 | 54ba414cb1ac9001 | 3.5.13 | 615 MB | false | false | 2 | 381990 | 381990 | |
| https://172.23.70.32:2379 | 68c823547be4c89c | 3.5.13 | 598 MB | false | false | 2 | 381990 | 381990 | |
| https://172.23.72.109:2379 | c64eed6dfe68e21d | 3.5.13 | 607 MB | false | false | 2 | 381990 | 381990 | |
| https://172.23.70.159:2379 | d8ed66ceb9e2f7af | 3.5.13 | 600 MB | false | false | 2 | 381990 | 381990 | |
| https://172.23.71.196:2379 | f0e7829c9749c886 | 3.5.13 | 596 MB | false | false | 2 | 381990 | 381990 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
root@ip-172-23-70-32:~# /opt/kubernetes-wise/etcd/bin/etcdctl-tw defrag --cluster true
Executing in container: 2b95eda8b86d3315625bd6a18347befbe2f9683920ae95dc17d4f699a4bdfed9
-----------------------
{"level":"warn","ts":"2024-07-22T18:29:24.112434Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.72.77:2379] (context deadline exceeded)
{"level":"warn","ts":"2024-07-22T18:29:29.11329Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.71.127:2379] (context deadline exceeded)
{"level":"warn","ts":"2024-07-22T18:29:34.117321Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.70.32:2379] (context deadline exceeded)
{"level":"warn","ts":"2024-07-22T18:29:39.121301Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.72.109:2379] (context deadline exceeded)
{"level":"warn","ts":"2024-07-22T18:29:44.1253Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.70.159:2379] (context deadline exceeded)
{"level":"warn","ts":"2024-07-22T18:29:49.129294Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000436000/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to defragment etcd member[https://172.23.71.196:2379] (context deadline exceeded)
FATA[0030] execing command in container: command terminated with exit code 1
So in this case it has nothing to do with RKE2 or Rancher; I simply can't run defrag on etcd once it reaches 400+MB in size.
I am running out of something somewhere, but I can't figure out what it is.
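For what it's worth, the "context deadline exceeded" messages above are consistent with etcdctl's default --command-timeout of 5s; defragmenting a several-hundred-MB database can easily take longer than that. A minimal sketch of retrying with a longer client-side timeout, assuming the default RKE2 certificate locations (and only as a diagnostic on a test cluster, since defrag blocks IO):
# raise the client-side timeout so the defrag call is not cut off at the default 5s
ETCDCTL_API=3 etcdctl defrag --cluster \
  --command-timeout=120s \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key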
By the way, in the etcd logs:
{"level":"warn","ts":"2024-07-22T19:13:56.869278Z","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"2.000422535s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/registry/health\" ","response":"","error":"context deadline exceeded"}
Yeah, your disk is too slow. I asked earlier what sort of disks these are and didn't see a response. Etcd really needs fast SSD or other non-rotational storage to perform well.
Also, don't manually defrag the datastore. This pauses all IO while the files are rewritten, and is VERY disruptive to the cluster. RKE2 automatically defrags each cluster member when they are starting, before the apiserver is using them to serve traffic.
I was looking at this: https://github.com/k3s-io/k3s/discussions/9207 and it looks somewhat similar.
Yeah, your disk is too slow. I asked earlier what sort of disks these are and didn't see a response. Etcd really needs fast SSD or other non-rotational storage to perform well.
Oh, I missed that. I am using a gp3 volume.
Also, don't manually defrag the datastore. This pauses all IO while the files are rewritten, and is VERY disruptive to the cluster. RKE2 automatically defrags each cluster member when they are starting, before the apiserver is using them to serve traffic.
Yeah, thanks, I was doing this on test clusters just to validate.
That should generally be OK. Do you have other IO intensive tasks using these same disks? What do the cloudwatch metrics for these disks show when you get IO latency warnings from etcd?
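As an aside, one way to sanity-check fsync latency on the etcd volume directly (independent of CloudWatch) is the fio recipe commonly used for etcd disk checks; the directory and sizes below are illustrative, and the commonly cited guidance is that the 99th percentile of fdatasync durations should stay under roughly 10ms:
# write ~22MB in 2300-byte chunks, issuing fdatasync after every write, to mimic etcd's WAL pattern
fio --name=etcd-fsync-test \
    --directory=/var/lib/rancher/rke2/server/db \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300
# then inspect the fsync/fdatasync percentile section of the output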
I think I have managed to improve things by using a separate volume for etcd with more IOPS and throughput. So storage performance was indeed part of the problem.
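(For reference, gp3 IOPS and throughput can be raised independently of volume size; a sketch with the AWS CLI, where the volume ID and the numbers are purely illustrative:)
# bump a gp3 volume to 6000 IOPS and 500 MiB/s throughput
aws ec2 modify-volume \
  --volume-id vol-0123456789abcdef0 \
  --volume-type gp3 \
  --iops 6000 \
  --throughput 500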
I have also noticed the following errors on clusters with a database size > 3-4GB:
msg="Failed to process config: failed to process /var/lib/rancher/rke2/server/manifests/rancher/rke2-etcd-snapshot-extra-metadata.yaml: failed to list /v1, Kind=ConfigMap for kube-system/rke2-etcd-snapshot-extra-metadata: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (4063225176 vs. 2147483647)"
@brandond do you have any idea if this limit is configurable anywhere?
I have also noticed that etcd snapshots are created only on the init node, while I was under the impression that an etcd snapshot is supposed to be created on every node with the etcd role. I have also noticed your comment here https://github.com/rancher/rke2/issues/6283#issuecomment-2214818000 mentioning the same thing.
But in my case I have a single node creating 3 snapshots instead of each node creating its own.
Not sure if that might be a misconfiguration or a behaviour change in Rancher.
Oh, ignore this:
msg="Failed to process config: failed to process /var/lib/rancher/rke2/server/manifests/rancher/rke2-etcd-snapshot-extra-metadata.yaml: failed to list /v1, Kind=ConfigMap for kube-system/rke2-etcd-snapshot-extra-metadata: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (4063225176 vs. 2147483647)"
This is happening, I believe, because I am actually using large ConfigMaps to inflate the DB size, and when it gets to more than 2GB the grpc client hits its 2GB limit.
It would still be nice to know if that is configurable somewhere. I understand this is not a common usage pattern, but it is definitely not unrealistic.
I have noticed that etcd snapshots are created only on the init node
This is https://github.com/rancher/rke2/issues/6325
This is happening, I believe, because I am actually using large ConfigMaps to inflate the DB size, and when it gets to more than 2GB the grpc client hits its 2GB limit.
Yeah I can't say I've seen that before, usually you get a "Request entity too large" error from the apiserver when the ConfigMap is >1MB, before you will run into problems with grpc message size.
Etcd also has a maximum database size that defaults to 2GB. You can raise that up as far as 8GB with etcd args. If you exceed that the datastore will go read-only though, instead of giving you RPC errors.
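For anyone else hitting the quota, a minimal sketch of raising it through RKE2's etcd argument passthrough; the value is just 8 GiB expressed in bytes, and rke2-server needs a restart to pick it up:
# append the etcd quota override to the RKE2 config (avoid duplicating the key if it already exists)
cat >> /etc/rancher/rke2/config.yaml <<'EOF'
etcd-arg:
  - quota-backend-bytes=8589934592
EOF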
Oh, that's good to know about https://github.com/rancher/rke2/issues/6325, thanks for sharing it.
Yeah I can't say I've seen that before, usually you get a "Request entity too large" error from the apiserver when the ConfigMap is >1MB, before you will run into problems with grpc message size. Etcd also has a maximum database size that defaults to 2GB. You can raise that up as far as 8GB with etcd args. If you exceed that the datastore will go read-only though, instead of giving you RPC errors.
I also initially thought that I was hitting either the max resource size or the etcd DB size limit, but apparently that's not it.
I have 8GB for etcd quota-backend-bytes.
... failed to list /v1, Kind=ConfigMap ...
I believe it is trying to get/list all ConfigMaps or something, and their total size is more than 2GB.
trying to send message larger than max (4063225176 vs. 2147483647)
In this case I indeed had 4GB of ConfigMaps.
It looks like some grpc client has a 2GB message size limit, which causes rke2-server to fail to process the config (2147483647 in the error is the int32 maximum).
I'm not sure why it would be trying to list all the configmaps just to patch that one, that is pretty odd. This appears to be in the deploy controller (wrangler desiredset/apply) that manages deployment of manifests from the /var/lib/rancher/rke2/server/manifests dir though, not the snapshot controller itself.
Did you put all 4GB of configmaps in that one file!?
Did you put all 4GB of configmaps in that one file!?
No, no. I just filled up the cluster with lots of ~500KB ConfigMaps to inflate the DB size, to be able to reproduce the issues I had when the etcd DB was larger than 400+MB. Using a separate volume for etcd seems to have helped with the issues I had originally, so then I thought I'd see what happens if I inflate the DB even more. Once I reached 3GB+ (mostly by creating ConfigMaps) I started noticing that error.
Also, I was creating the ConfigMaps with just kubectl create -f ..., not using the /var/lib/rancher/rke2/server/manifests dir.
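Something along these lines reproduces that kind of DB growth; this is only a sketch, with illustrative names, counts, and kubectl invocation (adjust the loop to reach the target DB size):
# generate a ~500KB payload and create many ConfigMaps from it
head -c 400000 /dev/urandom | base64 > /tmp/payload.txt
for i in $(seq 1 2000); do
  kubectl create configmap inflate-$i --from-file=data=/tmp/payload.txt
done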
Just sharing this here, as it could perhaps be helpful for someone struggling with the same issues:
We have managed to get the best etcd performance results using an AWS instance type with NVMe-based SSDs, used exclusively for etcd data. This helped get rid of the rke2-server panics we had initially.
I am also looking forward to a fix for https://github.com/rancher/rke2/issues/6325, as I believe it caused rke2-server failures when snapshot creation was initiated:
Jul 25 14:10:44 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:44Z" level=error msg="etcd-snapshot error ID 48259: snapshot save already in progress"
Jul 25 14:10:44 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:44Z" level=error msg="Sending HTTP 500 response to 127.0.0.1:58810: etcd-snapshot error ID 48259"
Jul 25 14:10:44 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:44Z" level=debug msg="Wrote ping"
Jul 25 14:10:46 ip-172-23-72-134 systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
Jul 25 14:10:46 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:46Z" level=info msg="Proxy done" err="context canceled" url="wss://172.23.70.107:9345/v1-rke2/connect"
Jul 25 14:10:46 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:46Z" level=info msg="Shutting down k3s.cattle.io/v1, Kind=Addon workers"
Jul 25 14:10:46 ip-172-23-72-134 rke2[24950]: time="2024-07-25T14:10:46Z" level=info msg="Shutting down /v1, Kind=ServiceAccount workers"
Jul 25 14:10:46 ip-172-23-72-134 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
Especially when the DB is quite large and it takes about 1 minute to create a snapshot, rke2-server keeps trying to initiate a new snapshot operation while the previous one is still running...
As a side note, we wanted to migrate to Rancher-managed control plane nodes with machine_pools, but apparently there is no way to configure a custom volume mount with the current version of the Terraform provider on the rancher2_machine_config_v2 resource. It is not even possible to tune IOPS / throughput for the root volume, so we had to stick with AWS ASG custom control plane nodes.
I am going to close the ticket, as the main issue is now addressed.
Thanks @brandond for helping with this one 🙇🏽 🍻
We have managed to get the best etcd performance results using an AWS instance type with NVMe-based SSDs, used exclusively for etcd data.
This has always been a best practice. etcd should basically never be run on rotational storage, and if your workload is IO intensive you definitely need to put the etcd datastore on a separate disk. etcd calls fsync to flush all outstanding disk IO before acknowledging any datastore write; if your workload is also writing to disk, then the workload writes will also need to be synced before the etcd transaction can be completed.
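For anyone setting this up, a sketch of dedicating a local NVMe (instance-store) disk to the etcd datastore; the device name and filesystem are illustrative, and this should be done before rke2-server first starts (or after stopping it and moving the existing db directory):
# format the instance-store device and mount it where RKE2 keeps the etcd datastore
mkfs.ext4 /dev/nvme1n1
mkdir -p /var/lib/rancher/rke2/server/db
mount /dev/nvme1n1 /var/lib/rancher/rke2/server/db
# persist the mount across reboots (note: instance-store contents do not survive an instance stop)
echo '/dev/nvme1n1 /var/lib/rancher/rke2/server/db ext4 defaults,noatime 0 2' >> /etc/fstab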
Especially when the DB is quite large and it takes about 1 minute to create a snapshot, rke2-server keeps trying to initiate a new snapshot operation while the previous one is still running...
rke2 doesn't retry the snapshot operations. Additionally, your logs show that something external is actively stopping the rke2-server service. I suspect that both of these are being done by rancher-system-agent. You might check the logs over there to see why it's doing that.
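For example (assuming the agent runs as the usual rancher-system-agent systemd unit on Rancher-provisioned nodes):
# look for plan application / service restarts around the time of the snapshot
journalctl -u rancher-system-agent --since "1 hour ago" | grep -iE 'snapshot|restart|plan'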
@brandond I have tried to use custom etcd data/wal dirs on an rke2 standalone cluster and ran into an issue that blocks snapshot creation:
rke2 etcd-snapshot save
...
FATA[0000] etcd database not found in /var/lib/rancher/rke2/server
On a Rancher-managed cluster it works fine. The Rancher config is almost identical, so I am a bit confused why it does not work with standalone rke2.
It doesn't look like there is a flag to configure the rke2 CLI for a custom etcd data path either.
Also, I did check /var/lib/rancher/rke2/server/db/etcd/config, which is actually pointing to the right data-dir (/opt/kubernetes/etcd/data) and wal-dir (/opt/kubernetes/etcd/wal).
Do you have any idea how I can configure the standalone rke2 CLI to create a snapshot with a custom etcd data path? 🙏🏽
Oh, I think it might be happening because I am using standalone rke2 1.27.6; going to try with 1.28.x.
Update: it worked fine with 1.28.11.
Environmental Info: RKE2 Version: v1.28.11+rke2r1
Rancher: 2.8.3
Node(s) CPU architecture, OS, and Version:
Cluster Configuration: 3 servers, 6 workers
Describe the bug: Creating a snapshot for a managed RKE2 cluster causes issues and sometimes kills a server node when the etcd database is larger than 1G.
Steps To Reproduce:
Expected behavior: Creating a snapshot or applying a cluster configuration change happens seamlessly, without any issues affecting cluster availability.
Actual behavior: Creating a snapshot or applying a cluster configuration change affects cluster API availability; server nodes fail.
Additional context / logs: Not only does creating a snapshot trigger this issue; cluster configuration changes seem to have a similar effect. It looks like similar processes happen while applying cluster configuration changes, causing similar issues.
Database size used to reproduce issue:
1.68 GiB
We also have a bunch of smaller clusters where this issue doesn't happen.
Sometimes after 10-15 minutes of crashlooping it recovers on its own, but mostly that didn't happen and we had to rotate the server node.
The rke2-server service is crash looping, panicking with several different errors; see the full log linked below.
full rke2-server log: https://gist.github.com/riuvshyn/05755e51a72694d2abb503aa0915b1bb