Can you provide some information on your environment? Specifically:
- How long does `time etcdctl defrag [...]` take on this node?
- What is the output of `etcdctl endpoint status -w json`?
- Can you get `iostat -x -d 2` running before starting the defrag, and capture its output through the end of the defrag process?

Those are dedicated nodes, but their disk throughput wasn't enough to complete an eight-gig defrag in 30 seconds. It's just a tad too tight at our scale and started to go beyond 30 seconds (though staying under 40). We have just shy of 100k pods and 100 nodes on these clusters, plus a lot of ConfigMaps and the like, which balloons etcd quite a bit, so we do expect it to need a more-than-average amount of time to do things.
We also had to rescue a cluster this weekend that ended up losing all three control-plane nodes around the same time, because the rke2-server service was stuck timing out while waiting for the defrag. Our quick fix was to rebuild the rke2 binary with a higher timeout, as requested here, so the defrag could complete and we could get our control plane back up.
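(For reference, the emergency rebuild described above boils down to a one-constant patch. A minimal sketch follows, assuming the deadline lives in a package-level constant in pkg/etcd/etcd.go; the constant name and surrounding code here are illustrative, not the actual k3s/rke2 source.)

```go
// Sketch of the kind of one-line patch described above; names are illustrative.
package etcd

import "time"

// Raised from the stock 30 * time.Second so an 8+ GB datastore can finish
// defragmenting before the deadline fires.
const defragTimeout = 5 * time.Minute
```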
Either making it configurable or raising the default would be helpful. If there's no specific reason for 30 seconds, I'd suggest making it 5 minutes to allow enough time in all kinds of resource-constrained situations. Some control planes run on VMs with non-guaranteed performance, so hitting the 30-second limit is more than possible with a smaller cluster as well.
I'm not exactly sure which project is responsible for which parts of the RKE2 stack, so we may need to open another issue on the RKE2 side so that it doesn't fail to start when the defrag doesn't complete in the expected time; completing a defrag shouldn't be a hard requirement for startup.
Thanks for the additional information about your environment. Can you provide any of the specifically requested information regarding the performance of the nodes in question? While simply changing the timeout, or making it configurable, is certainly an option, we'd also like to better understand what performance profiles make this necessary in the first place.
What we'll probably do is move the defrag out from under the etcd status check's context deadline, so that the 30-second timeout does not affect the defrag and alarm-clear operations. At that point we can evaluate whether a timeout on the defrag is even necessary.
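To illustrate the idea (a sketch under assumptions, not the actual k3s implementation): the status check could keep its short deadline while the defrag and alarm-clear calls get their own, longer context via the etcd v3 client. The 5-minute value, function name, and overall shape below are assumptions.

```go
package etcdmaint

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// defragAndClearAlarms keeps the cheap status check on a short deadline but
// gives the defrag and alarm-clear calls their own, longer context.
func defragAndClearAlarms(ctx context.Context, c *clientv3.Client, endpoint string) error {
	// Short deadline for the status call only (30s matches the current check).
	statusCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	status, err := c.Status(statusCtx, endpoint)
	if err != nil {
		return err
	}
	log.Printf("etcd db size before defrag: %d bytes", status.DbSize)

	// Separate, much longer deadline for the defrag itself; 5m is an assumption.
	defragCtx, cancelDefrag := context.WithTimeout(ctx, 5*time.Minute)
	defer cancelDefrag()
	if _, err := c.Defragment(defragCtx, endpoint); err != nil {
		return err
	}

	// Disarm any outstanding alarms (e.g. NOSPACE) now that space is reclaimed;
	// in the v3 client, an empty AlarmMember disarms all active alarms.
	_, err = c.AlarmDisarm(defragCtx, &clientv3.AlarmMember{})
	return err
}
```

With the contexts separated like this, a slow defrag can no longer cause the short status check (and therefore startup) to fail.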
If anyone with a large datastore affected by this issue can provide the info requested at https://github.com/k3s-io/k3s/issues/11122#issuecomment-2423000976, that would be appreciated.
Related Internal Requests: SURE-9222 SURE-9233
Ah, sorry for not getting back to you sooner. We replaced the nodes with even more powerful ones (though both the old and new nodes have NVMe drives) and got the defrag times down to around 17 seconds.
Thanks for landing a fix!
The hardcoded 30s timeout is too low for defragmentation to complete on larger (8+G) etcd shards:
https://github.com/k3s-io/k3s/blob/c0d661b334bc3cbe30d80e5aab6af4b92d3eb503/pkg/etcd/etcd.go#L58
I realize we're well beyond the "standard" performance envelope of etcd, but it works just fine for us across multiple clusters where etcd is 8G or even 10G in size (our quota limit is 16G). Literally the only thing making this near-impossible to scale beyond is this hardcoded timeout; would it be possible to make it configurable? I can open a PR if this would be of interest...
Note we're using rke2 (1.26.15), but this is the timeout we're hitting.
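For illustration, one possible shape for "make it configurable": the environment variable `K3S_ETCD_DEFRAG_TIMEOUT` and the helper below are hypothetical, not an existing k3s/rke2 option.

```go
package etcdmaint // hypothetical package; not the actual k3s source layout

import (
	"os"
	"time"
)

// defaultDefragTimeout mirrors the current hardcoded 30s value.
const defaultDefragTimeout = 30 * time.Second

// defragTimeout returns the timeout to use for etcd defragmentation.
// K3S_ETCD_DEFRAG_TIMEOUT is a hypothetical override, e.g. "5m".
func defragTimeout() time.Duration {
	if v := os.Getenv("K3S_ETCD_DEFRAG_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
	}
	return defaultDefragTimeout
}
```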