Closed remmen-io closed 2 weeks ago
This is an issue with etcd, not with Talos itself (so you can submit it to the etcd project directly).
As far as I know, there is no known software issue causing DB corruption in etcd at the moment, so this should be a hardware issue; see e.g. https://github.com/etcd-io/etcd/issues/10722
If you run a single controlplane node cluster, you certainly take the risk of losing the only controlplane node. In that case, etcd backups are mandatory, and Talos documents the procedure to recover etcd after member failures.
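For what it's worth, here is a minimal sketch of how a regular etcd backup could be scripted around `talosctl etcd snapshot`. The node IP, backup directory, and file naming scheme are assumptions for illustration only, and the helper names (`snapshot_command`, `take_snapshot`) are hypothetical, not part of Talos.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_command(node: str, backup_dir: Path, now: datetime) -> list[str]:
    """Build the talosctl call that saves an etcd snapshot to a timestamped file."""
    # The file naming scheme is an assumption; any unique name works.
    target = backup_dir / f"etcd-{now:%Y%m%dT%H%M%SZ}.snapshot"
    return ["talosctl", "--nodes", node, "etcd", "snapshot", str(target)]


def take_snapshot(node: str, backup_dir: Path) -> Path:
    """Run the snapshot against one control plane node and return the file path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    cmd = snapshot_command(node, backup_dir, datetime.now(timezone.utc))
    # check=True raises CalledProcessError if talosctl exits non-zero.
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


# Example (hypothetical node IP and directory), e.g. called from a daily cron job:
# take_snapshot("10.0.0.2", Path("/var/backups/etcd"))
```

This assumes `talosctl` is on the PATH and already configured with a valid talosconfig. Restoring from such a snapshot is a separate step covered by the Talos disaster recovery procedure mentioned above.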
Bug Report
I've set up a bare-metal k8s cluster out of 3 Intel NUCs with 1 control-plane/worker node and 2 worker nodes. Installation was flawless and the cluster ran just fine for 1 week. Then the control-plane node broke, resulting in an unusable cluster. This has happened twice in the last 2 weeks (the second time after a fresh reinstall of the etcd node).
I've checked the resources and there is plenty of CPU/MEMORY/DISK available.
Description
etcd service
etcd logs