siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

control-plane node stops working after ~1 week: etcd panic: freepages: failed to get all reachable pages #9381

Closed remmen-io closed 2 weeks ago

remmen-io commented 2 weeks ago

Bug Report

I've set up a bare-metal k8s cluster on 3 Intel NUCs, with 1 combined control-plane/worker node and 2 worker nodes. Installation was flawless and the cluster ran fine for 1 week. Then the control-plane node broke, leaving the cluster unusable. This has happened twice in the last 2 weeks (the second time after a fresh reinstall of the etcd node).

I've checked the resources and there is plenty of CPU, memory, and disk available.

Description



### Logs

[support.zip](https://github.com/user-attachments/files/17146845/support.zip)

### Environment

- Talos version: 1.7.6
- Using Cilium as CNI

smira commented 2 weeks ago

This is an issue with etcd, not with Talos itself (so you can submit it to the etcd project directly).

As far as I know, there is no known software issue causing DB corruption in etcd at the moment, so this is most likely a hardware issue; see e.g. https://github.com/etcd-io/etcd/issues/10722

If you run a single controlplane node cluster, you certainly take the risk of losing the only controlplane node. In that case, etcd backups are mandatory, and Talos documents the procedure to recover etcd after member failures.
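
For the backup side of that advice, a minimal sketch of a periodic snapshot helper might look like the following. It assumes `talosctl` is on the `PATH` and already configured with a valid talosconfig; the node address, output directory, and function names are hypothetical and only for illustration. It relies on the `talosctl etcd snapshot <path>` command, which saves an etcd snapshot to a local file.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_command(node: str, out_dir: str, now: datetime) -> list[str]:
    """Build the talosctl argv for one etcd snapshot; runs nothing itself."""
    # Timestamped file name so repeated runs do not overwrite older snapshots.
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = Path(out_dir) / f"etcd-{stamp}.snapshot"
    return ["talosctl", "-n", node, "etcd", "snapshot", str(target)]


def take_snapshot(node: str, out_dir: str) -> None:
    """Run the snapshot; raises CalledProcessError if talosctl exits non-zero."""
    cmd = snapshot_command(node, out_dir, datetime.now(timezone.utc))
    subprocess.run(cmd, check=True)


# Example call (placeholder node IP and directory, adjust to your cluster):
# take_snapshot("10.0.0.2", "/var/backups/etcd")
```

Run from cron or a systemd timer on a machine outside the cluster, this keeps a history of snapshots to restore from; the restore itself should follow the disaster recovery procedure in the Talos documentation.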