control-plane node stops working after ~1 week: etcd panic: freepages: failed to get all reachable pages

Bug Report

I've set up a bare-metal k8s cluster of 3 Intel NUCs, with 1 combined control-plane/worker node and 2 worker nodes. Installation was flawless and the cluster ran fine for 1 week. Then the control-plane node broke, leaving the cluster unusable. This has happened twice in the last 2 weeks (after a fresh reinstall of the etcd node).

I've checked the resources and there is plenty of CPU, memory, and disk available.

Description

etcd service

ID                    etcd
STATE                 Waiting
HEALTH                Fail
LAST HEALTH MESSAGE   context deadline exceeded
EVENTS                [Waiting]: Error running Containerd(etcd), going to restart forever: task "etcd" failed: exit code 2 (5s ago)
                      [Running]: Started task etcd (PID 173095) for container etcd (5s ago)

etcd logs


goroutine 149 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2()
go.etcd.io/bbolt@v1.3.9/db.go:1202 +0x8d
created by go.etcd.io/bbolt.(*DB).freepages in goroutine 148
go.etcd.io/bbolt@v1.3.9/db.go:1200 +0x1e5
{"level":"info","ts":"2024-09-26T10:05:07.659541Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_CIPHER_SUITES","variable-value":"TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"}
{"level":"warn","ts":"2024-09-26T10:05:07.659783Z","caller":"embed/config.go:679","msg":"Running http and grpc server on single port. This is not recommended for production."}
{"level":"info","ts":"2024-09-26T10:05:07.659807Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["/usr/local/bin/etcd","--advertise-client-urls=https://10.10.40.90:2379","--auto-tls=false","--cert-file=/system/secrets/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--experimental-compact-hash-check-enabled=true","--experimental-initial-corrupt-check=true","--experimental-watch-progress-notify-interval=5s","--key-file=/system/secrets/etcd/server.key","--listen-client-urls=https://0.0.0.0:2379","--listen-peer-urls=https://0.0.0.0:2380","--name=k8s-talos-lab-cpw01","--peer-auto-tls=false","--peer-cert-file=/system/secrets/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/system/secrets/etcd/peer.key","--peer-trusted-ca-file=/system/secrets/etcd/ca.crt","--trusted-ca-file=/system/secrets/etcd/ca.crt"]}
{"level":"info","ts":"2024-09-26T10:05:07.659902Z","caller":"etcdmain/etcd.go:94","msg":"detected default host for advertise","host":"10.10.40.90"}
{"level":"info","ts":"2024-09-26T10:05:07.659969Z","caller":"etcdmain/etcd.go:116","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
{"level":"warn","ts":"2024-09-26T10:05:07.659992Z","caller":"embed/config.go:679","msg":"Running http and grpc server on single port. This is not recommended for production."}
{"level":"info","ts":"2024-09-26T10:05:07.660013Z","caller":"embed/etcd.go:127","msg":"configuring peer listeners","listen-peer-urls":["https://0.0.0.0:2380"]}
{"level":"info","ts":"2024-09-26T10:05:07.660074Z","caller":"embed/etcd.go:494","msg":"starting with peer TLS","tls-info":"cert = /system/secrets/etcd/peer.crt, key = /system/secrets/etcd/peer.key, client-cert=, client-key=, trusted-ca = /system/secrets/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":["TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305","TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305"]}
{"level":"info","ts":"2024-09-26T10:05:07.66056Z","caller":"embed/etcd.go:135","msg":"configuring client listeners","listen-client-urls":["https://0.0.0.0:2379"]}
{"level":"info","ts":"2024-09-26T10:05:07.660698Z","caller":"embed/etcd.go:308","msg":"starting an etcd server","etcd-version":"3.5.13","git-sha":"c9063a0dc","go-version":"go1.21.8","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"k8s-talos-lab-cpw01","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"max-wals":5,"max-snapshots":5,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://10.10.40.90:2380"],"listen-peer-urls":["https://0.0.0.0:2380"],"advertise-client-urls":["https://10.10.40.90:2379"],"listen-client-urls":["https://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":true,"corrupt-check-time-interval":"0s","compact-check-time-enabled":true,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
panic: freepages: failed to get all reachable pages (the first key[0]=(hex)000000000045b5c65f0000000000000000 on leaf page(6933) needs to be >= the key in the ancestor (000000000045b5c65f0000000000000000000000000045b5c85f0000000000000000000000000045b5ca5f0000000000000000000000000045b5cc5f0000000000000000000000000045b5ce5f0000000000000000000000000045b5d05f0000000000000000000000000045b5d25f0000000000000000000000000045b5d45f0000000000000000000000000045b5d65f0000000000000000000000000045b5d85f000000000000). Stack: [5983 5514 6933])
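
To make the keys in the panic message easier to read, here is a minimal Go sketch that decodes them. It assumes the keys come from etcd's MVCC key bucket, where a revision key is 17 bytes: an 8-byte big-endian main revision, the byte `_` (0x5f), and an 8-byte big-endian sub revision. `Revision` and `DecodeRevisions` are hypothetical helpers written for this issue, not part of etcd or bbolt, and tombstone keys (which carry an extra trailing byte) are not handled.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// revLen is the assumed length of an etcd MVCC revision key:
// 8-byte main revision + '_' + 8-byte sub revision.
const revLen = 8 + 1 + 8

// Revision is a hypothetical helper type for this sketch.
type Revision struct {
	Main, Sub uint64
}

// DecodeRevisions splits a hex-encoded key into 17-byte revision chunks.
// It returns the decoded revisions and the number of trailing bytes that
// do not form a complete chunk.
func DecodeRevisions(hexKey string) ([]Revision, int, error) {
	raw, err := hex.DecodeString(hexKey)
	if err != nil {
		return nil, 0, err
	}
	var revs []Revision
	for len(raw) >= revLen {
		// Each chunk must have the '_' separator at offset 8.
		if raw[8] != '_' {
			return revs, len(raw), fmt.Errorf("byte 8 of chunk is %#x, expected '_'", raw[8])
		}
		revs = append(revs, Revision{
			Main: binary.BigEndian.Uint64(raw[0:8]),
			Sub:  binary.BigEndian.Uint64(raw[9:17]),
		})
		raw = raw[revLen:]
	}
	return revs, len(raw), nil
}

func main() {
	// First key on leaf page 6933, copied from the panic message.
	revs, rest, err := DecodeRevisions("000000000045b5c65f0000000000000000")
	fmt.Println(revs, rest, err)
}
```

Under that assumption, the first key on leaf page 6933 decodes to main revision 4568518, sub revision 0. The ancestor key in the panic is noticeably longer than a single 17-byte revision; passing it to the same function shows how many complete revision-shaped chunks it contains and how many bytes are left over.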



### Logs

[support.zip](https://github.com/user-attachments/files/17146845/support.zip)

### Environment

- Talos version: 1.7.6
- Using Cilium as CNI

siderolabs / talos

control-plane node stops working after ~1 week: etcd panic: freepages: failed to get all reachable pages #9381
