siderolabs / cluster-api-control-plane-provider-talos

A control plane provider for CAPI + Talos
Mozilla Public License 2.0

Controlplane machine shows Ready even while etcd is failing #203

Open bitnik opened 4 days ago

bitnik commented 4 days ago

Hello, first of all thanks a lot for your work!

I am quite new to both Cluster API and Talos. While I was playing around with MachineHealthChecks, I realised that a control plane node shows all conditions as True even though it can't join etcd and is not ready.

There are 3 CP nodes:

talosctl get members -n 10.1.0.6
NODE       NAMESPACE   TYPE     ID                     VERSION   HOSTNAME               MACHINE TYPE   OS               ADDRESSES
10.1.0.6   cluster     Member   test-cp-v1-8-3-56nx5   2         test-cp-v1-8-3-56nx5   controlplane   Talos (v1.8.3)   ["10.1.0.4"]
10.1.0.6   cluster     Member   test-cp-v1-8-3-5lvk7   1         test-cp-v1-8-3-5lvk7   controlplane   Talos (v1.8.3)   ["10.1.0.6"]
10.1.0.6   cluster     Member   test-cp-v1-8-3-8sqbv   1         test-cp-v1-8-3-8sqbv   controlplane   Talos (v1.8.3)   ["10.1.0.7"]

There are 2 etcd members:

talosctl etcd members -n 10.1.0.6
NODE       ID                 HOSTNAME               PEER URLS               CLIENT URLS             LEARNER
10.1.0.6   392559fe4b474923   test-cp-v1-8-3-56nx5   https://10.1.0.4:2380   https://10.1.0.4:2379   false
10.1.0.6   bc03d3269012afce   test-cp-v1-8-3-5lvk7   https://10.1.0.6:2380   https://10.1.0.6:2379   false
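The mismatch between the two outputs (three cluster members but only two etcd members) is itself a usable detection signal. A minimal sketch of that comparison, with the member lists hard-coded from the outputs above (in practice they would come from `talosctl get members` and `talosctl etcd members`):

```python
# Detect control plane nodes that are cluster members but absent from etcd.
# Hostnames are hard-coded from the talosctl outputs above, for illustration.
cluster_members = {
    "test-cp-v1-8-3-56nx5",
    "test-cp-v1-8-3-5lvk7",
    "test-cp-v1-8-3-8sqbv",
}
etcd_members = {
    "test-cp-v1-8-3-56nx5",
    "test-cp-v1-8-3-5lvk7",
}

# Any node in the cluster member list but not in the etcd member list
# is a control plane node whose etcd member is gone.
missing_from_etcd = sorted(cluster_members - etcd_members)
print(missing_from_etcd)
```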

The etcd logs of test-cp-v1-8-3-8sqbv are full of the following entries:

10.1.0.7: {"level":"error","ts":"2024-11-26T16:38:50.279278Z","caller":"etcdserver/server.go:2378","msg":"Validation on configuration change failed","shouldApplyV3":false,"error":"membership: too many learner members in cluster","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyConfChange\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2378\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).apply\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:2247\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyEntries\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1462\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).applyAll\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1277\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).run.func8\n\tgo.etcd.io/etcd/server/v3/etcdserver/server.go:1149\ngo.etcd.io/etcd/pkg/v3/schedule.(*fifo).run\n\tgo.etcd.io/etcd/pkg/v3@v3.5.16/schedule/schedule.go:157"}
10.1.0.7: {"level":"info","ts":"2024-11-26T16:38:50.279320Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"4986d1fdddfec87b switched to configuration voters=(13547904266639945678) learners=(11059204728065492980)"}
10.1.0.7: {"level":"warn","ts":"2024-11-26T16:38:50.279745Z","caller":"etcdserver/server.go:1154","msg":"server error","error":"the member has been permanently removed from the cluster"}

etcd logs of the leader:

10.1.0.6: {"level":"warn","ts":"2024-11-26T16:51:01.859751Z","caller":"rafthttp/http.go:394","msg":"rejected stream from remote peer because it was removed","local-member-id":"bc03d3269012afce","remote-peer-id-stream-handler":"bc03d3269012afce","remote-peer-id-from":"4986d1fdddfec87b"}

Because the member has been permanently removed, it can't rejoin and is rejected again and again. This happens occasionally during my tests, and so far the only fix has been to replace the node. For that purpose I wanted to configure a MachineHealthCheck, but then I realized the machine is never detected as unhealthy because all conditions show True:
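For reference, the kind of MachineHealthCheck I had in mind looks roughly like this (a sketch only; the selector label, timeouts, and names are assumptions, not a tested config):

```yaml
# Illustrative sketch: names, label selector, and timeouts are assumptions.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: test-cp-unhealthy
  namespace: test
spec:
  clusterName: test
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    # These refer to Node conditions; they never fire here because
    # the broken node still reports everything as True.
    - type: Ready
      status: "False"
      timeout: 300s
    - type: Ready
      status: Unknown
      timeout: 300s
```

A check like this only triggers when a condition is False or Unknown for longer than the timeout, which never happens in this situation.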

kubectl get Machine -n test test-cp-7bqxt

NAME            CLUSTER   NODENAME               PROVIDERID          PHASE     AGE   VERSION
test-cp-7bqxt   test      test-cp-v1-8-3-8sqbv   hcloud://56685282   Running   56m   v1.30.7

kubectl get Machine -n test test-cp-7bqxt -o json | jq '.status.conditions'

[
  {
    "lastTransitionTime": "2024-11-26T15:52:22Z",
    "status": "True",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2024-11-26T15:50:29Z",
    "status": "True",
    "type": "BootstrapReady"
  },
  {
    "lastTransitionTime": "2024-11-26T15:54:43Z",
    "status": "True",
    "type": "HealthCheckSucceeded"
  },
  {
    "lastTransitionTime": "2024-11-26T15:52:22Z",
    "status": "True",
    "type": "InfrastructureReady"
  },
  {
    "lastTransitionTime": "2024-11-26T15:55:06Z",
    "status": "True",
    "type": "NodeHealthy"
  }
]
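To make the problem concrete: an MHC-style filter over these conditions finds nothing to act on, because every status is True. A minimal sketch using the conditions above:

```python
# MHC-style filter over the Machine conditions shown above: a condition
# counts as unhealthy if its status is False or Unknown. None are, so the
# etcd failure is invisible at this level.
conditions = [
    {"type": "Ready", "status": "True"},
    {"type": "BootstrapReady", "status": "True"},
    {"type": "HealthCheckSucceeded", "status": "True"},
    {"type": "InfrastructureReady", "status": "True"},
    {"type": "NodeHealthy", "status": "True"},
]

unhealthy = [c["type"] for c in conditions if c["status"] != "True"]
print(unhealthy)  # empty: nothing reflects the failing etcd member
```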
bitnik commented 2 days ago

Please note that MachineHealthChecks currently only support Machines that are owned by a MachineSet or a KubeadmControlPlane.

I missed this important part in the documentation. Still, it is confusing why the Machine shows Ready as True.