openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

etcd restarts too often (compared to the kubeadm installation method) #1031

Closed lance5890 closed 1 year ago

lance5890 commented 1 year ago
  1. I found that etcd restarts too often compared to the kubeadm installation method.
  2. Maybe this has something to do with the health check: in OCP, the health check is conducted by a CEO (cluster-etcd-operator) function.

I will update this issue when I find more evidence.

lance5890 commented 1 year ago

@Elbehery @tjungblu

Elbehery commented 1 year ago

Thanks @lance5890 for filing this

Please give us more details whenever possible :)

lance5890 commented 1 year ago
  1. With a kubeadm installation, the etcd health check is conducted as follows, using the localhost metrics port 2381:

    (two screenshots; images not preserved)
  2. But in an OCP installation, the etcd health check is conducted by the etcd-health-monitor, which uses the WithQuorumRead func to check etcd health as follows:

https://github.com/openshift/cluster-etcd-operator/blob/7a8db9c7132dd171a3a9fd31504cedffc1ea5af5/pkg/cmd/monitor/monitor.go#L161-L171

  1. Using an etcd quorum read to check etcd health, as OCP does, is a very heavy operation. As we all know, an etcd quorum read is linearizable, so it depends heavily on cluster performance.
  2. If the OCP cluster uses HDDs (not SSDs), the health check will fail too often, and restarting etcd at that point will exacerbate cluster instability.
lance5890 commented 1 year ago

Maybe we should use a lightweight way to check etcd health, one that does not depend on cluster performance. @Elbehery @tjungblu

Elbehery commented 1 year ago

Any updates here?

lance5890 commented 1 year ago

Any updates here?

The OCP etcd health check is too heavy compared to the metrics-based check used by kubeadm.

Elbehery commented 1 year ago

@lance5890 To make this decision we need some performance evidence, IMHO.

cc @tjungblu @hasbro17 @dusk125

lance5890 commented 1 year ago

As release 4.12 no longer has a health-monitor container, I will close this issue and check.