openshift / cluster-etcd-operator

Operator to manage the lifecycle of the etcd members of an OpenShift cluster
Apache License 2.0

ETCD-668: add etcd-backup-server readiness probe #1335

Open Elbehery opened 2 weeks ago

Elbehery commented 2 weeks ago

This PR adds a readiness probe to the `etcd-backup-server` container.

resolves https://issues.redhat.com/browse/ETCD-668

openshift-ci-robot commented 2 weeks ago

@Elbehery: This pull request references ETCD-668 which is a valid jira issue.

In response to [this](https://github.com/openshift/cluster-etcd-operator/pull/1335):

>This PR adds *readiness probe* to the `etcd-backup-server` container.
>
>resolves https://issues.redhat.com/browse/ETCD-668

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-etcd-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci[bot] commented 2 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Elbehery

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/openshift/cluster-etcd-operator/blob/master/OWNERS)~~ [Elbehery]

Approvers can indicate their approval by writing `/approve` in a comment

Approvers can cancel approval by writing `/approve cancel` in a comment

Elbehery commented 2 weeks ago

/retest-required

Elbehery commented 2 weeks ago

The current implementation does not work: the liveness probe fails whenever the etcd-backup-server is disabled, which is the default. As a result, the installer pod will always fail:

{Operator degraded (StaticPods_Error):

StaticPodsDegraded: pod/etcd-ip-10-0-26-178.us-east-2.compute.internal container "etcd-backup-server" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd-backup-server pod=etcd-ip-10-0-26-178.us-east-2.compute.internal_openshift-etcd(c69a15af465abb74b4abb4eb99a96b1c)

StaticPodsDegraded: pod/etcd-ip-10-0-64-201.us-east-2.compute.internal container "etcd-backup-server" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd-backup-server pod=etcd-ip-10-0-64-201.us-east-2.compute.internal_openshift-etcd(03e9ef5bb03d08bf7a8b34889b9d2658)

Operator degraded (StaticPods_Error):

StaticPodsDegraded: pod/etcd-ip-10-0-26-178.us-east-2.compute.internal container "etcd-backup-server" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd-backup-server pod=etcd-ip-10-0-26-178.us-east-2.compute.internal_openshift-etcd(c69a15af465abb74b4abb4eb99a96b1c)

StaticPodsDegraded: pod/etcd-ip-10-0-64-201.us-east-2.compute.internal container "etcd-backup-server" is waiting: CrashLoopBackOff: back-off 5m0s restarting failed container=etcd-backup-server pod=etcd-ip-10-0-64-201.us-east-2.compute.internal_openshift-etcd(03e9ef5bb03d08bf7a8b34889b9d2658)}

see prow

To mitigate this issue, the liveness probe should be added to the staticPodManifest alongside the BackupVars ...

Elbehery commented 2 weeks ago

/label tide/merge-method-squash

openshift-ci[bot] commented 2 weeks ago

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | 5095fcb0b9fa2b341c7295f3b5639b9742d0fb97 | link | false | `/test e2e-metal-ovn-ha-cert-rotation-shutdown`
ci/prow/e2e-aws-etcd-certrotation | 5095fcb0b9fa2b341c7295f3b5639b9742d0fb97 | link | false | `/test e2e-aws-etcd-certrotation`
ci/prow/e2e-aws-etcd-recovery | 5095fcb0b9fa2b341c7295f3b5639b9742d0fb97 | link | false | `/test e2e-aws-etcd-recovery`
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | 5095fcb0b9fa2b341c7295f3b5639b9742d0fb97 | link | false | `/test e2e-metal-ovn-sno-cert-rotation-shutdown`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).