openshift / hypershift

Hyperscale OpenShift - clusters with hosted control planes
https://hypershift-docs.netlify.app
Apache License 2.0

Operations: Set minReadySeconds on HA deployments #1029

Closed: relyt0925 closed this issue 1 year ago

relyt0925 commented 2 years ago

We need to set minReadySeconds to control how fast HA deployments roll out (i.e., require pods to be ready for some time before the rollout continues).

deployment.apps/cluster-api minReadySeconds:
deployment.apps/cluster-policy-controller minReadySeconds:
deployment.apps/ignition-server minReadySeconds:
deployment.apps/konnectivity-agent minReadySeconds:
deployment.apps/kube-apiserver minReadySeconds:
deployment.apps/kube-controller-manager minReadySeconds:
deployment.apps/kube-scheduler minReadySeconds:
deployment.apps/oauth-openshift minReadySeconds:
deployment.apps/openshift-apiserver minReadySeconds:
deployment.apps/openshift-controller-manager minReadySeconds:
deployment.apps/openshift-oauth-apiserver minReadySeconds:
deployment.apps/packageserver minReadySeconds:
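
For reference, a minimal sketch of where the field sits in a Deployment spec, assuming an illustrative 15-second value (the actual values are what this issue is meant to decide):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-apiserver               # one of the deployments listed above
spec:
  replicas: 3
  # A newly updated pod must stay ready this long before it counts as
  # available and the rollout proceeds to the next pod.
  minReadySeconds: 15                # illustrative value, not a decided setting
  selector:
    matchLabels:
      app: kube-apiserver
  template:
    metadata:
      labels:
        app: kube-apiserver
    spec:
      containers:
      - name: kube-apiserver
        image: example.invalid/kube-apiserver:tag   # placeholder image
```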

sjenning commented 2 years ago

Default minReadySeconds is 0. What problem is this solving exactly? What value should it be?

I'm inclined to change any problem component to not report ready until it is actually ready, assuming that is the issue.
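
For illustration only, "not reporting ready until actually ready" would typically be done with a container-level readiness probe like the one below; the endpoint path, port, and timings are assumptions, not existing HyperShift settings:

```yaml
containers:
- name: example-component                 # hypothetical component
  image: example.invalid/component:tag    # placeholder image
  readinessProbe:
    httpGet:
      path: /readyz                       # assumed health endpoint
      port: 8443
      scheme: HTTPS
    initialDelaySeconds: 5
    periodSeconds: 10
    failureThreshold: 3
```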

relyt0925 commented 2 years ago

We will get our expert ops team to add comments on this one covering some of the chosen values and their purpose. The main thing, though, is to control the velocity of the rollout.

relyt0925 commented 2 years ago

@rtheis (sorry for all the pings): Do you have any general background on why 15 seconds was chosen for minReadySeconds on most of the deployments? I know part of it came from the fact that, at scale, components restarting quickly one after another put a lot of load on the management API server.

That would help kick off further discussion on this one.

rtheis commented 2 years ago

Hi folks. Here is the general guidance that we give our teams with respect to readiness. We prefer probes as the primary means of determining readiness. However, we also use minReadySeconds to ensure stability and availability during rollouts, so that a rollout doesn't proceed so quickly that an app update results in all pods crashing. We also use it to protect our managed environment from pod restart storms.

Microservices that have a readiness probe should set minReadySeconds to 15, and those without a probe should set it to 30, to help keep the rollout of the microservice pods controlled. The general goal is for a microservice pod to report ready to Kubernetes only once it has completed initialization and is stable enough to complete tasks.
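
As a rough sketch of that guidance (only the 15/30 values come from the comment above; names, images, and probe endpoints are illustrative):

```yaml
# With a readiness probe: minReadySeconds of 15 (per the guidance above).
spec:
  minReadySeconds: 15
  template:
    spec:
      containers:
      - name: with-probe                  # hypothetical name
        image: example.invalid/app:tag    # placeholder image
        readinessProbe:
          httpGet:
            path: /readyz                 # assumed endpoint
            port: 8080
---
# Without a readiness probe: minReadySeconds of 30 (per the guidance above).
spec:
  minReadySeconds: 30
  template:
    spec:
      containers:
      - name: without-probe               # hypothetical name
        image: example.invalid/app:tag    # placeholder image
```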

My advice is for HyperShift control plane components to either use readiness probes (if available) or set minReadySeconds. Using both readiness probes and minReadySeconds is acceptable as well.

From an OpenShift perspective, the cluster policy controller has been troublesome for us. While working to handle its availability during rollouts, we hit https://github.com/kubernetes/kubernetes/issues/108266, which, until fixed, breaks one of the reasons that we use minReadySeconds.

I hope this helps.

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

rtheis commented 2 years ago

/remove-lifecycle stale

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

rtheis commented 2 years ago

/remove-lifecycle stale

openshift-bot commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

rtheis commented 2 years ago

/remove-lifecycle stale

openshift-bot commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

rtheis commented 1 year ago

/remove-lifecycle stale

openshift-bot commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

rtheis commented 1 year ago

/remove-lifecycle stale

openshift-bot commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 1 year ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

openshift-bot commented 1 year ago

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] commented 1 year ago

@openshift-bot: Closing this issue.

In response to [this](https://github.com/openshift/hypershift/issues/1029#issuecomment-1773536551):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue by commenting `/reopen`.
> Mark the issue as fresh by commenting `/remove-lifecycle rotten`.
> Exclude this issue from closing again by commenting `/lifecycle frozen`.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.