openshift / cluster-kube-controller-manager-operator

The kube-controller-manager operator installs and maintains the kube-controller-manager on a cluster
Apache License 2.0
43 stars 125 forks source link

Update PodDisruptionBudgetAtLimit alert #810

Closed sradco closed 2 months ago

sradco commented 3 months ago

Update the PodDisruptionBudgetAtLimit alert to not fire for kubevirt-disruption-budge, since they are not needed in this use case.

This PR is very important since the alerts fire for every namespace that is running VM, even tough it is not needed and is causing an alert fatigue.

Jira-url: https://issues.redhat.com/browse/CNV-33834

Signed-off-by: Shirly Radco sradco@redhat.com

ingvagabund commented 3 months ago

@sradco is there any card tracking this change?

sradco commented 3 months ago

/assign ingvagabund Hi Jan, I will appreciate your review. This is a high priority issue for OpenShift Virtualization that is causing a lot of noise for our users.

ingvagabund commented 3 months ago

Is there a customer case? A bug report?

sradco commented 3 months ago

Is there a customer case? A bug report?

Yes. updated also in the description now. https://issues.redhat.com/browse/CNV-33834.

atiratree commented 3 months ago

This has been also suggested in https://github.com/openshift/cluster-kube-controller-manager-operator/pull/583# , but was ultimately discarded in favor of silencing.

There is a need for more generic and complete approach. Alternative solution to eviction is being discussed in the Evacuation API KEP upstream.

atiratree commented 3 months ago

/hold

openshift-ci[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sradco Once this PR has been reviewed and has the lgtm label, please assign ingvagabund for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[OWNERS](https://github.com/openshift/cluster-kube-controller-manager-operator/blob/master/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
sradco commented 3 months ago

@atiratree Hi, This doesn't align with the effort to reduce alerts noise. We have an alert to address this issue with a dedicated runbook that explain how this should be addressed in the VMs use case, https://github.com/kubevirt/monitoring/blob/main/docs/runbooks/VMCannotBeEvicted.md.

An alert requires an action by the user, and the PodDisruptionBudgetAtLimit alert only adds confusion in this case, since it will not recommend the correct action, https://github.com/openshift/runbooks/blob/master/alerts/cluster-kube-controller-manager-operator/PodDisruptionBudgetAtLimit.md#mitigation. In our case the admin needs to set the evictionStrategy of the VMI to shutdown.

The user will be confused and bothered by this alert until he either opens a support case or finds the kbase article, which is not a reasonable alternative.

openshift-ci[bot] commented 3 months ago

@sradco: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
sradco commented 2 months ago

Closing this for now since we decided to try to silence the alert, using the silences API. It is the the only way other that updating the expression of the alert to actually impact the UI and not only the Alertmanager forwarder.