Closed DanielOsypenko closed 2 weeks ago
Hi @nagendra202, please check this test. As I understand it, we cannot create the conditions for this alert. We need to find a way to alter the alert rule or remove the test from executions.
Hi @DanielOsypenko, this test will pass if the alert is found in the 'pending' state when CPU utilisation is just above 67%. We cannot change the Prometheus rules to reduce the alert time, since that is not a recommended approach.
With this test, we just certify that the CPU alert feature is still available and that an alert will be queued [pending state] when the limit is crossed. At the same time we also verify the alert properties. This is the only test we have automated so far for this feature; we cannot automate the rest. We need at least this test to make sure the feature exists.
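For anyone reading along, the 'pending' check described above can be sketched roughly like this. This is only an illustration, not the actual test code: `find_alerts` and the sample payload are hypothetical, the payload shape follows the Prometheus `/api/v1/alerts` response, and the alert name is taken from the runbook file name.

```python
def find_alerts(payload, alert_name, states=("pending", "firing")):
    """Return alerts from a Prometheus /api/v1/alerts payload that match
    the given alert name and are in one of the accepted states.

    Hypothetical helper for illustration; not part of the test suite.
    """
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        alert
        for alert in alerts
        if alert.get("labels", {}).get("alertname") == alert_name
        and alert.get("state") in states
    ]


# Made-up sample payload; the label values are illustrative only.
sample = {
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {"alertname": "CephMdsCpuUsageHigh", "severity": "warning"},
                "state": "pending",
            },
            {
                "labels": {"alertname": "SomeOtherAlert", "severity": "critical"},
                "state": "firing",
            },
        ]
    },
}

matches = find_alerts(sample, "CephMdsCpuUsageHigh", states=("pending",))
print(len(matches))
```

Accepting only 'pending' here mirrors the point above: the test does not wait for the alert to fire, it only confirms that the alert is queued and that its properties look right.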
Ack. Thanks for the explanation. Previously I opened a BZ for the case where the active MDS node is down; the alert does not fire there either. So, judging by the failure that happened here, we have hit the same bug, but in a non-disruptive sunny-day scenario. Please correct me if I am wrong. Here is the monitoring suite run - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/42472/ I'd love to have your opinion before correcting the bug.
No failures seen in past runs except once, due to a timing issue. Based on the previous comment, closing this issue as no fix is needed.
tests/functional/monitoring/prometheus/alerts/test_alert_mds_cpu_high_usage.py::TestMdsCpuAlerts::test_alert_triggered
https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHigh.md To diagnose the alert, click on the workloads->pods and select the corresponding MDS pod and click on the metrics tab. You should be able to see the allocated and used CPU. By default, the alert is fired if the used CPU is 67% of allocated CPU for 6 hours. If this is the case take the steps mentioned in mitigation.
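To make the 67% / 6 hours condition concrete, here is a rough sketch of how the alert state would evolve for a series of CPU samples. It is a simplification under stated assumptions: `alert_state` is a hypothetical function, the comparison operator (`>=`) is a guess, and the real rule is evaluated by Prometheus, not by code like this.

```python
SIX_HOURS = 6 * 3600  # hold time from the runbook, in seconds


def alert_state(samples, allocated_cores, threshold=0.67, hold_seconds=SIX_HOURS):
    """Return 'inactive', 'pending', or 'firing' as of the last sample.

    samples: list of (timestamp_seconds, used_cores), sorted by time.
    Hypothetical helper; the >= comparison is an assumption.
    """
    breach_start = None
    for timestamp, used_cores in samples:
        if used_cores / allocated_cores >= threshold:
            # Remember when the current uninterrupted breach began.
            if breach_start is None:
                breach_start = timestamp
        else:
            # Any sample below the threshold resets the hold timer.
            breach_start = None

    if breach_start is None:
        return "inactive"
    held_for = samples[-1][0] - breach_start
    return "firing" if held_for >= hold_seconds else "pending"


# Ten minutes above 67% of 3 allocated cores: pending, not yet firing.
short_breach = [(0, 0.5), (600, 2.1), (1200, 2.2)]
print(alert_state(short_breach, allocated_cores=3))
```

This is why a test that only loads the MDS for a short time can expect 'pending' at most: the alert moves to 'firing' only once the breach has held for the full six hours.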
We need to correct the test so that it adjusts the Prometheus rule to wait less than 6 hours.