red-hat-storage / ocs-ci

https://ocs-ci.readthedocs.io/en/latest/
MIT License

test_alert_triggered can't fire MDSCpuUsageHigh. Description shows we need to wait 6 hours to create the conditions #10591

Closed DanielOsypenko closed 2 weeks ago

DanielOsypenko commented 2 months ago

tests/functional/monitoring/prometheus/alerts/test_alert_mds_cpu_high_usage.py::TestMdsCpuAlerts::test_alert_triggered

Per the runbook https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHigh.md: to diagnose the alert, go to Workloads -> Pods, select the corresponding MDS pod and open the Metrics tab. You should be able to see the allocated and used CPU. By default, the alert is fired if the used CPU is 67% of the allocated CPU for 6 hours. If this is the case, take the steps mentioned in mitigation.
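In other words, the rule encodes "used CPU / allocated (requested) CPU > 0.67, sustained for 6 hours". A minimal Python sketch of that condition, with purely illustrative numbers (not taken from a cluster):

    # Illustrative values only; real numbers come from the pod metrics in the console.
    mds_cpu_used_cores = 2.1       # CPU currently consumed by the MDS pod
    mds_cpu_requested_cores = 3.0  # CPU request allocated to the MDS pod
    ALERT_THRESHOLD = 0.67         # alert condition: usage above 67% of the request
    ALERT_FOR_HOURS = 6            # ...held continuously for 6 hours

    utilisation = mds_cpu_used_cores / mds_cpu_requested_cores
    if utilisation > ALERT_THRESHOLD:
        print(
            f"MDS CPU utilisation {utilisation:.0%} exceeds {ALERT_THRESHOLD:.0%}; "
            f"the alert goes 'pending' now and fires after {ALERT_FOR_HOURS} hours"
        )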

We need to correct the test to adjust the Prometheus rule so that it waits less than 6 hours, as sketched below.
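A rough sketch of what such an adjustment could look like, patching the rule's "for" duration via oc. The PrometheusRule name, namespace, and indexes are assumptions, and (as noted further down in this thread) changing the shipped rule is not a recommended approach:

    import json
    import subprocess

    # Assumptions (not verified): the MDS CPU alert lives in the "prometheus-ceph-rules"
    # PrometheusRule in the "openshift-storage" namespace, and GROUP_IDX / RULE_IDX
    # point at the MDSCPUUsageHigh entry inside it.
    NAMESPACE = "openshift-storage"
    RULE_NAME = "prometheus-ceph-rules"
    GROUP_IDX = 0  # hypothetical index of the rule group
    RULE_IDX = 0   # hypothetical index of the MDSCPUUsageHigh rule

    patch = [
        {
            "op": "replace",
            "path": f"/spec/groups/{GROUP_IDX}/rules/{RULE_IDX}/for",
            "value": "10m",  # shortened from the default 6h purely for test purposes
        }
    ]
    subprocess.run(
        ["oc", "-n", NAMESPACE, "patch", "prometheusrule", RULE_NAME,
         "--type=json", "-p", json.dumps(patch)],
        check=True,
    )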

[2024-09-26T16:35:26.441Z] ____________________ TestMdsCpuAlerts.test_alert_triggered _____________________
[2024-09-26T16:35:26.441Z] 
[2024-09-26T16:35:26.441Z] self = <tests.functional.monitoring.prometheus.alerts.test_alert_mds_cpu_high_usage.TestMdsCpuAlerts object at 0x7f0004d1f6a0>
[2024-09-26T16:35:26.441Z] run_file_creator_io_with_cephfs = None
[2024-09-26T16:35:26.441Z] threading_lock = <unlocked _thread.RLock object owner=0 count=0 at 0x7f000411da20>
[2024-09-26T16:35:26.441Z] 
[2024-09-26T16:35:26.441Z]     @pytest.mark.polarion_id("OCS-5581")
[2024-09-26T16:35:26.441Z]     def test_alert_triggered(self, run_file_creator_io_with_cephfs, threading_lock):
[2024-09-26T16:35:26.441Z]         """
[2024-09-26T16:35:26.441Z]         This test case is to verify the alert for MDS cpu high usage
[2024-09-26T16:35:26.441Z]     
[2024-09-26T16:35:26.441Z]         Args:
[2024-09-26T16:35:26.441Z]         run_file_creator_io_with_cephfs: function to generate load on mds cpu to achieve "cpu utilisation >67%"
[2024-09-26T16:35:26.441Z]         threading_lock: to pass the threading lock in alert validation function
[2024-09-26T16:35:26.441Z]     
[2024-09-26T16:35:26.441Z]         """
[2024-09-26T16:35:26.441Z]         log.info(
[2024-09-26T16:35:26.441Z]             "File creation IO started in the background."
[2024-09-26T16:35:26.441Z]             " Script will look for MDSCPUUsageHigh  alert"
[2024-09-26T16:35:26.441Z]         )
[2024-09-26T16:35:26.441Z] >       assert active_mds_alert_values(threading_lock)
[2024-09-26T16:35:26.441Z] 
[2024-09-26T16:35:26.441Z] /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/tests/functional/monitoring/prometheus/alerts/test_alert_mds_cpu_high_usage.py:111: 
[2024-09-26T16:35:26.441Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-09-26T16:35:26.441Z] /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/tests/functional/monitoring/prometheus/alerts/test_alert_mds_cpu_high_usage.py:70: in active_mds_alert_values
[2024-09-26T16:35:26.441Z]     prometheus.check_alert_list(
[2024-09-26T16:35:26.441Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[2024-09-26T16:35:26.441Z] 
[2024-09-26T16:35:26.441Z] label = 'MDSCPUUsageHigh'
[2024-09-26T16:35:26.441Z] msg = 'Ceph metadata server pod (rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-68d96f78gtw7j) has high cpu usage'
[2024-09-26T16:35:26.441Z] alerts = [], states = ['pending'], severity = 'warning'
[2024-09-26T16:35:26.441Z] ignore_more_occurences = True
[2024-09-26T16:35:26.441Z] description = 'Ceph metadata server pod (rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-68d96f78gtw7j) has high cpu usage.\nPleas...e CPU request for the rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-68d96f78gtw7j pod as described in the runbook.'
[2024-09-26T16:35:26.441Z] runbook = 'https://github.com/openshift/runbooks/blob/master/alerts/openshift-container-storage-operator/CephMdsCpuUsageHigh.md'
[2024-09-26T16:35:26.441Z] 
[2024-09-26T16:35:26.441Z]     def check_alert_list(
[2024-09-26T16:35:26.441Z]         label,
[2024-09-26T16:35:26.441Z]         msg,
[2024-09-26T16:35:26.441Z]         alerts,
[2024-09-26T16:35:26.441Z]         states,
[2024-09-26T16:35:26.441Z]         severity="warning",
[2024-09-26T16:35:26.441Z]         ignore_more_occurences=True,
[2024-09-26T16:35:26.441Z]         description=None,
[2024-09-26T16:35:26.441Z]         runbook=None,
[2024-09-26T16:35:26.441Z]     ):
[2024-09-26T16:35:26.441Z]         """
[2024-09-26T16:35:26.441Z]         Check list of alerts that there are alerts with requested label and
[2024-09-26T16:35:26.441Z]         message for each provided state. If some alert is missing then this check
[2024-09-26T16:35:26.441Z]         fails.
[2024-09-26T16:35:26.441Z]     
[2024-09-26T16:35:26.441Z]         Args:
[2024-09-26T16:35:26.441Z]             label (str): Alert label
[2024-09-26T16:35:26.441Z]             msg (str): Alert message
[2024-09-26T16:35:26.441Z]             alerts (list): List of alerts to check
[2024-09-26T16:35:26.441Z]             states (list): List of states to check, order is important
[2024-09-26T16:35:26.441Z]             ignore_more_occurences (bool): If true then only one
[2024-09-26T16:35:26.441Z]                 occurrence of the alert with requested label, message and state is checked;
[2024-09-26T16:35:26.441Z]                 it is not verified whether there is more than one occurrence.
[2024-09-26T16:35:26.441Z]             description (str): Alert description
[2024-09-26T16:35:26.441Z]             runbook (str): Alert's runbook URL
[2024-09-26T16:35:26.441Z]     
[2024-09-26T16:35:26.441Z]         """
[2024-09-26T16:35:26.441Z]         target_alerts = [
[2024-09-26T16:35:26.441Z]             alert for alert in alerts if alert.get("labels").get("alertname") == label
[2024-09-26T16:35:26.441Z]         ]
[2024-09-26T16:35:26.441Z]         logger.info(f"Checking properties of found {label} alerts")
[2024-09-26T16:35:26.441Z]     
[2024-09-26T16:35:26.441Z]         for key, state in enumerate(states):
[2024-09-26T16:35:26.441Z]             found_alerts = [
[2024-09-26T16:35:26.441Z]                 alert
[2024-09-26T16:35:26.441Z]                 for alert in target_alerts
[2024-09-26T16:35:26.441Z]                 if alert["annotations"]["message"] == msg
[2024-09-26T16:35:26.441Z]                 and alert["annotations"]["severity_level"] == severity
[2024-09-26T16:35:26.441Z]                 and alert["state"] == state
[2024-09-26T16:35:26.441Z]             ]
[2024-09-26T16:35:26.441Z]             assert_msg = (
[2024-09-26T16:35:26.441Z]                 f"There was not found alert {label} with message: {msg}, "
[2024-09-26T16:35:26.441Z]                 f"severity: {severity} in state: {state}"
[2024-09-26T16:35:26.441Z]                 f"Alerts matched with alert name are {target_alerts}"
[2024-09-26T16:35:26.442Z]                 f"Alerts matched with given message, severity and state are {found_alerts}"
[2024-09-26T16:35:26.442Z]             )
[2024-09-26T16:35:26.442Z] >           assert found_alerts, assert_msg
[2024-09-26T16:35:26.442Z] E           AssertionError: There was not found alert MDSCPUUsageHigh with message: Ceph metadata server pod (rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-68d96f78gtw7j) has high cpu usage, severity: warning in state: pendingAlerts matched with alert name are []Alerts matched with given message, severity and state are []
[2024-09-26T16:35:26.442Z] 
[2024-09-26T16:35:26.442Z] /home/jenkins/workspace/qe-deploy-ocs-cluster/ocs-ci/ocs_ci/utility/prometheus.py:69: AssertionError
DanielOsypenko commented 2 months ago

Hi @nagendra202, please check this test. As I understand it, we cannot create the conditions for this alert within a test run. We need to find a way to alter the alert rule or remove the test from executions.

nagendra202 commented 2 months ago

Hi @DanielOsypenko, this test will pass if the alert is found in the 'pending' state once the CPU utilisation goes just above 67%. We cannot change the Prometheus rules to reduce the alert time, since that is not a recommended approach.
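For reference, that is what the failing assertion verifies. A sketch based on the check_alert_list call and signature visible in the traceback above; the alert payload here is fabricated to show the expected shape, whereas in the real test the list comes from the Prometheus alerts endpoint:

    from ocs_ci.utility import prometheus

    MSG = (
        "Ceph metadata server pod "
        "(rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-68d96f78gtw7j) "
        "has high cpu usage"
    )
    # Fabricated payload mirroring the fields check_alert_list inspects:
    # labels.alertname, annotations.message, annotations.severity_level, state.
    alerts = [
        {
            "labels": {"alertname": "MDSCPUUsageHigh"},
            "annotations": {"message": MSG, "severity_level": "warning"},
            "state": "pending",
        }
    ]
    prometheus.check_alert_list(
        label="MDSCPUUsageHigh",
        msg=MSG,
        alerts=alerts,
        states=["pending"],  # the test only needs the alert queued (pending), not firing
        severity="warning",
    )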

nagendra202 commented 2 months ago

With the help of this test we certify that the CPU alert feature is still available and that an alert will be queued (pending state) when the limit is crossed. At the same time we also verify the alert properties. This is the only test we have automated so far for this feature; we cannot automate the rest. We need at least this test to make sure the feature exists.

DanielOsypenko commented 2 months ago

Ack, thanks for the explanation. Previously I opened a BZ for the case where the active MDS node is down and the alert is not fired either. So, judging by the failure that happened here, we hit the same bug, but in a non-disruptive sunny-day scenario. Please correct me if I am wrong. Here is the monitoring suite run: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/42472/ I'd love to have your opinion before correcting the bug.

nagendra202 commented 2 weeks ago

No failures have been seen in past runs except once, due to a timing issue. Based on the previous comment, closing this issue as no fix is needed.