samber / awesome-prometheus-alerts

🚨 Collection of Prometheus alerting rules
https://samber.github.io/awesome-prometheus-alerts/

Rule "Host RAID array got inactive" has misleading description #395

Open jlherren opened 9 months ago

jlherren commented 9 months ago

Rule 1.2.25 "Host RAID array got inactive" has a misleading description that does not match its expression:

Expression: (node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}

Description:

RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.

I had a disk causing issues in a RAID 1 array. I manually failed the disk (mdadm --fail), upon which rule 1.2.26 "Host RAID disk failure" correctly reported it. Then I removed the bad disk from the array (mdadm --remove), after which rule 1.2.26 no longer fired, because the disk was no longer in a failed state. But now rule 1.2.25 does not report anything either, since the array is still active and the server still fully operational. The array is merely degraded, which the rule doesn't actually check for, contrary to its description.

I suppose the description should be fixed and a new rule added to detect degraded arrays. I tried node_md_disks{state="active"} < node_md_disks_required, but it doesn't seem to return anything (I'm not so proficient in the query language).
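That comparison presumably matches nothing because of PromQL's one-to-one vector matching: the series on the left carry a state label that node_md_disks_required does not have, so no series pair up. A minimal sketch of a corrected expression, assuming the extra state label is the only mismatch:

node_md_disks{state="active"} < ignoring(state) node_md_disks_required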

My metrics (md0/md1/md2 are all RAID 1 arrays on the same two disks):
# HELP node_md_disks Number of active/failed/spare disks of device.
# TYPE node_md_disks gauge
node_md_disks{device="md0",state="active"} 1
node_md_disks{device="md0",state="failed"} 0
node_md_disks{device="md0",state="spare"} 0
node_md_disks{device="md1",state="active"} 1
node_md_disks{device="md1",state="failed"} 0
node_md_disks{device="md1",state="spare"} 0
node_md_disks{device="md2",state="active"} 1
node_md_disks{device="md2",state="failed"} 0
node_md_disks{device="md2",state="spare"} 0
# HELP node_md_disks_required Total number of disks of device.
# TYPE node_md_disks_required gauge
node_md_disks_required{device="md0"} 2
node_md_disks_required{device="md1"} 2
node_md_disks_required{device="md2"} 2
# HELP node_md_state Indicates the state of md-device.
# TYPE node_md_state gauge
node_md_state{device="md0",state="active"} 1
node_md_state{device="md0",state="check"} 0
node_md_state{device="md0",state="inactive"} 0
node_md_state{device="md0",state="recovering"} 0
node_md_state{device="md0",state="resync"} 0
node_md_state{device="md1",state="active"} 1
node_md_state{device="md1",state="check"} 0
node_md_state{device="md1",state="inactive"} 0
node_md_state{device="md1",state="recovering"} 0
node_md_state{device="md1",state="resync"} 0
node_md_state{device="md2",state="active"} 1
node_md_state{device="md2",state="check"} 0
node_md_state{device="md2",state="inactive"} 0
node_md_state{device="md2",state="recovering"} 0
node_md_state{device="md2",state="resync"} 0
guruevi commented 8 months ago

Fixed in #405