prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

node_md_state did not capture "removed" state #2384

Open · levindecaro opened 2 years ago

levindecaro commented 2 years ago

Host operating system: output of uname -a

Linux sds-3 4.18.0-305.7.1.el8_4.x86_64 #1 SMP Tue Jun 29 21:55:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.3.1 (branch: HEAD, revision: a2321e7b940ddcff26873612bccdf7cd4c42b6b6)
  build user:       root@243aafa5525c
  build date:       20211205-11:09:49
  go version:       go1.17.3
  platform:         linux/amd64

node_exporter command line flags

/usr/local/bin/node_exporter --path.procfs=/proc --path.sysfs=/sys --collector.filesystem.ignored-mount-points="^/(dev|proc|sys|var/lib/docker/.+)($|/)" --collector.filesystem.ignored-fs-types="^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$" --no-collector.wifi

Are you running node_exporter in Docker?

no

What did you do that produced an error?

mdadm -D output

/dev/md125:
           Version : 1.0
     Creation Time : Mon Jul  5 19:40:20 2021
        Raid Level : raid1
        Array Size : 614336 (599.94 MiB 629.08 MB)
     Used Dev Size : 614336 (599.94 MiB 629.08 MB)
      Raid Devices : 2
     Total Devices : 1
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sun May 22 01:00:01 2022
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : sds-3:boot_efi  (local to host sds-3)
              UUID : 312be27c:732e4a9e:6b279d78:10cd6a6a
            Events : 177

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1       8       18        1      active sync   /dev/sdb2

What did you expect to see?

node_md_state{device="md125", instance="sds-3", job="sds-nodes", state="removed"}

What did you see instead?

"removed" state metric not yet implemented in node_md_state

discordianfish commented 2 years ago

Yeah, I see how this would be useful.

dswarbrick commented 1 year ago

@levindecaro Your expected metric would be inaccurate, because it's not the whole md125 array that has been removed, but rather just one of the component devices. From the output of your mdadm command, the md125 array is still functioning (and would continue to do so, since it's raid1 and still has one leg working).

What you need instead is a metric for the state of each individual component device, so you can see when one of them has been removed.

However, you could also have alerted on the condition you encountered with a node_md_disks{state="failed"} > 0 alerting rule. Alternatively, node_md_disks_required - node_md_disks{state="active"} > 0 would probably also do the trick (see the sketch below).
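As a minimal sketch, the two expressions above could be wired into a Prometheus alerting-rules file like the one below. The alert names, for: durations, and severity labels are placeholders; the ignoring(state) modifier is added because node_md_disks_required carries no state label, so the raw subtraction as quoted above would not match any series without it:

```yaml
groups:
  - name: mdraid
    rules:
      # Fires when any md array reports at least one failed component device.
      - alert: MdRaidDiskFailed
        expr: node_md_disks{state="failed"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'md array {{ $labels.device }} on {{ $labels.instance }} has a failed disk'

      # Fires when an array has fewer active disks than it requires,
      # which also covers the "removed" component device in this issue.
      - alert: MdRaidDegraded
        expr: node_md_disks_required - ignoring(state) node_md_disks{state="active"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'md array {{ $labels.device }} on {{ $labels.instance }} is degraded'
```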

Having said that, the procfs library's existing /proc/mdstat parsing masks some of the low-level details, which is why I have proposed a new direction in https://github.com/prometheus/procfs/pull/509.
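For reference, a degraded two-device raid1 like the one in this issue shows up in /proc/mdstat roughly as follows (reconstructed from the mdadm output above, not captured from the reporter's host). The [2/1] [_U] markers carry exactly the detail in question: two devices are required, one is present, and slot 0 is the missing one:

```
md125 : active raid1 sdb2[1]
      614336 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
```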