pdf / zfs_exporter

Prometheus ZFS exporter
MIT License

[BUG] exporter not noticing write errors #44

Closed Eldiabolo21 closed 1 month ago

Eldiabolo21 commented 1 month ago

Hello!

First of all, thank you for your work and time; it's really the best (and best-maintained) zfs exporter out there!

One thing I noticed is that the exporter doesn't pick up the health warning when an unrecoverable error has occurred. For example:

# zpool status -x
  pool: tank3
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 12:14:12 with 0 errors on Sun Sep  8 12:38:13 2024
config:

        NAME                        STATE     READ WRITE CKSUM
        tank3                       ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x5000cca232273cd8  ONLINE       0     0     0
            wwn-0x5000cca2310cafe8  ONLINE       0     0     0
            wwn-0x5000cca232293c78  ONLINE       1     0     0
            wwn-0x5000cca2310c9184  ONLINE       0     0     0
            wwn-0x5000cca2310c9160  ONLINE       0     0     0
            wwn-0x5000cca2310c82e0  ONLINE       0     0     0
            wwn-0x5000cca2310c775c  ONLINE       0     0     0
            wwn-0x5000cca2322937c8  ONLINE       0     0     0

errors: No known data errors

Prometheus metrics:

zfs_pool_health{instance="192.168.16.12:9134", job="zfs", pool="tank3"} 0
zfs_pool_health{instance="192.168.16.12:9134", job="zfs", pool="virt"} 0
zfs_pool_health{instance="192.168.16.12:9134", job="zfs", pool="virt2"} 0

That's not exactly a problem, but it's still something that should be caught and possibly notified on. Any chance of setting the health value to 1 in these cases as well?

Cheers!


Edit: exporter version 2.3.1

pdf commented 1 month ago

The zfs_pool_health metric has a specific meaning that maps to the status as reported by ZFS, per the HELP text on the metric:

# HELP zfs_pool_health Health status code for the pool [0: ONLINE, 1: DEGRADED, 2: FAULTED, 3: OFFLINE, 4: UNAVAIL, 5: REMOVED, 6: SUSPENDED].
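As an illustrative sketch (not the exporter's actual code), that mapping boils down to translating the pool state string reported by ZFS into the numeric code above:

package main

import "fmt"

// healthCode maps a zpool state string to the numeric value exposed by the
// zfs_pool_health metric, per the HELP text above. Illustrative only; the
// exporter's real implementation may differ.
func healthCode(state string) (float64, bool) {
        codes := map[string]float64{
                "ONLINE": 0, "DEGRADED": 1, "FAULTED": 2, "OFFLINE": 3,
                "UNAVAIL": 4, "REMOVED": 5, "SUSPENDED": 6,
        }
        code, ok := codes[state]
        return code, ok
}

func main() {
        for _, s := range []string{"ONLINE", "DEGRADED", "SUSPENDED"} {
                if code, ok := healthCode(s); ok {
                        fmt.Printf("%s -> %v\n", s, code)
                }
        }
}

In the report above, tank3 is still ONLINE, so it maps to 0 even though one vdev has a non-zero READ counter.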

It would be nice to publish detailed per-vdev status, but unfortunately zpool status is one of the few commands that doesn't provide a machine-parseable output flag, so parsing its human-readable output would be somewhat brittle.
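For the sake of illustration, a rough sketch of what that parsing might look like in Go, relying on the English column layout (NAME STATE READ WRITE CKSUM), which is exactly the kind of assumption that makes it brittle:

package main

import (
        "bufio"
        "fmt"
        "os/exec"
        "strconv"
        "strings"
)

// vdevErrors holds the READ/WRITE/CKSUM counters for one line of the
// config section of `zpool status`.
type vdevErrors struct {
        name               string
        read, write, cksum uint64
}

// parseStatus scrapes per-vdev error counters out of `zpool status` text.
// It depends on the human-oriented column layout, which is not a stable
// interface; large counters may also be abbreviated (e.g. "1.2K"), which
// this sketch does not handle.
func parseStatus(out string) []vdevErrors {
        var result []vdevErrors
        inConfig := false
        scanner := bufio.NewScanner(strings.NewReader(out))
        for scanner.Scan() {
                trimmed := strings.TrimSpace(scanner.Text())
                switch {
                case strings.HasPrefix(trimmed, "NAME "):
                        inConfig = true
                        continue
                case trimmed == "" || strings.HasPrefix(trimmed, "errors:"):
                        inConfig = false
                }
                if !inConfig {
                        continue
                }
                fields := strings.Fields(trimmed)
                if len(fields) < 5 {
                        continue
                }
                read, err1 := strconv.ParseUint(fields[2], 10, 64)
                write, err2 := strconv.ParseUint(fields[3], 10, 64)
                cksum, err3 := strconv.ParseUint(fields[4], 10, 64)
                if err1 != nil || err2 != nil || err3 != nil {
                        continue // skip headings or unexpected layouts
                }
                result = append(result, vdevErrors{fields[0], read, write, cksum})
        }
        return result
}

func main() {
        out, err := exec.Command("zpool", "status").Output()
        if err != nil {
                fmt.Println("zpool status failed:", err)
                return
        }
        for _, v := range parseStatus(string(out)) {
                fmt.Printf("%s read=%d write=%d cksum=%d\n", v.name, v.read, v.write, v.cksum)
        }
}

Any change to the status layout, locale, or counter formatting would silently break this, which is why it hasn't been implemented.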

Duplicate of #5