jimsalterjrs / sanoid

These are policy-driven snapshot management and replication tools which use OpenZFS for underlying next-gen storage. (Btrfs support plans are shelved unless and until btrfs becomes reliable.)
http://www.openoid.net/products/
GNU General Public License v3.0

--monitor-health doesn't catch a few bad situations #784

Open darkpixel opened 1 year ago

darkpixel commented 1 year ago

Just a heads up that the --monitor-health flag doesn't catch all bad situations:

root@uspdxopnas01:~# sanoid --monitor-health
OK ZPOOL tank : ONLINE {Size:8.72T Free:4.89T Cap:43%} , OK ZPOOL rpool : ONLINE {Size:222G Free:204G Cap:8%} 
root@uspdxopnas01:~# zpool status -v
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:02 with 0 errors on Mon Dec  5 19:01:03 2022
config:

    NAME                                                     STATE     READ WRITE CKSUM
    rpool                                                    ONLINE       0     0     0
      mirror-0                                               ONLINE       0     0     0
        ata-Samsung_SSD_883_DCT_240GB_S5HMNC0N404814E-part3  ONLINE       0     0     0
        ata-Samsung_SSD_883_DCT_240GB_S5HMNC0N404815Z-part3  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Mon Dec  5 19:00:01 2022
    1.32T scanned at 1.25G/s, 612G issued at 580M/s, 3.84T total
    0B repaired, 15.55% done, 01:37:44 to go
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        ata-INTEL_SSDSC2KG019T8_PHYG0221018F1P9DGN  ONLINE       0     0     0
        ata-INTEL_SSDSC2KG019T8_PHYG0221013Q1P9DGN  ONLINE       0     0     0
        ata-INTEL_SSDSC2KG019T8_PHYG022100991P9DGN  ONLINE       0     0     0
        ata-INTEL_SSDSC2KG019T8_PHYG022100ZS1P9DGN  ONLINE       0     0     0
        ata-INTEL_SSDSC2KG019T8_PHYG022104NG1P9DGN  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x50215>:<0x0>
        <0x4f82d>:<0x0>
        <0x4d462>:<0x0>
        <0x510aa>:<0x0>
        <0x510c0>:<0x0>
        <0x4a6c4>:<0x0>
        <0x500cb>:<0x0>
root@uspdxopnas01:~# 

I think it's only paying attention to the state: ONLINE field and to the per-device columns showing the drives are all working properly. It's not paying attention to the reported data corruption.

The whole reason I noticed this is that --monitor-snapshots briefly complained about snapshots being old. For some reason Sanoid took the snapshots, but ZFS flagged them as corrupt. I deleted them, which is why the "errors" section above doesn't mention them by name.
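
A rough sketch of the kind of extra check that would catch this (illustrative only, not something sanoid actually does): flag any pool whose "errors:" line in zpool status -v says anything other than "No known data errors".

# Illustrative only, not sanoid code: report pools whose "errors:" line
# indicates permanent data errors, regardless of the ONLINE state.
zpool status -v | awk '
    /^[[:space:]]*pool:/ { pool = $2 }
    /^[[:space:]]*errors:/ && !/No known data errors/ {
        print "CRIT: pool " pool " reports permanent data errors"
    }
'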

phreaker0 commented 1 year ago

@darkpixel --monitor-health does check for corruption, but only via the READ WRITE CKSUM columns. In most cases those counters increment when permanent errors occur, but I have myself seen permanent errors without any actual data loss because of code issues (https://github.com/openzfs/zfs/issues/12014).
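
Roughly speaking, a counter-based check looks at something like the following (an illustration of the idea, not sanoid's actual implementation). The <0x...> permanent-error list above never trips it, because every counter stays at zero.

# Illustration of the idea, not sanoid's implementation: only fire when a
# vdev's READ/WRITE/CKSUM counter is nonzero.
zpool status | awk '
    NF == 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|UNAVAIL)$/ &&
    ($3 != "0" || $4 != "0" || $5 != "0") {
        print "CRIT: " $1 " shows nonzero error counters"
    }
'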

darkpixel commented 1 year ago

That's a fun bug, @phreaker0. I ran into something similar, but a reboot and a scrub during normal operation usually fix it. If not, I just nuke the affected snapshot and re-scrub during normal operation to fix it.
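
For reference, that workaround boils down to something like this (the pool and snapshot names here are placeholders, not from the system above):

# Placeholder names; destroy whichever snapshot holds the corrupt blocks
# listed by `zpool status -v`, then scrub again so the error list can clear.
zpool scrub tank
zfs destroy tank/somedataset@autosnap_bad
zpool scrub tank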

phreaker0 commented 1 year ago

@darkpixel I needed to reboot and then scrub to fix the issue, but the errors would reappear after a few days. I recreated my pool, and now the problem is gone.