openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZED is eager to replace glitching drives #14457

Open pikrzysztof opened 1 year ago

pikrzysztof commented 1 year ago

Describe the feature you would like to see added to OpenZFS

I've seen HBAs and drives that glitch for a moment, and ZED/ZFS is eager to promote a spare and kick off a resilver. I'd like ZED to wait briefly to see whether the drive recovers before promoting a spare.

How will this feature improve OpenZFS?

This would help by avoiding unnecessary spare activations and resilvers when a drive drops out only momentarily.

Additional context

There is a similar effort going on, perhaps we could piggyback on https://github.com/openzfs/zfs/pull/13805 ?

Please see my zpool's events below for an example of this - the drive vanished for 0.084 seconds and ZED quickly did its job, but I'd like it to wait.

Sep  8 2022 19:43:33.124595428 ereport.fs.zfs.vdev.unknown
        class = "ereport.fs.zfs.vdev.unknown"
        ena = 0x693acfec55d01001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x56627282b162b109
                vdev = 0x9fc27810c3486199
        (end detector)
        pool = "data"
        pool_guid = 0x56627282b162b109
        pool_state = 0x0
        pool_context = 0x0
        pool_failmode = "wait"
        vdev_guid = 0x9fc27810c3486199
        vdev_type = "disk"
        vdev_path = "/dev/disk/by-ktname/ata-Samsung_SSD_860_EVO_2TB_S45KNWAK101511R-part1"
        vdev_ashift = 0x9
        vdev_complete_ts = 0xec69370839d25
        vdev_delta_ts = 0x243e4
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x0
        vdev_delays = 0x0
        parent_guid = 0x61048eccccf0e48a
        parent_type = "raidz"
        vdev_spare_paths = "/dev/disk/by-ktname/ata-Samsung_SSD_860_EVO_2TB_S45KNWAK101147B-part1" "/dev/disk/by-ktname/ata-Samsung_SSD_860_EVO_2TB_S45KNWAK101541Y-part1"
        vdev_spare_guids = 0x43e1fd37ade39185 0x1848c626b06f4856
        prev_state = 0x1
        time = 0x631a45e5 0x76d2ce4
        eid = 0xa605

Sep  8 2022 19:43:33.124595428 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool = "data"
        pool_guid = 0x56627282b162b109
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x9fc27810c3486199
        vdev_state = "UNAVAIL" (0x4)
        vdev_path = "/dev/disk/by-ktname/ata-Samsung_SSD_860_EVO_2TB_S45KNWAK101511R-part1"
        vdev_laststate = "ONLINE" (0x7)
        time = 0x631a45e5 0x76d2ce4
        eid = 0xa606

Sep  8 2022 19:43:33.208595959 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool = "data"
        pool_guid = 0x56627282b162b109
        pool_state = 0x0
        pool_context = 0x0
        vdev_guid = 0x9fc27810c3486199
        vdev_state = "ONLINE" (0x7)
        vdev_path = "/dev/disk/by-ktname/ata-Samsung_SSD_860_EVO_2TB_S45KNWAK101511R-part1"
        vdev_laststate = "UNAVAIL" (0x4)
        time = 0x631a45e5 0xc6eebf7
        eid = 0xa607
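The two statechange events above (eid 0xa606 and 0xa607) are 0.084 seconds apart. The grace-period behavior requested here could be sketched as a simple debounce rule - this is a hypothetical illustration, not ZED's actual logic, and the `GRACE_PERIOD_S` tunable and `should_promote_spare` helper are assumptions for the sake of the example:

```python
# Hypothetical debounce sketch -- not ZED's actual logic.
# A spare is promoted only if the vdev stays UNAVAIL for longer
# than a configurable grace period.

GRACE_PERIOD_S = 5.0  # hypothetical tunable


def should_promote_spare(unavail_ts, online_ts, grace=GRACE_PERIOD_S):
    """Return True if the vdev stayed UNAVAIL longer than `grace` seconds.

    unavail_ts: time (seconds) the statechange to UNAVAIL was observed
    online_ts:  time the vdev came back ONLINE, or None if it never did
    """
    if online_ts is None:
        # Drive never recovered within the observation window.
        return True
    return (online_ts - unavail_ts) > grace


# Seconds-of-day timestamps taken from the two events above:
unavail = 19 * 3600 + 43 * 60 + 33.124595428  # eid 0xa606, ONLINE -> UNAVAIL
online = 19 * 3600 + 43 * 60 + 33.208595959   # eid 0xa607, UNAVAIL -> ONLINE

print(should_promote_spare(unavail, online))  # 0.084 s glitch -> False
```

Under such a rule, the 0.084-second glitch in the log would never trigger a spare, while a drive that stays gone past the grace period would still be replaced as it is today.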

allanjude commented 1 year ago

If you just want to ignore some types of events, you might want #14056, which lets you ask the kernel to retry before reporting the failure to ZFS.