AeonJJohnson opened this issue 5 years ago
I've looked at zfs/cmd/zed/agents/zfs_retire.c and I cannot identify why the resilver to hot-spare doesn't execute.
Detailed zpool events: https://pastebin.com/13r4UDL6
I tried to reproduce this in master and got mixed results. When I force-faulted a drive with zpool offline -f, the spare did not kick in:
pool: tank
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: none requested
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
/home/hutter/zfs/file1 ONLINE 0 0 0
/home/hutter/zfs/file2 FAULTED 0 0 0 external device fault
spares
/home/hutter/zfs/file3 AVAIL
However, I did see the spare kick in on the auto_spare_001_pos test case, and also when I manually faulted the vdev by injecting errors with zinject:
pool: tank
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: resilvered 290K in 0 days 00:00:00 with 0 errors on Mon Jul 1 15:40:17 2019
config:
NAME STATE READ WRITE CKSUM
tank DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
spare-0 DEGRADED 0 0 0
/home/hutter/zfs/file1 FAULTED 0 1.28K 48 too many errors
/home/hutter/zfs/file3 ONLINE 0 0 0
/home/hutter/zfs/file2 ONLINE 0 0 0
spares
/home/hutter/zfs/file3 INUSE currently in use
Jul 01 15:40:36 zed[5897]: eid=1347 class=io pool_guid=0x8EEB0F154CDC4AF8 vdev_path=/home/hutter/zfs/file1
Jul 01 15:40:37 zed[5910]: eid=1348 class=io pool_guid=0x8EEB0F154CDC4AF8 vdev_path=/home/hutter/zfs/file1
Jul 01 15:40:37 zed[5923]: eid=1349 class=io pool_guid=0x8EEB0F154CDC4AF8 vdev_path=/home/hutter/zfs/file1
Jul 01 15:40:37 zed[5936]: eid=1350 class=io pool_guid=0x8EEB0F154CDC4AF8 vdev_path=/home/hutter/zfs/file1
Jul 01 15:40:38 zed[5949]: eid=1351 class=io pool_guid=0x8EEB0F154CDC4AF8 vdev_path=/home/hutter/zfs/file1
Jul 01 15:40:38 zed[5962]: eid=1352 class=io pool_guid=0x8EEB0F154CDC4AF8 vdev_path=/home/hutter/zfs/file1
Jul 01 15:40:38 zed[5975]: eid=1353 class=io pool_guid=0x8EEB0F154CDC4AF8 vdev_path=/home/hutter/zfs/file1
Jul 01 15:40:39 zed[5988]: eid=1354 class=statechange pool_guid=0x8EEB0F154CDC4AF8 vdev_path=/home/hutter/zfs/file1 vdev_state=FAULTED
Jul 01 15:40:40 zed[6007]: eid=1355 class=scrub_start pool_guid=0x8EEB0F154CDC4AF8
Jul 01 15:40:40 zed[6020]: eid=1356 class=history_event pool_guid=0x8EEB0F154CDC4AF8
Jul 01 15:40:40 zed[6034]: eid=1357 class=history_event pool_guid=0x8EEB0F154CDC4AF8
Jul 01 15:40:41 zed[6048]: eid=1358 class=scrub_start pool_guid=0x8EEB0F154CDC4AF8
Jul 01 15:40:41 zed[6061]: eid=1359 class=history_event pool_guid=0x8EEB0F154CDC4AF8
Jul 01 15:40:42 zed[6075]: eid=1360 class=config_sync pool_guid=0x8EEB0F154CDC4AF8
Jul 01 15:40:43 zed[6088]: eid=1361 class=vdev_spare pool_guid=0x8EEB0F154CDC4AF8 vdev_path=/home/hutter/zfs/file3 vdev_state=ONLINE
Jul 01 15:40:43 zed[6101]: eid=1362 class=vdev_attach pool_guid=0x8EEB0F154CDC4AF8 vdev_path=/home/hutter/zfs/file3 vdev_state=ONLINE
Would zpool offline -f generate the event message that the zed agents are looking for? In cmd/zed/agents/zfs_diagnosis.c it appears to be looking for io, checksum, or probe failures.
I was hoping there was a way, perhaps with zdb or by echoing a string to a module parameter handle in sysfs, to inject a message as a ZFS event and diagnose what ZED is reacting to and how it engages.
see zinject
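For reference, a rough sketch of the zinject approach the auto_spare tests take (pool and file-vdev paths follow the earlier example; exact flags may need adjusting for your release):

# Inject I/O errors on every read/write to one mirror leg of the pool
zinject -d /home/hutter/zfs/file2 -e io -T all -f 100 tank
# Drive some I/O so the injected errors are actually hit and reported
dd if=/dev/urandom of=/tank/junk bs=1M count=64 conv=fsync
# Watch the diagnosis/retire agents react, then check whether the spare attached
zpool events -f
zpool status tank
# Clear all injection handlers when finished
zinject -c all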
So I just tried zpool offline -f again, and it did activate the spare. Then I rebooted my VM, tried it again, and it didn't activate the spare. I'll need to do some more digging...
I suspect what we really need is a "test spare activation after a drive has disappeared" test case. The current test cases just test faults via injecting IO/checksum errors. @AeonJJohnson mentioned that this is from a multipathed SAS array, which, if it's like the arrays I've seen, often have bad drives disappear and reappear. That may help us reproduce what @AeonJJohnson is seeing.
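As a rough starting point for such a test case, a sketch using file vdevs (paths and sizes are arbitrary placeholders); zpool offline -f is the closest stand-in here for a disk that has simply vanished:

truncate -s 256M /var/tmp/file1 /var/tmp/file2 /var/tmp/file3
zpool create tank mirror /var/tmp/file1 /var/tmp/file2 spare /var/tmp/file3
# Force-fault one mirror leg; ZED should then attach the spare automatically
zpool offline -f tank /var/tmp/file2
# Expect spare-0 with file3 INUSE if the spare kicked in
zpool status tank
# Clean up afterwards: return the spare and mark the faulted leg repaired
zpool detach tank /var/tmp/file3
zpool clear tank /var/tmp/file2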
@AeonJJohnson did you happen to see any SCSI/multipath errors in dmesg at the time of the disk failure?
There were SCSI errors (hardware/media errors on reads and writes):
Jun 28 00:02:17 myhost kernel: sd 0:0:124:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 28 00:02:17 myhost kernel: sd 0:0:124:0: [sda] tag#0 Sense Key : Hardware Error [current] [descriptor]
Jun 28 00:02:17 myhost kernel: sd 0:0:124:0: [sda] tag#0 ASC=0x44 <<vendor>>ASCQ=0xf9
Jun 28 00:02:17 myhost kernel: sd 0:0:124:0: [sda] tag#0 CDB: Write(10) 2a 00 00 00 00 3f 00 00 01 00
Jun 28 00:02:17 myhost kernel: sd 0:0:124:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 28 00:02:17 myhost kernel: sd 0:0:124:0: [sda] tag#0 Sense Key : Hardware Error [current] [descriptor]
Jun 28 00:02:17 myhost kernel: sd 0:0:124:0: [sda] tag#0 ASC=0x44 <<vendor>>ASCQ=0xf9
Jun 28 00:02:17 myhost kernel: sd 0:0:124:0: [sda] tag#0 CDB: Read(10) 28 00 74 70 24 c2 00 00 02 00
Jun 28 00:02:19 myhost kernel: sd 0:0:124:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 28 00:02:19 myhost kernel: sd 0:0:124:0: [sda] tag#0 Sense Key : Hardware Error [current] [descriptor]
Jun 28 00:02:19 myhost kernel: sd 0:0:124:0: [sda] tag#0 ASC=0x44 <<vendor>>ASCQ=0xf9
Jun 28 00:02:19 myhost kernel: sd 0:0:124:0: [sda] tag#0 CDB: Read(10) 28 00 00 00 00 42 00 00 02 00
Jun 28 00:02:19 myhost kernel: sd 0:0:124:0: [sda] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 28 00:02:19 myhost kernel: sd 0:0:124:0: [sda] tag#1 Sense Key : Hardware Error [current] [descriptor]
Jun 28 00:02:19 myhost kernel: sd 0:0:124:0: [sda] tag#1 ASC=0x44 <<vendor>>ASCQ=0xf9
Jun 28 00:02:19 myhost kernel: sd 0:0:124:0: [sda] tag#1 CDB: Read(10) 28 00 74 70 25 02 00 00 02 00
...
Jun 28 00:02:17 myhost kernel: sd 11:0:123:0: [sddd] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 28 00:02:17 myhost kernel: sd 11:0:123:0: [sddd] tag#0 Sense Key : Hardware Error [current] [descriptor]
Jun 28 00:02:17 myhost kernel: sd 11:0:123:0: [sddd] tag#0 ASC=0x44 <<vendor>>ASCQ=0xf9
Jun 28 00:02:17 myhost kernel: sd 11:0:123:0: [sddd] tag#0 CDB: Read(10) 28 00 00 00 00 42 00 00 02 00
Jun 28 00:02:17 myhost kernel: sd 11:0:123:0: [sddd] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 28 00:02:17 myhost kernel: sd 11:0:123:0: [sddd] tag#1 Sense Key : Hardware Error [current] [descriptor]
Jun 28 00:02:17 myhost kernel: sd 11:0:123:0: [sddd] tag#1 ASC=0x44 <<vendor>>ASCQ=0xf9
Jun 28 00:02:17 myhost kernel: sd 11:0:123:0: [sddd] tag#1 CDB: Read(10) 28 00 74 70 25 02 00 00 02 00
Jun 28 00:02:19 myhost kernel: sd 11:0:123:0: [sddd] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 28 00:02:19 myhost kernel: sd 11:0:123:0: [sddd] tag#0 Sense Key : Hardware Error [current] [descriptor]
Jun 28 00:02:19 myhost kernel: sd 11:0:123:0: [sddd] tag#0 ASC=0x44 <<vendor>>ASCQ=0xf9
Jun 28 00:02:19 myhost kernel: sd 11:0:123:0: [sddd] tag#0 CDB: Read(10) 28 00 74 70 24 c2 00 00 02 00
And some multipath TUR checker errors as the drive started to take a dirt nap. Failures are rarely graceful.
Jun 28 00:02:40 myhost multipathd: 35000cca0990668d8: sda - tur checker reports path is down
Jun 28 00:02:40 myhost multipathd: checker failed path 8:0 in map 35000cca0990668d8
Jun 28 00:02:40 myhost kernel: device-mapper: multipath: Failing path 8:0.
Jun 28 00:02:40 myhost multipathd: 35000cca0990668d8: remaining active paths: 1
Jun 28 00:02:49 myhost multipathd: 35000cca0990668d8: sddd - tur checker reports path is down
Jun 28 00:02:49 myhost kernel: device-mapper: multipath: Failing path 70:176.
Jun 28 00:02:49 myhost multipathd: checker failed path 70:176 in map 35000cca0990668d8
Jun 28 00:02:49 myhost multipathd: 35000cca0990668d8: remaining active paths: 0
Then correlate the above timestamps with the zpool events timestamps:
Jun 28 2019 00:02:17.499553228 ereport.fs.zfs.io
Jun 28 2019 00:02:17.755552169 ereport.fs.zfs.io
Jun 28 2019 00:02:17.766552123 ereport.fs.zfs.io
Jun 28 2019 00:02:17.766552123 ereport.fs.zfs.probe_failure
Jun 28 2019 00:02:19.102546594 ereport.fs.zfs.io
Jun 28 2019 00:02:19.103546590 ereport.fs.zfs.io
Jun 28 2019 00:02:19.107546574 ereport.fs.zfs.io
Jun 28 2019 00:02:19.107546574 ereport.fs.zfs.probe_failure
Jun 28 2019 00:02:19.107546574 resource.fs.zfs.statechange
Jun 28 2019 00:02:40.751457010 resource.fs.zfs.statechange
Jun 28 2019 00:02:40.758456981 ereport.fs.zfs.io
Jun 28 2019 00:02:40.760456973 ereport.fs.zfs.io
Jun 28 2019 00:02:40.761456969 ereport.fs.zfs.io
Jun 28 2019 00:02:40.761456969 ereport.fs.zfs.probe_failure
Jun 28 2019 00:02:40.761456969 resource.fs.zfs.statechange
Jun 28 2019 00:02:41.737452930 sysevent.fs.zfs.config_sync
Jun 28 2019 00:02:41.738452926 resource.fs.zfs.statechange
Jun 28 2019 00:02:41.739452921 ereport.fs.zfs.io
Jun 28 2019 00:02:41.740452917 ereport.fs.zfs.io
Jun 28 2019 00:02:41.741452913 ereport.fs.zfs.io
Jun 28 2019 00:02:41.741452913 ereport.fs.zfs.probe_failure
Jun 28 2019 00:02:41.741452913 resource.fs.zfs.statechange
Jun 28 2019 00:02:42.204450997 sysevent.fs.zfs.config_sync
Jun 28 2019 00:02:49.714419921 resource.fs.zfs.statechange
Jun 28 2019 00:02:49.716419912 ereport.fs.zfs.io
Jun 28 2019 00:02:49.717419908 ereport.fs.zfs.io
Jun 28 2019 00:02:49.719419900 ereport.fs.zfs.io
Jun 28 2019 00:02:49.719419900 ereport.fs.zfs.probe_failure
Jun 28 2019 00:02:49.719419900 resource.fs.zfs.statechange
Jun 28 2019 00:02:50.212417860 sysevent.fs.zfs.config_sync
Jun 28 2019 00:02:50.226417802 resource.fs.zfs.statechange
Jun 28 2019 00:02:50.226417802 ereport.fs.zfs.io
Jun 28 2019 00:02:50.226417802 ereport.fs.zfs.io
Jun 28 2019 00:02:50.230417785 ereport.fs.zfs.io
Jun 28 2019 00:02:50.230417785 ereport.fs.zfs.probe_failure
Jun 28 2019 00:02:50.230417785 resource.fs.zfs.statechange
Jun 28 2019 00:02:50.674415948 sysevent.fs.zfs.config_sync
Here is a pastebin of the zpool events -v output.
I would think that ZED would have seen what it needed and engaged the S1 hot-spare after the statechange at 00:02:19.107546574.
Jun 28 2019 00:02:19.107546574 ereport.fs.zfs.probe_failure
class = "ereport.fs.zfs.probe_failure"
ena = 0x68ef16d804700c01
detector = (embedded nvlist)
version = 0x0
scheme = "zfs"
pool = 0x1e0d9667dbec61a2
vdev = 0x11475c335622fbc9
(end detector)
pool = "mypool"
pool_guid = 0x1e0d9667dbec61a2
pool_state = 0x0
pool_context = 0x0
pool_failmode = "wait"
vdev_guid = 0x11475c335622fbc9
vdev_type = "disk"
vdev_path = "/dev/disk/by-vdev/A0"
vdev_devid = "dm-uuid-mpath-35000cca0990668d8"
vdev_enc_sysfs_path = "/sys/class/enclosure/0:0:123:0/SLOT 00,VAG3J8KL "
vdev_ashift = 0xc
vdev_complete_ts = 0x6d68ef16cd896
vdev_delta_ts = 0x5babb3
vdev_read_errors = 0x6
vdev_write_errors = 0x1
vdev_cksum_errors = 0x0
parent_guid = 0x21e530e68cb794cf
parent_type = "raidz"
vdev_spare_paths = "/dev/disk/by-vdev/S1"
vdev_spare_guids = 0x8637292f9b87e99e
prev_state = 0x0
time = 0x5d15914b 0x66907ce
eid = 0x8c
Jun 28 2019 00:02:19.107546574 resource.fs.zfs.statechange
version = 0x0
class = "resource.fs.zfs.statechange"
pool = "mypool"
pool_guid = 0x1e0d9667dbec61a2
pool_state = 0x0
pool_context = 0x0
vdev_guid = 0x11475c335622fbc9
vdev_state = "FAULTED" (0x5)
vdev_path = "/dev/disk/by-vdev/A0"
vdev_devid = "dm-uuid-mpath-35000cca0990668d8"
vdev_physpath = "A0"
vdev_enc_sysfs_path = "/sys/class/enclosure/0:0:123:0/SLOT 00,VAG3J8KL "
vdev_laststate = "ONLINE" (0x7)
time = 0x5d15914b 0x66907ce
eid = 0x8d
ZED saw something it liked, as it turned on the drive slot fault LED at 00:02:19:
Jun 28 00:02:17 myhost zed: eid=133 class=io pool_guid=0x1E0D9667DBEC61A2 vdev_path=/dev/disk/by-vdev/A0
Jun 28 00:02:17 myhost zed: eid=134 class=io pool_guid=0x1E0D9667DBEC61A2 vdev_path=/dev/disk/by-vdev/A0
Jun 28 00:02:17 myhost zed: eid=135 class=io pool_guid=0x1E0D9667DBEC61A2 vdev_path=/dev/disk/by-vdev/A0
Jun 28 00:02:17 myhost zed: eid=136 class=probe_failure pool_guid=0x1E0D9667DBEC61A2 vdev_path=/dev/disk/by-vdev/A0
Jun 28 00:02:19 myhost zed: eid=137 class=io pool_guid=0x1E0D9667DBEC61A2 vdev_path=/dev/disk/by-vdev/A0
Jun 28 00:02:19 myhost zed: eid=138 class=io pool_guid=0x1E0D9667DBEC61A2 vdev_path=/dev/disk/by-vdev/A0
Jun 28 00:02:19 myhost zed: eid=139 class=io pool_guid=0x1E0D9667DBEC61A2 vdev_path=/dev/disk/by-vdev/A0
Jun 28 00:02:19 myhost zed: eid=140 class=probe_failure pool_guid=0x1E0D9667DBEC61A2 vdev_path=/dev/disk/by-vdev/A0
Jun 28 00:02:19 myhost zed: eid=141 class=statechange pool_guid=0x1E0D9667DBEC61A2 vdev_path=/dev/disk/by-vdev/A0 vdev_state=FAULTED
Jun 28 00:02:19 myhost zed: vdev A0 set '/sys/class/enclosure/0:0:123:0/SLOT 00,VAG3J8KL /fault' LED to 1
I've also recently experienced failure to activate a spare on 0.7.13 under similar conditions (multipathd handling the drives). I can't offer the zevents because a recent zfs recv overflowed the events buffer.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
poke
This still appears to be a defect; others are reporting mixed results. We are currently seeing it at customer sites as well.
Any thoughts on what/where to dig on this?
@AeonJJohnson we have made a couple of fixes in this area (4b5c9d9f976375ca4f075a533d2b6f0c9a14cf4a, 0aacde2e9a818942f4e25ab765b3a4849aac6de8, 32366649d3164e994def7e5a26765ca6bc1e2315) over the last year or so; if you can reproduce this on demand, I think it'd be worth trying the 2.1-rc6 release. In particular, I think 0aacde2e9a818942f4e25ab765b3a4849aac6de8 could sort this out, since it updates the kernel modules to explicitly issue a removal event when they determine the vdev is gone. The ZED should already have been getting a udev notification for this, but if for some reason it got lost, the spare might not get kicked in (and there's no harm in sending it twice).
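In case it helps reproduce the multipath scenario against 2.1-rc6: one way to make a disk "disappear" without pulling it is to fail or delete its paths from the kernel side while watching the event stream. A sketch follows; the device names, multipath map, and zfs-zed unit name are placeholders/assumptions, and older multipath-tools may need the interactive multipathd -k shell instead of direct commands:

# Follow the kernel-posted events and ZED's log while the device goes away
zpool events -f &
journalctl -f -u zfs-zed &
# Fail both paths of the multipath device...
multipathd fail path sda
multipathd fail path sddd
# ...or delete the underlying SCSI devices entirely so the vdev truly vanishes
echo 1 > /sys/block/sda/device/delete
echo 1 > /sys/block/sddd/device/delete
# A removal/statechange event should be followed by vdev_spare / vdev_attach if the spare engages
zpool status mypool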
The statechange event itself isn't enough to cause a hot spare to be kicked in. If you're willing to dig into the code, I'd suggest starting with the zfs_retire_recv() function; this is what should be called when the ZED determines a device should be replaced by a hot spare.
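Before (or alongside) reading the code, it may help to confirm which event classes actually reach the agents when the failure happens for real. A rough way to do that, assuming ZED runs under a zfs-zed systemd unit (adjust for your init system):

# Stop the daemonized ZED and run it in the foreground, verbose
systemctl stop zfs-zed
zed -Fv 2>&1 | tee /tmp/zed.log &
# ...reproduce the drive failure, then check what arrived and in what order;
# a successful retire ends with vdev_spare and vdev_attach, as in the log above
grep -E 'class=(io|probe_failure|statechange|vdev_spare|vdev_attach)' /tmp/zed.log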
Another potentially related change is https://github.com/openzfs/zfs/pull/11355.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
ZFS config environment
Describe the problem you're observing
ZED will not engage hot spare upon drive failure. Failing drive gets faulted, pool goes degraded, zpool history shows state change, ZED sees failure and sets enclosure fault LED but hot spare is not engaged and resilver to hot spare does not start.
Describe how to reproduce the problem
Build a pool. Wait for drive to fail blocks. Profit.
Include any warning/errors/backtraces from the system logs
ZED sees the fault state change on the failing drive and turns on its chassis LED
Pool events show state change:
The pool has a dedicated hot spare that is available
Running zed -F -v indicates that zed can properly see the current degraded state of the pool
I am able to manually resilver to the hot-spare by running: zpool replace mypool A0 S1
I have seen this multiple times and am unable to track down why ZED won't engage the hot-spare S1 and start the resilver. There are no logs indicating any errors or failures encountered in an attempt to engage the hot-spare drive.
Autoreplace is set to off, which, from everything I know and have read, is not a requirement for ZED to engage and resilver to a hot-spare upon drive fault (the basic checks are sketched at the end of this report).
I have seen references to needing to specifically set a hot-spare action in zed.rc, but the 2014 references to this method do not appear in the 0.7.9 code tree, so that appears to be a red herring.
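A quick sketch of the checks that confirm the prerequisites above (pool, vdev, and spare names as in this report; the zfs-zed unit name is an assumption):

systemctl status zfs-zed                                 # ZED must be running for automatic spare activation
zpool status mypool                                      # spare S1 should show AVAIL (or INUSE once engaged)
zpool get autoreplace mypool                             # off is fine; autoreplace covers same-slot replacement, not spares
zpool events -v | grep -E 'probe_failure|statechange'    # confirm the fault actually generated events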