Open cvoltz opened 5 years ago
@tonyhutter can you take a look at this? Thanks
I have a fix for this. I'll generate a PR for it as soon as I am finished running the ZFS test suite on it.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
I'm reopening this since this hasn't yet been addressed to my knowledge.
System information
Problem
When a drive in a pool is FAULTED (e.g., due to I/O errors) or the drive goes OFFLINE (e.g., the zpool offline command was run), the resource.fs.zfs.statechange event is generated with the vdev_state set appropriately. If the drive is brought online (e.g., the zpool online command was run), the resource.fs.zfs.statechange event is generated with the vdev_state set to ONLINE. However, if the drive is replaced using the zpool replace command, the resource.fs.zfs.statechange event is not generated.

Lustre 2.11 added the ZEDLET statechange-lustre.sh, which changes the obdfilter.*.degraded property for a target when the pool's state changes. It sets the degraded property if the pool is DEGRADED and resets the property if the pool is ONLINE. Since ZFS is not always generating the state change event, sometimes the target's degraded property is left set even when the pool is ONLINE, which reduces performance of the Lustre filesystem.

See https://jira.whamcloud.com/browse/LU-12836 for more information (including output from zpool events -v).

Steps to reproduce
ONLINE:
DEGRADED:
ONLINE:
and notice it only has the state change event for the pool going OFFLINE instead of also having a state change event for the pool going ONLINE. The output should have included an event like this:

Changing the zpool replace $pool $bad_drive $spare_drive command to zpool online $pool $bad_drive will result in the resource.fs.zfs.statechange event being generated when the pool goes ONLINE.

The Lustre issue includes the test-degraded-drive script which can be used for testing.
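As a rough aid for checking the behavior described above, the following Python sketch scans captured zpool events -v output and lists the vdev_state of every resource.fs.zfs.statechange event. It is not part of ZFS, Lustre, or the test-degraded-drive script. The function names are hypothetical, and the parser assumes each event starts with an unindented header line followed by indented key = value lines, with vdev_state printed as a quoted name.

```python
import re


def parse_events(text):
    """Split captured `zpool events -v` text into one dict per event.

    Assumption: an unindented line starts a new event and the indented
    lines that follow are `key = value` pairs belonging to it.
    """
    events = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Header line (or the column-title line): start a new record.
            current = {}
            events.append(current)
            continue
        if current is None:
            continue
        key, sep, value = line.strip().partition(" = ")
        if sep:
            current[key] = value
    return events


def statechange_states(text):
    """Return the vdev_state name of each statechange event, in order."""
    states = []
    for event in parse_events(text):
        if event.get("class", "").strip('"') != "resource.fs.zfs.statechange":
            continue
        # Assumption: the value looks like `"ONLINE" (0x7)`; keep the name only.
        match = re.match(r'"([^"]*)"', event.get("vdev_state", ""))
        if match:
            states.append(match.group(1))
    return states
```

Feeding it the output captured after the zpool replace step and checking whether "ONLINE" appears in the returned list is one way to confirm whether the expected event was generated.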
While we are looking at this specific scenario, we should investigate whether there are any other scenarios where the pool could change to ONLINE but not generate a corresponding state change event.
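One possible way to hunt for such scenarios is to sample the pool health periodically while exercising different operations, then cross-check each transition to ONLINE against the recorded state change events. The sketch below is only an illustration under stated assumptions: the function name and both input formats are hypothetical, timestamps are assumed to be comparable numbers from the same clock, and an ONLINE vdev_state event within the sampling interval is treated as the matching event.

```python
def online_transitions_without_event(health_samples, statechange_events):
    """Find transitions to ONLINE that have no matching statechange event.

    health_samples: chronological list of (timestamp, pool_health) pairs.
    statechange_events: list of (timestamp, vdev_state) pairs.
    Returns the timestamps of samples where the pool was ONLINE after a
    non-ONLINE sample and no ONLINE event fell inside that interval.
    """
    missing = []
    for (prev_t, prev_h), (cur_t, cur_h) in zip(health_samples, health_samples[1:]):
        if prev_h == "ONLINE" or cur_h != "ONLINE":
            continue  # not a transition into ONLINE
        seen = any(
            prev_t < t <= cur_t and state == "ONLINE"
            for t, state in statechange_events
        )
        if not seen:
            missing.append(cur_t)
    return missing
```

The health samples could be gathered with something like zpool list -H -o health on a timer. A shorter sampling interval narrows the window in which an unrelated ONLINE event could be mistaken for the matching one.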