Open macronet opened 2 years ago
So, IIRC, when someone reported this to me the other day, and I made faces, the key detail from their report that horrified me was that, while the disk shrinks to 1 GB, it reportedly did not return an error when trying to write to LBAs past 1 GB?
I didn't get to debug this live, only after it happened and the server in question had already been rebooted. After the reboot, the pool had kicked the disk out (UNAVAIL), but the dRAID spare was not activated.
Based on monitoring, CPU I/O wait jumped to ~75%, and during that time the pool evidently noticed something, since I/O was suspended.
With the earlier setup (Solaris), the end result was just a sudden full lockup, and the outcome of the guessing game was "it tried to write past the 1 GB mark and ran out of memory doing so". So the disk either doesn't return an error, or it stalls (accepts writes but never acknowledges them as done).
Will add more information if/when this happens again (and if I'm able to debug more).
zpool is hung & iowait jumps, doesn't recover by itself.
Please describe the layout of the pool that the shrunken disk is part of.
If there is no redundancy, there is nothing to recover.
If there is redundancy, you are right and it should be honoured.
24 SSDs, draid2:8d:24c:1s-0
Regarding your specific "Samsung 1GB failure mode": https://blog.muwave.de/2019/09/samsung-ssd-resurrection/
As far as ZFS goes, we could detect a shrunken drive, assuming udev generates a change event for it (and you were running zed). It would look something like:
```diff
diff --git a/cmd/zed/agents/zfs_mod.c b/cmd/zed/agents/zfs_mod.c
index 7364dd2..0013b50 100644
--- a/cmd/zed/agents/zfs_mod.c
+++ b/cmd/zed/agents/zfs_mod.c
@@ -1097,6 +1097,14 @@ zfsdle_vdev_online(zpool_handle_t *zhp, void *data)
 			    __func__, fullpath, conf_size,
 			    MAX(udev_size, udev_parent_size),
 			    zpool_get_name(zhp), error);
+		} else if (udev_size < conf_size) {
+			error = zpool_vdev_offline(zhp, fullpath, 0);
+			zed_log_msg(LOG_INFO,
+			    "%s: '%s' shrank from %llu"
+			    " to %llu bytes in pool '%s': %d",
+			    __func__, fullpath, conf_size,
+			    MAX(udev_size, udev_parent_size),
+			    zpool_get_name(zhp), error);
 		}
 	}
 }
```
:arrow_up: This check would actually be outside of the autoexpand block though.
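To make the point about the autoexpand block concrete, here is a minimal sketch of the decision on its own, separated from the zed plumbing. The names `size_action` and `classify_size_change` are hypothetical and do not exist in zed; treating a reported size of 0 as "unknown" is also an assumption of this sketch, not something taken from the patch above.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch; not part of zed. */
enum size_action {
	SIZE_UNCHANGED,	/* nothing to do */
	SIZE_EXPAND,	/* device grew and autoexpand is enabled */
	SIZE_OFFLINE	/* device shrank below the configured size */
};

/*
 * Decide what to do when udev reports a size change for a vdev.
 * conf_size is the size recorded in the pool config, udev_size the
 * size currently reported for the device, both in bytes.
 */
static enum size_action
classify_size_change(uint64_t conf_size, uint64_t udev_size, int autoexpand)
{
	/* Assumption: a size of 0 means udev could not report one. */
	if (udev_size == 0)
		return (SIZE_UNCHANGED);

	/* The shrink check does not depend on the autoexpand property. */
	if (udev_size < conf_size)
		return (SIZE_OFFLINE);

	/* Growing the vdev is only attempted when autoexpand is on. */
	if (autoexpand && udev_size > conf_size)
		return (SIZE_EXPAND);

	return (SIZE_UNCHANGED);
}
```

With this ordering, a disk that drops from its configured size to 1 GB would be flagged for offlining whether or not autoexpand is set on the pool.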
System information
Describe the problem you're observing
A Samsung SSD fails and drops into the ERRORMOD firmware state, and its size suddenly shrinks to 1 GB. The zpool hangs and iowait jumps; it doesn't recover by itself.
Describe how to reproduce the problem
Shrink an online disk during normal operation without notifying the kernel and ignoring locks (no idea how to accomplish this). I tried to reproduce it with LVM but couldn't do so successfully; the closest I got was dmsetup suspend, which hung the pool.
Include any warning/errors/backtraces from the system logs
Unfortunately, I didn't gather logs during the (quite random) event.