exitcomestothis opened this issue 4 years ago
After doing a little more research, it seems that this issue was brought up and closed as fixed back in 2017 per #4653. Granted, the issue in #4653 was about corrupted metadata, but the disk there was still showing as 'UNAVAIL', as it is in my case.
The last comment by @tonyhutter indicates that after this fix, zed must be running. I double-checked using `ps ax | grep zed` and `service zfs-zed status`, and it does indeed seem to be running.
Is this possibly a bug that was resolved but has now reared its head again?
Doing a little more testing on this with regard to faulting the drive on bad I/Os from #4653 above: I tried saving 4 VMs totaling about 1 TB to my pool, but still didn't receive any error from zed regarding the pool being degraded.
root@pve:~# zpool status -x
pool: zfsTest
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: scrub repaired 0B in 0 days 00:04:29 with 0 errors on Wed Mar 11 18:33:28 2020
config:
NAME STATE READ WRITE CKSUM
zfsTest DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
wwn-0x50014ee266e22f93 ONLINE 0 0 0
5059338246820777521 UNAVAIL 0 0 0 was /dev/disk/by-id/wwn-0x50014ee2b36b4fea-part1
wwn-0x50014ee0ae6dce56 ONLINE 0 0 0
wwn-0x50014ee059175903 ONLINE 0 0 0
wwn-0x50014ee2b367975e ONLINE 0 0 0
wwn-0x50014ee0595d1d74 ONLINE 0 0 0
wwn-0x50014ee059174070 ONLINE 0 0 0
wwn-0x50014ee605b5455f ONLINE 0 0 0
I've been letting these VMs run for the last 5 days, but there's still no alert from zed regarding the pool being degraded.
I just ran into this problem myself. The problem is caused by the statechange-notify.sh zedlet only sending a notification for the states FAULTED, DEGRADED, and REMOVED; unplugging a disk results in the state UNAVAIL.
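For context, the gist of that state filter looks roughly like the sketch below (paraphrased, not copied verbatim from any particular release); the obvious change is to also treat UNAVAIL as notify-worthy:

```sh
# Sketch (not verbatim) of the state filter in statechange-notify.sh.
# ZEVENT_VDEV_STATE_STR is one of the variables ZED exports to its zedlets;
# the exit code used here for "ignore this event" is illustrative only.
if [ "${ZEVENT_VDEV_STATE_STR}" != "FAULTED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "DEGRADED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "REMOVED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "UNAVAIL" ]; then   # UNAVAIL is the added clause
    exit 0
fi
# ...the notification/email logic runs below this point...
```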
I have a modified version of the zedlet that I'm testing out. I'm planning to open a pull request with this change soon, assuming that it works correctly.
Hey Courtney! Glad to hear it wasn't just me that was experiencing this issue.
I also looked at that script and am pretty confident I added a line to include the UNAVAIL status, but zed still didn't detect it until after a restart. Maybe there's code outside of this script itself that also ignores the UNAVAIL state?
I ended up resolving this by using a script I found from a user on the Proxmox forums, which I have running as a cron job every 5 minutes to check the status. I also modified this script to make a second one that runs once a night and emails me the status of the pool, just for good measure. Both are attached below. zfs.zip
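For anyone who wants the same kind of stopgap without digging through the attachment, a minimal sketch of such a cron check is below; the recipient, script path, and use of mail(1) are my assumptions, not the exact forum script:

```sh
#!/bin/sh
# Minimal sketch of a cron-driven pool health check.
# Assumes a working local MTA and the mail(1) command; adjust RECIPIENT.
RECIPIENT="root"

STATUS="$(zpool status -x)"
if [ "${STATUS}" != "all pools are healthy" ]; then
    printf '%s\n' "${STATUS}" | mail -s "ZFS pool problem on $(hostname)" "${RECIPIENT}"
fi
```

Scheduled via something like `*/5 * * * * root /usr/local/sbin/check-zpool.sh` in /etc/crontab (the path is hypothetical); the nightly status mail is the same idea with `zpool status` on a daily schedule.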
Ran into this today... Nothing happening for that PR, @cbane?
Just an update on this.
Unfortunately I had a drive fail on me yesterday at around 11:45pm and was greeted with almost 3 dozen emails the following AM. I'm thankful to report that ZED reported my pool had an issue. I replaced the drive, ZFS resilvered, and about 3 hours later everything was wonderful again. F@ck yeah @openzfs!!! Way to go!
I'd also like to thank WD (@westerndigital) for making a small exception to their RMA process, since this drive technically failed on the LAST day within warranty but I didn't notice until the next day (unfortunately I do need sleep). This is why I have always used, and will continue to use, WD drives. Thanks again @westerndigital!!
My first email stated that the drive in question was "faulted" due to "too many errors". My server is a Dell R515 (Proxmox 6.1-3, 128 GB RAM) with an 8-bay hot-swap backplane. As soon as I removed the failed drive from the system, I received another email from ZED stating that the device had been faulted.
All in all, it seems that ZED is indeed working to a large degree, as this poor WD Gold drive is indeed toast.
Just wanted to share my experience on a real world ZED failure alert.
Here's an output of the (many) failure emails:
```
NAME   SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Vol1  14.5T   663G  13.9T        -         -     4%     4%  1.00x  DEGRADED        -

  pool: Vol1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
  scan: scrub repaired 0B in 0 days 00:45:51 with 0 errors on Sun Jul 12 01:09:52 2020
config:
NAME STATE READ WRITE CKSUM
zfsTest DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
wwn-0x50014ee266e22f93 ONLINE 0 0 0
wwn-0x50014ee2b36b4fea ONLINE 0 0 0
wwn-0x50014ee0ae6dce56 ONLINE 0 0 0
wwn-0x50014ee059175903 ONLINE 0 0 0
wwn-0x50014ee2b367975e ONLINE 0 0 0
wwn-0x50014ee0595d1d74 FAULTED 9 3 0 too many errors
wwn-0x50014ee059174070 ONLINE 0 0 0
wwn-0x50014ee605b5455f ONLINE 0 0 0
errors: No known data errors
```
Output of the fixed email:
```
ZFS has finished a resilver:

  eid: 109
 class: resilver_finish
  host: pve
  time: 2020-07-29 11:59:43-0700
  pool: Vol1
 state: DEGRADED
  scan: resilvered 62.0G in 0 days 01:00:26 with 0 errors on Wed Jul 29 11:59:43 2020
config:
NAME STATE READ WRITE CKSUM
Vol1 DEGRADED 0 0 0
raidz2-0 DEGRADED 0 0 0
wwn-0x50014ee266e22f93 ONLINE 0 0 0
wwn-0x50014ee2b36b4fea ONLINE 0 0 0
wwn-0x50014ee0ae6dce56 ONLINE 0 0 0
wwn-0x50014ee059175903 ONLINE 0 0 0
wwn-0x50014ee2b367975e ONLINE 0 0 0
replacing-5 DEGRADED 0 0 0
wwn-0x50014ee0595d1d74 FAULTED 9 3 0 too many errors
sdg ONLINE 0 0 0
wwn-0x50014ee059174070 ONLINE 0 0 0
wwn-0x50014ee605b5455f ONLINE 0 0 0
errors: No known data errors
```
I came across this today after experiencing the same issue. I decided to poke around cbane's GitHub to see if they had it in a repo somewhere, and sure enough they did: https://github.com/cbane/zfs/commit/f4f16389413061ed0b670df1cbd17954518a3096 is the commit. It's pretty simple to add anyway.
Hopefully we'll see this added to the repo. I would submit the pull request, but it's not my code, so I really don't want to butt heads.
@cbane if you can open a PR with your fix I'd be happy to merge it.
I just experienced this last night: one of my Intel SSDs in a ZFS mirror disconnected. The drives are used and seem to have been hammered by the previous owner. The SSDs were showing signs of issues, since I've been watching their wearout indicators slowly tick up over the last couple of months, so I have replacements on the way. I was surprised to find the array degraded and no email warning, even though I do have emails configured and working. The Intel SSD just showed as unavailable, disconnected, not responding. That is actually a common failure mode for SSDs, and it absolutely should be a mode that triggers an email, since it degrades the pool.
For those interested, in my case I was able to manually pull the bad drive, plug it right back in, and it came back to life. I was able to offline/online the drive to trigger a resilver and the pool went back to healthy with a clean scrub. Fingers crossed until the replacements come in.
I'm going to manually apply @cbane's patch, but I hope someone else picks this up if @cbane is too busy; this is critical in my opinion.
The more I think about this, the more I wonder: why is the statechange script checking only certain vdev states? At the very least, it should be reduced to checking for "not online" rather than for specific states (for example, OFFLINE is also not considered).
And I wonder why this is only checking vdev state. Arguably, this script should be checking only pool status and sending emails based on the state of the pool, not of individual vdevs in the pool. Unless I'm ZFS-ignorant and certain configurations can have unhealthy drives but still a healthy pool...
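To make that concrete, a "not ONLINE" variant of the guard could look something like this sketch (same assumption about the ZEVENT_VDEV_STATE_STR variable as in the earlier sketch; whether OFFLINE, which is usually administrator-initiated, should also notify is a separate question):

```sh
# Hypothetical alternative: notify on any vdev state other than ONLINE,
# instead of enumerating FAULTED/DEGRADED/REMOVED/UNAVAIL individually.
if [ "${ZEVENT_VDEV_STATE_STR}" = "ONLINE" ]; then
    exit 0   # healthy vdev, nothing to report
fi
```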
I was tearing my hair out for the last 30 minutes trying to figure out why, on my test system, I would not get "degraded" emails, while I did get emails once I reconnected the disk.
I am shocked that this issue has existed since 2020 and still hasn't been fixed...
I've edited the statechange-notify.sh file by hand way back when I filed #10123; however, even after editing and testing today on a new machine, I still don't get any sort of notice, aside from a resilver-finish event. I'm running Proxmox 7.0-13, with zfs-2.0.6-pve1~bpo10+1 and zfs-kmod-2.0.5-pve1.
Adding the changes in #12630 doesn't result in any notifications for me.
If anyone would like to have me do further testing on this, I'm available to do so, as I have a multi-drive server available for testing, and would really like to see this (critical, IMO) bug squashed.
@exitcomestothis I run Proxmox 7 with ZFS as well and found the same problem a couple of months ago. I reviewed my notes and noticed that, apart from applying https://github.com/cbane/zfs/commit/f4f16389413061ed0b670df1cbd17954518a3096, I changed these lines in /etc/zfs/zed.d/zed.rc:
-ZED_NOTIFY_VERBOSE=0
+ZED_NOTIFY_VERBOSE=1
-ZED_SYSLOG_SUBCLASS_INCLUDE="checksum|scrub_*|vdev.*"
+ZED_SYSLOG_SUBCLASS_INCLUDE="*"
> @exitcomestothis I run Proxmox 7 with ZFS as well and found the same problem a couple of months ago. I reviewed my notes and I noticed that I changed these lines in /etc/zfs/zed.d/zed.rc:
> -ZED_NOTIFY_VERBOSE=0
> +ZED_NOTIFY_VERBOSE=1
> -ZED_SYSLOG_SUBCLASS_INCLUDE="checksum|scrub_*|vdev.*"
> +ZED_SYSLOG_SUBCLASS_INCLUDE="*"
With this in place I DON'T get a notification when a drive is disconnected.
Only when I also add UNAVAIL, as per this patch, does it work: https://github.com/cbane/zfs/commit/f4f16389413061ed0b670df1cbd17954518a3096
I just did another test install and added these changes: https://github.com/cbane/zfs/commit/f4f16389413061ed0b670df1cbd17954518a3096. Now I do get a notification when I disconnect a drive. Without these edits I don't get a notification when a drive drops out.
Yeah, I assumed this patch https://github.com/cbane/zfs/commit/f4f16389413061ed0b670df1cbd17954518a3096 was already applied. I edited the original comment to clarify it.
So it's not fixed in Proxmox 7.1?
I just tested it myself and I don't get mails for disconnected drives, but I do for resilvering...
Anyway, it's an awesome piece of software: I could disconnect and reconnect a drive on the fly while moving big files, with no corruption in the end. But an alert would still be nice.
This fix was included as of the OpenZFS 2.1.3 release. Commit 487bb7762.
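For anyone unsure whether their installed packages already carry the fix, one quick (unofficial) check is to look for the UNAVAIL clause in the installed zedlet and confirm the running versions, e.g.:

```sh
# If this prints a matching line, the installed zedlet already handles UNAVAIL.
grep UNAVAIL /etc/zfs/zed.d/statechange-notify.sh
# Show the userland and kernel-module versions actually in use.
zfs version
```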
I am running Proxmox 7.2.11 with zfs-zed/stable,now 2.1.6-pve1 amd64.
After taking the only disk out of a hot-swap bay in a single-disk pool, no email was sent. I'm not sure whether this is ZED-related or ZFS itself, as zpool status still listed the pool as online, including the disk I had just pulled. Either way, it makes me feel very uncomfortable.
Should I report this here or over at the Proxmox community? They seem to be using their own flavor of the packages: zfs-initramfs/stable,now 2.1.6-pve1 all and zfsutils-linux/stable,now 2.1.6-pve1 amd64 came installed with Proxmox, and all have a pve1 suffix on the package name.
This is either a resurrection bug with more lives than a cat, or people (myself included) keep misunderstanding the expected behavior of ZED alerts for degraded ZFS pools.
I'm also expecting, but not receiving, the degraded-pool email alert when I boot my system after removing a disk from a pool (which leaves it degraded), while I do get the alert after resilvering completes, as expected (well, most of the time; exceptions are below).
Here is a recent and thorough series of manual tests from the Proxmox community that wraps it up very well: https://forum.proxmox.com/threads/no-email-notification-for-zfs-status-degraded.87629/post-520096
Note that the resilvering-completed alert is also not always sent if the resilvering was very quick. I've seen this on my system recently, e.g. a resilver that took 5s did not trigger an alert email, whereas longer resilvers send the email as expected.
I'm running proxmox-ve: 7.3-1 (running kernel: 5.15.102-1-pve) and zfs-zed 2.1.9-pve1
System information
Describe the problem you're observing
If I physically remove a drive from the array to simulate a disk completely failing, ZED does not send an alert regarding the pool being in a degraded state. My system is indeed configured to send emails properly, and I receive emails when a scrub or resilver completes, just not for issues like this.
I see that there are some other people with this issue, but those reports have been for "offline" vdevs rather than "unavail" vdevs.
Is there an option for zed to alert if the pool is degraded, regardless of how it was degraded?
I'm using a Dell R515 server with 128 GB ECC RAM; there are 8x WD RE drives connected to a PERC H310 that's been flashed to IT mode.
zed.rc config:
zpool status - before drive removed
zpool status - after drive was removed
Describe how to reproduce the problem
Create a raidz2 array within Proxmox; load data onto the array (I'm running just 1 Linux VM); power off the system and remove a drive; power on the system.
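For reference, a rough sketch of the pool-creation step, using the disk IDs from the status output above (the drive removal itself is physical and can't be scripted):

```sh
# Hypothetical reconstruction of the test pool; adjust device IDs to your system.
zpool create zfsTest raidz2 \
    /dev/disk/by-id/wwn-0x50014ee266e22f93 \
    /dev/disk/by-id/wwn-0x50014ee2b36b4fea \
    /dev/disk/by-id/wwn-0x50014ee0ae6dce56 \
    /dev/disk/by-id/wwn-0x50014ee059175903 \
    /dev/disk/by-id/wwn-0x50014ee2b367975e \
    /dev/disk/by-id/wwn-0x50014ee0595d1d74 \
    /dev/disk/by-id/wwn-0x50014ee059174070 \
    /dev/disk/by-id/wwn-0x50014ee605b5455f
# Then: put a VM's storage on the pool, power off, pull one drive,
# power back on, and watch whether ZED sends a degraded-pool notification.
```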
Include any warning/errors/backtraces from the system logs
Syslog after drive removed