openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

No zed alert if pool is degraded or if vdev unavailable #10123

Open exitcomestothis opened 4 years ago

exitcomestothis commented 4 years ago

System information

Type                  Version/Name
Distribution Name     ProxmoxVE
Distribution Version  6.1-3
Linux Kernel          5.3.10-1-pve
Architecture          x86-64
ZFS Version           0.8.2-pve2
SPL Version           0.8.2-pve2

Describe the problem you're observing

If I physically remove a drive from the array to simulate a disk completely failing, ZED does not send an alert about the pool being in a degraded state. My system is configured to send emails properly, and I receive emails when scrubs and resilvers complete, just not for issues like this.

I see that there are some other people with this issue, but those reports have been for "OFFLINE" vdevs rather than "UNAVAIL" vdevs.

Is there an option for zed to alert if the pool is degraded, regardless of how it was degraded?

I'm using a Dell R515 server with 128 GB of ECC RAM; there are 8x WD RE drives connected to a PERC H310 that has been flashed to IT mode.

zed.rc config:

ZED_DEBUG_LOG="/tmp/zed.debug.log"
ZED_EMAIL_ADDR="myemail@fqdn.com"
ZED_EMAIL_PROG="mail"
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
ZED_LOCKDIR="/var/lock"
ZED_NOTIFY_INTERVAL_SECS=3200
ZED_NOTIFY_VERBOSE=1
ZED_NOTIFY_DATA=1
ZED_RUNDIR="/var/run"
ZED_USE_ENCLOSURE_LEDS=1
ZED_SYSLOG_TAG="ZFS-ZED"
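
(A quick way to sanity-check that the mail path itself works, independent of ZED, is to send a test message through the same program configured above; this is just a sketch, using the placeholder address from zed.rc:)

# Rough manual test of the mail pipeline ZED is configured to use
# (assumes the 'mail' command and a working MTA are installed):
echo "ZED email test" | mail -s 'ZED test' myemail@fqdn.com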

zpool status - before drive removed

zpool status
  pool: zfsTest
 state: ONLINE
  scan: scrub repaired 0B in 0 days 00:04:29 with 0 errors on Wed Mar 11 18:33:28 2020
config:

        NAME                        STATE     READ WRITE CKSUM
        zfsTest                     ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x50014ee266e22f93  ONLINE       0     0     0
            wwn-0x50014ee2b36b4fea  ONLINE       0     0     0
            wwn-0x50014ee0ae6dce56  ONLINE       0     0     0
            wwn-0x50014ee059175903  ONLINE       0     0     0
            wwn-0x50014ee2b367975e  ONLINE       0     0     0
            wwn-0x50014ee0595d1d74  ONLINE       0     0     0
            wwn-0x50014ee059174070  ONLINE       0     0     0
            wwn-0x50014ee605b5455f  ONLINE       0     0     0

errors: No known data errors

zpool status - after drive was removed

root@pve:~# zpool status
  pool: zfsTest
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 0 days 00:04:29 with 0 errors on Wed Mar 11 18:33:28 2020
config:

        NAME                        STATE     READ WRITE CKSUM
        zfsTest                     DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            wwn-0x50014ee266e22f93  ONLINE       0     0     0
            5059338246820777521     UNAVAIL      0     0     0  was /dev/disk/by-id/wwn-0x50014ee2b36b4fea-part1
            wwn-0x50014ee0ae6dce56  ONLINE       0     0     0
            wwn-0x50014ee059175903  ONLINE       0     0     0
            wwn-0x50014ee2b367975e  ONLINE       0     0     0
            wwn-0x50014ee0595d1d74  ONLINE       0     0     0
            wwn-0x50014ee059174070  ONLINE       0     0     0
            wwn-0x50014ee605b5455f  ONLINE       0     0     0

errors: No known data errors

Describe how to reproduce the problem

Create a raidz2 array within Proxmox; load data onto the array (I'm running just one Linux VM); power off the system and remove a drive; power on the system.
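
(A rough command-line equivalent of those steps, as a sketch; the device names are placeholders, and in practice the pool was created through the Proxmox tooling:)

# Create a raidz2 pool from 8 disks (by-id names below are placeholders):
zpool create zfsTest raidz2 \
    /dev/disk/by-id/wwn-DISK1 /dev/disk/by-id/wwn-DISK2 \
    /dev/disk/by-id/wwn-DISK3 /dev/disk/by-id/wwn-DISK4 \
    /dev/disk/by-id/wwn-DISK5 /dev/disk/by-id/wwn-DISK6 \
    /dev/disk/by-id/wwn-DISK7 /dev/disk/by-id/wwn-DISK8
# Load some data, power off, physically pull one drive, power back on, then
# check the pool: the missing drive shows as UNAVAIL and the pool as DEGRADED,
# but no ZED email is sent.
zpool status zfsTest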

Include any warning/errors/backtraces from the system logs

Syslog after drive removed

Mar 11 18:52:05 pve ZFS-ZED: eid=1 class=history_event pool_guid=0x40155F4D82C26E64
Mar 11 18:52:05 pve ZFS-ZED: eid=2 class=config_sync pool_guid=0x40155F4D82C26E64
Mar 11 18:52:05 pve ZFS-ZED: eid=3 class=pool_import pool_guid=0x40155F4D82C26E64
Mar 11 18:52:05 pve ZFS-ZED: eid=4 class=history_event pool_guid=0x40155F4D82C26E64
Mar 11 18:52:05 pve ZFS-ZED: eid=5 class=config_sync pool_guid=0x40155F4D82C26E64

exitcomestothis commented 4 years ago

After doing a little more research, it seems this issue was brought up and closed as fixed back in 2017 in #4653. Granted, the issue in #4653 was for corrupted metadata, but the disk was still showing as 'UNAVAIL' there, as it is in my case.

The last comment by @tonyhutter indicates that after that fix, zed must be running. I double-checked using ps ax | grep zed and service zfs-zed status, and it does indeed seem to be running.

Is this possibly a bug that was resolved but has now reared its head again?

exitcomestothis commented 4 years ago

Doing a little more testing on this with regard to faulting the drive on bad I/Os, per #4653 above: I tried saving 4 VMs totaling about 1 TB to my pool, but still didn't receive any error from ZED about the pool being degraded.

root@pve:~# zpool status -x
  pool: zfsTest
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 0 days 00:04:29 with 0 errors on Wed Mar 11 18:33:28 2020
config:

        NAME                        STATE     READ WRITE CKSUM
        zfsTest                     DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            wwn-0x50014ee266e22f93  ONLINE       0     0     0
            5059338246820777521     UNAVAIL      0     0     0  was /dev/disk/by-id/wwn-0x50014ee2b36b4fea-part1
            wwn-0x50014ee0ae6dce56  ONLINE       0     0     0
            wwn-0x50014ee059175903  ONLINE       0     0     0
            wwn-0x50014ee2b367975e  ONLINE       0     0     0
            wwn-0x50014ee0595d1d74  ONLINE       0     0     0
            wwn-0x50014ee059174070  ONLINE       0     0     0
            wwn-0x50014ee605b5455f  ONLINE       0     0     0

exitcomestothis commented 4 years ago

Have been letting these VMs run for the last 5 days, but there's still no alert from ZED regarding the pool being degraded.

cbane commented 4 years ago

I just ran into this problem myself. The problem is that the statechange-notify.sh zedlet only sends a notification for the states FAULTED, DEGRADED, and REMOVED, while unplugging a disk results in the state UNAVAIL.

I have a modified version of the zedlet that I'm testing out. I'm planning to open a pull request with this change soon, assuming that it works correctly.
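
(For context, the state filter in /etc/zfs/zed.d/statechange-notify.sh looks roughly like the sketch below; this is paraphrased rather than the exact upstream source, and the last condition is the UNAVAIL case that the fix adds:)

# Sketch of the state filter in statechange-notify.sh (paraphrased).
# ZED passes the new vdev state to the zedlet in ZEVENT_VDEV_STATE_STR;
# any state not listed here makes the script exit without notifying.
if [ "${ZEVENT_VDEV_STATE_STR}" != "FAULTED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "DEGRADED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "REMOVED" ] \
        && [ "${ZEVENT_VDEV_STATE_STR}" != "UNAVAIL" ]; then   # UNAVAIL added by the fix
    exit 3
fi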

exitcomestothis commented 4 years ago

Hey Courtney! Glad to hear it wasn't just me experiencing this issue.

I also looked at that script and am pretty confident I added a line to include the UNAVAIL status, but ZED still didn't detect it until after a restart. Maybe there's code outside of the script itself that also ignores the UNAVAIL state?

I ended up working around this with a script I found from a user on the Proxmox forums, which runs as a cron job every 5 minutes to check the status. I also modified that script into a second one that runs once a night and emails me the status of the pool, just for good measure. Both are attached below. zfs.zip
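
(A minimal sketch of that kind of cron check, not the attached script itself, would be something along these lines, assuming the same 'mail' setup as in zed.rc:)

#!/bin/sh
# Minimal cron-based fallback: mail the output of 'zpool status -x' whenever
# any pool is not healthy. ADDRESS is a placeholder.
ADDRESS="myemail@fqdn.com"
STATUS="$(zpool status -x)"
if [ "${STATUS}" != "all pools are healthy" ]; then
    echo "${STATUS}" | mail -s "ZFS pool problem on $(hostname)" "${ADDRESS}"
fi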

Zixim commented 4 years ago

Ran into this today... nothing happening with that PR, @cbane?

exitcomestothis commented 4 years ago

Just an update on this.

Unfortunately I had a drive fail on me yesterday at around 11:45pm, and I was greeted with almost 3 dozen emails the following AM. I'm thankful to report that ZED reported that my pool had an issue. I replaced the drive, ZFS resilvered, and about 3 hours later everything's wonderful again. F@ck yeah @openzfs!!! Way to go!

I'd also like to thank WD (@westerndigital) for making a small exception to their RMA process, since this drive technically failed on the LAST day within warranty but I didn't notice it until the next day (unfortunately I do need sleep). This is why I have always used, and will continue to use, WD drives. Thanks again @westerndigital!!

My first email stated that the drive in question was "faulted" due to "too many errors". My server is a Dell R515 (Proxmox 6.1-3, 128 GB RAM) with an 8-bay hot-swap backplane. As soon as I removed the failed drive from the system, I received another email from ZED stating that the device had been faulted.

All in all, it seems that ZED is indeed working to a large degree, and this poor WD Gold drive is indeed toast.

Just wanted to share my experience with a real-world ZED failure alert.

Here's an output of the (many) failure emails:

NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Vol1   14.5T   663G  13.9T        -         -     4%     4%  1.00x  DEGRADED  -

  pool: Vol1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 0 days 00:45:51 with 0 errors on Sun Jul 12 01:09:52 2020
config:

    NAME                        STATE     READ WRITE CKSUM
    zfsTest                     DEGRADED     0     0     0
      raidz2-0                  DEGRADED     0     0     0
        wwn-0x50014ee266e22f93  ONLINE       0     0     0
        wwn-0x50014ee2b36b4fea  ONLINE       0     0     0
        wwn-0x50014ee0ae6dce56  ONLINE       0     0     0
        wwn-0x50014ee059175903  ONLINE       0     0     0
        wwn-0x50014ee2b367975e  ONLINE       0     0     0
        wwn-0x50014ee0595d1d74  FAULTED      9     3     0  too many errors
        wwn-0x50014ee059174070  ONLINE       0     0     0
        wwn-0x50014ee605b5455f  ONLINE       0     0     0

errors: No known data errors

Output of the fixed email:

ZFS has finished a resilver:

   eid: 109
 class: resilver_finish
  host: pve
  time: 2020-07-29 11:59:43-0700
  pool: Vol1
 state: DEGRADED
  scan: resilvered 62.0G in 0 days 01:00:26 with 0 errors on Wed Jul 29 11:59:43 2020
config:

    NAME                          STATE     READ WRITE CKSUM
    Vol1                          DEGRADED     0     0     0
      raidz2-0                    DEGRADED     0     0     0
        wwn-0x50014ee266e22f93    ONLINE       0     0     0
        wwn-0x50014ee2b36b4fea    ONLINE       0     0     0
        wwn-0x50014ee0ae6dce56    ONLINE       0     0     0
        wwn-0x50014ee059175903    ONLINE       0     0     0
        wwn-0x50014ee2b367975e    ONLINE       0     0     0
        replacing-5               DEGRADED     0     0     0
          wwn-0x50014ee0595d1d74  FAULTED      9     3     0  too many errors
          sdg                     ONLINE       0     0     0
        wwn-0x50014ee059174070    ONLINE       0     0     0
        wwn-0x50014ee605b5455f    ONLINE       0     0     0

errors: No known data errors

popcorn9499 commented 3 years ago

I came across this today after experiencing the same issue. I decided to poke around cbane's GitHub to see if they had it in a repo somewhere, and sure enough they did: https://github.com/cbane/zfs/commit/f4f16389413061ed0b670df1cbd17954518a3096 is the commit. It's pretty dead simple to add anyway.

Hopefully we'll see this added to the repo. I would submit the pull request myself, but it's not my code, so I really don't want to butt heads.

behlendorf commented 3 years ago

@cbane if you can open a PR with your fix I'd be happy to merge it.

brainrecall commented 3 years ago

I just experienced this last night: one of my Intel SSDs in a ZFS mirror disconnected. The drives are used and seem to have been hammered by the previous owner. The SSDs were already showing signs of issues; I've been watching their wearout indicators slowly tick up over the last couple of months, so I have replacements on the way. I was surprised to find the array degraded with no email warning, even though I do have emails configured and working. The Intel SSD just showed as unavailable: disconnected, not responding. That is actually a common failure mode for SSDs, and it absolutely should be a mode that triggers an email, since it degrades the pool.

For those interested, in my case I was able to manually pull the bad drive, plug it right back in, and it came back to life. I was able to offline/online the drive to trigger a resilver, and the pool went back to healthy with a clean scrub. Fingers crossed until the replacements come in.

I'm going to manually apply @cbane's patch, but I hope someone else picks this up if @cbane is too busy; this is critical in my opinion.

brainrecall commented 3 years ago

The more I think about this, the more I wonder: why is the statechange script checking only certain vdev states? At the very least, it should be reduced to checking for "not online" rather than specific states (for example, REMOVED is also not considered).

And I wonder why this is only checking vdev state. Arguably, this script should be checking pool status and sending emails based on the state of the pool, not the individual vdevs in the pool. Unless I'm ZFS-ignorant and certain configurations can have unhealthy drives but still a healthy pool...
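
(A "not online" style check along those lines might look something like this sketch; it is hypothetical, not the shipped zedlet:)

# Hypothetical variant of the zedlet filter: notify on any vdev state
# transition away from ONLINE instead of enumerating specific bad states.
case "${ZEVENT_VDEV_STATE_STR}" in
    ONLINE) exit 3 ;;   # healthy, nothing to report
    *) : ;;             # FAULTED, DEGRADED, REMOVED, UNAVAIL, ... fall through and notify
esac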

cholzer79 commented 3 years ago

I was tearing my hair out for the last 30 minutes trying to figure out why, on my test system, I would not get "degraded" e-mails, while I did get emails once I reconnected the disk.

I am shocked that this issue has existed since 2020 and has not been fixed yet...

exitcomestothis commented 3 years ago

I edited the statechange-notify.sh file by hand way back when I filed #10123; however, even after editing and testing today on a new machine, I still don't get any sort of notice, aside from a resilver-finish event. I'm running Proxmox 7.0-13, with zfs-2.0.6-pve1~bpo10+1 and zfs-kmod-2.0.5-pve1.

Adding the changes in #12630 doesn't result in any notifications for me.

If anyone would like to have me do further testing on this, I'm available to do so, as I have a multi-drive server available for testing, and I would really like to see this (critical IMO) bug squashed.

masipcat commented 3 years ago

@exitcomestothis I run Proxmox 7 with ZFS as well and found the same problem a couple of months ago. I reviewed my notes and noticed that, apart from applying https://github.com/cbane/zfs/commit/f4f16389413061ed0b670df1cbd17954518a3096, I changed these lines in /etc/zfs/zed.d/zed.rc:

-ZED_NOTIFY_VERBOSE=0
+ZED_NOTIFY_VERBOSE=1

-ZED_SYSLOG_SUBCLASS_INCLUDE="checksum|scrub_*|vdev.*"
+ZED_SYSLOG_SUBCLASS_INCLUDE="*"

cholzer79 commented 3 years ago

> @exitcomestothis I run Proxmox 7 with ZFS as well and found the same problem a couple of months ago. I reviewed my notes and I noticed that I changed these lines in /etc/zfs/zed.d/zed.rc:
>
> -ZED_NOTIFY_VERBOSE=0
> +ZED_NOTIFY_VERBOSE=1
>
> -ZED_SYSLOG_SUBCLASS_INCLUDE="checksum|scrub_*|vdev.*"
> +ZED_SYSLOG_SUBCLASS_INCLUDE="*"

With this in place I DON'T get a notification when a drive is disconnected.

Only when I also add UNAVAIL, as per this patch, does it work: https://github.com/cbane/zfs/commit/f4f16389413061ed0b670df1cbd17954518a3096

cholzer79 commented 3 years ago

> I edited the statechange-notify.sh file by hand way back when I filed #10123; however, even after editing and testing today on a new machine, I still don't get any sort of notice, aside from a resilver-finish event. I'm running Proxmox 7.0-13, with zfs-2.0.6-pve1~bpo10+1 and zfs-kmod-2.0.5-pve1.
>
> Adding the changes in #12630 doesn't result in any notifications for me.
>
> If anyone would like to have me do further testing on this, I'm available to do so, as I have a multi-drive server available for testing, and I would really like to see this (critical IMO) bug squashed.

I just did another test install and added these changes: https://github.com/cbane/zfs/commit/f4f16389413061ed0b670df1cbd17954518a3096. Now I do get a notification when I disconnect a drive. Without these edits I don't get a notification when a drive drops out.

masipcat commented 3 years ago

> I just did another test install and added these changes: cbane@f4f1638. Now I do get a notification when I disconnect a drive. Without these edits I don't get a notification when a drive drops out.

Yeah, I assumed this patch, https://github.com/cbane/zfs/commit/f4f16389413061ed0b670df1cbd17954518a3096, was already applied. I edited the original comment to clarify that.

OSHW-Rico commented 2 years ago

So it is not fixed in Proxmox 7.1?

I just tested it myself and I don't get mails for disconnected drives, but I do for resilvering...

Anyway, an awesome piece of software: I could disconnect and reconnect drives on the fly while moving big files, with no corruption in the end. But an alert would still be nice.

behlendorf commented 2 years ago

This fix was included as of the OpenZFS 2.1.3 release, in commit 487bb7762.
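
(To check whether a given install already carries the fix, comparing the installed version against 2.1.3 should be enough; a sketch:)

# The UNAVAIL notification fix shipped with OpenZFS 2.1.3, so anything older
# still needs the patched statechange-notify.sh.
zfs version
# On Debian/Proxmox-style systems the packaged versions can also be checked:
dpkg -l zfs-zed zfsutils-linux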

githubDiversity commented 1 year ago

I am running Proxmox 7.2.11 with zfs-zed/stable,now 2.1.6-pve1 amd64.

After I took the only disk out of a hot-swap bay in a single-disk pool, no email was sent. I'm not sure whether this is ZED-related or ZFS itself, though, as zpool status still listed the pool as ONLINE, including the disk I had just pulled. Either way, it makes me feel very uncomfortable.

Should I report this here, or over at the Proxmox community? They seem to be using their own flavor of the packages (zfs-initramfs/stable,now 2.1.6-pve1 all, zfsutils-linux/stable,now 2.1.6-pve1 amd64, which came installed with Proxmox); they all have a pve1 suffix on the package name.

diakritikus commented 1 year ago

This is either a resurrected bug with more lives than a cat, or people, myself included, keep misunderstanding the expected behavior of ZED alerts for degraded ZFS pools.

I'm also expecting, but not receiving, the degraded-pool email alert when I boot my system after removing a disk from a pool (which leaves it degraded), while I do get the alert after resilvering completes, as expected (well, most of the time; exceptions are below).

Here is a recent and thorough series of manual tests from the Proxmox community that wraps it up very well: https://forum.proxmox.com/threads/no-email-notification-for-zfs-status-degraded.87629/post-520096

Note that the resilvering-completed alert is also not always sent if the resilver was very quick. I've seen this on my system recently: a resilver that took 5s did not trigger an alert email, whereas longer resilvers send the email as expected.

I'm running proxmox-ve 7.3-1 (kernel 5.15.102-1-pve) and zfs-zed 2.1.9-pve1.