openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Drives labeled by WWN inconsistently available at mount time? #1103

Closed rdlugosz closed 8 years ago

rdlugosz commented 11 years ago

I've been building out a new storage pool on Ubuntu 12.04 and have run into a peculiar issue that I suspect may be a race condition with drive/label availability at ZFS automount time.

Upon a reboot, running zpool status will often report my pool as degraded, with one or more of the drives listed as unavailable because the label is missing. Typically it's just one drive, but sometimes it is two. I've not seen more than two listed as unavailable (yet). There are no errors related to the drive(s) in dmesg, the logs, or the BIOS. Issuing a zpool export followed by an immediate zpool import resolves the problem and all drives are listed as available.
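
Concretely, the workaround each time is just the plain export/import cycle (pool name as below):

ryan@nas:~$ sudo zpool export keg
ryan@nas:~$ sudo zpool import keg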

A scrub shows no issues, large transfers don't cause any problems, and I've tried some of the obvious things like switching out SATA cables & ports. Nothing seems to make a difference.

My older pool does not exhibit this behavior at all, even if I run it on the same controller/ports as the new pool. The only major difference (beyond capacity) is that I created the new pool using the WWN label instead of the more typical identifier that includes the interface. Here's my create statement:

sudo zpool create -o ashift=12 -f keg raidz2 /dev/disk/by-id/wwn-0x5000c5004e8f4dd4 /dev/disk/by-id/wwn-0x5000c5004e8c845e /dev/disk/by-id/wwn-0x5000c5004e8eaea7 /dev/disk/by-id/wwn-0x5000c5004e8c76b3
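
For reference, the wwn-* and ata-* names are just different symlinks to the same disks under /dev/disk/by-id; something like this lists them side by side:

ryan@nas:~$ ls -l /dev/disk/by-id/ | grep -E 'wwn-|ata-'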

So, at this point I am relatively convinced that there's nothing wrong with the drives but rather there's some kind of timing issue related to when ZFS attempts to mount the pool and the existence of that drive's label in the system...

Any thoughts on how I can debug this? Any reason to suspect that the WWN names are at fault & I should switch them out for the more traditional disk identifier? (And, if I wanted to test that, do I just detach each drive one by one and attach them again with their alternate label, or do I need to use the replace command and have each one rebuilt?)

Here's some info on my current package versions:

ryan@nas:~$ dpkg -l | grep zfs
ii  libzfs1                                0.6.0.86-0ubuntu1~precise1              Native ZFS filesystem library for Linux
ii  mountall                               2.36.1-zfs1                             filesystem mounting tool
ii  ubuntu-zfs                             6~precise                               Native ZFS filesystem metapackage for Ubuntu.
ii  zfs-dkms                               0.6.0.86-0ubuntu1~precise1              Native ZFS filesystem kernel modules for Linux
ii  zfsutils                               0.6.0.86-0ubuntu1~precise1              Native ZFS management utilities for Linux
ryan@nas:~$ uname -a
Linux nas 3.2.0-33-generic #52-Ubuntu SMP Thu Oct 18 16:29:15 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

And some info on the pools:

ryan@nas:~$ sudo zpool status
  pool: keg
 state: ONLINE
 scan: scrub repaired 0 in 1h31m with 0 errors on Wed Nov 21 09:55:35 2012
config:

    NAME                        STATE     READ WRITE CKSUM
    keg                         ONLINE       0     0     0
      raidz2-0                  ONLINE       0     0     0
        wwn-0x5000c5004e8f4dd4  ONLINE       0     0     0
        wwn-0x5000c5004e8c845e  ONLINE       0     0     0
        wwn-0x5000c5004e8eaea7  ONLINE       0     0     0
        wwn-0x5000c5004e8c76b3  ONLINE       0     0     0

errors: No known data errors

  pool: tank
 state: ONLINE
 scan: scrub repaired 0 in 2h50m with 0 errors on Wed Nov 21 04:21:54 2012
config:

    NAME                                            STATE     READ WRITE CKSUM
    tank                                            ONLINE       0     0     0
      raidz1-0                                      ONLINE       0     0     0
        ata-Hitachi_HDT721010SLA360_STF605MH1K5EPW  ONLINE       0     0     0
        ata-ST31000528AS_9VP1YGNE                   ONLINE       0     0     0
        ata-WDC_WD6400AAKS-75A7B0_WD-WMASY2713477   ONLINE       0     0     0

errors: No known data errors
ryan@nas:~$ sudo zpool list 
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
keg   7.25T  2.01T  5.24T    27%  1.00x  ONLINE  -
tank  1.73T  1.50T   241G    86%  1.08x  ONLINE  -

Output of zdb -C: https://gist.github.com/3a6fc5c86fefd31ca2b8
Output of dmesg: https://gist.github.com/71ca347dacb641aa933a

dajhorn commented 11 years ago

Any thoughts on how I can debug this? Any reason to suspect that the WWN names are at fault & I should switch them out for the more traditional disk identifier? (And, if I wanted to test that, do I just detach each drive one by one and attach them again with their alternate label, or do I need to use the replace command and have each one rebuilt?)

Just do this to isolate for a race on the WWN names:

 # zpool export keg
 # zpool import -d /dev/disk/by-id keg

So, at this point I am relatively convinced that there's nothing wrong with the drives but rather there's some kind of timing issue related to when ZFS attempts to mount the pool and the existence of that drive's label in the system...

I can see three potential secondary problems:

rdlugosz commented 11 years ago

@dajhorn thank you, this is a great set of suggestions. I'll try them out after Thanksgiving & report back as to whether this clears up the issue.

rdlugosz commented 11 years ago

Unfortunately this did not resolve my issue. I did the following:

I did not address the driver loading issue as the only thing connected to this controller is a CDROM (I suspect you mean the pata_jmicron driver that loads just after the "freeing unused kernel memory" message).

I have a band-aid fix that may provide some additional clues: adding a sleep 30 to /etc/init/mountall.conf just prior to the exec mountall --daemon $force_fsck $fsck_fix line seems to resolve the issue. UPDATE: This appears to have been a coincidence and not a fix. The drives sometimes still fail to mount in spite of the delay.
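
For reference (and with the caveat from the update above that it didn't actually help), the edit amounts to roughly this, assuming the stock precise script stanza around the exec line:

    script
        # ... existing pre-start logic unchanged ...
        sleep 30    # crude delay to give the /dev/disk/by-id links time to appear
        exec mountall --daemon $force_fsck $fsck_fix
    end script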

That's fine for this server since it rarely reboots, but perhaps there's something that could/should be done in ZFS to accommodate a hardware configuration that misreports / doesn't report / differently reports a ready state for the drives? I would not recommend a patch specific to this hardware, but maybe there's something in the way that ZFS checks for this (if it checks it at all - maybe this is left to the kernel?) that could be made more conservative.

In case it's helpful: dmesg output: https://gist.github.com/309b79be66ef424e49e9 dmesg output w/ 30s mount delay: https://gist.github.com/912c82ccb5648a1f83b7

Finally, in case your Monday isn't off to a good start, I hope you'll find this screenshot of the Seagate firmware updater as entertaining as I did: https://dl.dropbox.com/u/3782483/seagate_success.jpg

Let me know if there's anything additional you'd like me to try out. If you're not interested in pursuing this further we can close the issue.

dajhorn commented 11 years ago

I have a band-aid fix that may provide some additional clues: adding a sleep 30 to /etc/init/mountall.conf just prior to the exec mountall --daemon $force_fsck $fsck_fix line seems to resolve the issue.

That's fine for this server since it rarely reboots, but perhaps there's something that could/should be done in ZFS to accommodate a hardware configuration that misreports / doesn't report / differently reports a ready state for the drives? I would not recommend a patch specific to this hardware, but maybe there's something in the way that ZFS checks for this (if it checks it at all - maybe this is left to the kernel?) that could be made more conservative.

This is ticket #330.

In case it's helpful: dmesg output: https://gist.github.com/309b79be66ef424e49e9 dmesg output w/ 30s mount delay: https://gist.github.com/912c82ccb5648a1f83b7

It looks like the ZFS driver is loading prior to the /etc/init/mountall.conf entry, which does the rw remount. ZFS must not be loaded by /etc/modules or /etc/initramfs-tools/modules, and /sbin/zfs must not be invoked before /sbin/mountall.

Rather, ZoL currently depends on "fixed disk" behavior to reliably mount ZFS datasets in a way that integrates with upstart. This inconvenience should be resolved when the solution described in ticket #330 is implemented.
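
A quick way to check for early module loading (both commands should print nothing on a clean setup; the second assumes Ubuntu's lsinitramfs tool is available):

 # grep -i zfs /etc/modules /etc/initramfs-tools/modules
 # lsinitramfs /boot/initrd.img-$(uname -r) | grep -i zfs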

Finally, in case your Monday isn't off to a good start, I hope you'll find this screenshot of the Seagate firmware updater as entertaining as I did: https://dl.dropbox.com/u/3782483/seagate_success.jpg

Heh, but I appreciate why they did that. The earlier "Operation Complete. Not Error!" kind of result probably wasn't communicating the desired message.

Let me know if there's anything additional you'd like me to try out. If you're not interested in pursuing this further we can close the issue.

I bought some ST2000DM001 disks over the weekend, and I already have a Gigabyte motherboard. By coincidence, I will soon have a very similar computer on my bench to double-check this hardware configuration.

rdlugosz commented 11 years ago

It looks like the ZFS driver is loading prior to the /etc/init/mountall.conf entry, which does the rw remount. ZFS must not be loaded by /etc/modules or /etc/initramfs-tools/modules, and /sbin/zfs must not be invoked before /sbin/mountall.

This is curious if it isn't the default behavior, since this is a fresh install of Ubuntu 12.04 and the ubuntu-zfs ZoL package (here's a full package list). I've verified that there's no zfs entry in either of those modules files. Here's the contents of mountall.conf, which doesn't seem to prescribe the order of things (though I'm ignorant of the fine details of how the mountall process works).

The only other thing I can think of that may impact loading order is that I had to load the LVM package in order to access my old root volume. This isn't ZFS related, but perhaps it doesn't play nicely with the order of things in mountall?

dajhorn commented 11 years ago

(Github swallowed that earlier half-post because I pushed my tab key at the wrong time.)

Double-check that the primary ATA controller is in AHCI mode (not Legacy or RAID mode) and that none of the ports are in ESP mode.
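
A rough way to see what mode the kernel actually detected, without rebooting into the BIOS (exact strings vary by controller and kernel):

 # dmesg | grep -i ahci
 # lspci -nn | grep -i sata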

Also try removing the JMicron card. Some motherboards in this family have brain damage that can't be fixed in software. The one that I have behaves unreliably if a SiL or Promise controller is plugged in (e.g. beep codes, randomly disappearing drives, etc.).

The default Upstart configuration is sensible, but not easy to understand at a glance. The problem is not in the configuration. If the ZFS modules are not being loaded early, then my guess is that the disks are not ready when the udevsettle command returns before mountall is called.
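
One way to test that guess by hand after a bad boot, instead of the plain export/import cycle (a sketch; pool name as above):

 # udevadm trigger --subsystem-match=block
 # udevadm settle --timeout=30
 # zpool export keg
 # zpool import -d /dev/disk/by-id keg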

dajhorn commented 11 years ago

FYI, I could not reproduce with my new equipment.

rdlugosz commented 11 years ago

Interesting... thanks for the follow-up.

dajhorn commented 11 years ago

@thatRD: Since you opened this ticket, a patch was submitted that has a small chance of resolving this issue for you. Be sure to enable the daily PPA so that you'll get the update, probably late next week.

NB: zfsonlinux/zfs#1167, dajhorn/pkg-zfs#70, dajhorn/pkg-zfs#66.

rdlugosz commented 11 years ago

Cool, I'll keep an eye out for it.

FransUrbo commented 10 years ago

@rdlugosz This issue is quite old (a year and a half), is it still a problem or can you close it?

rdlugosz commented 10 years ago

It is old, but I do still have the issue. I restart the server pretty infrequently, but most times when I do, at least one or two drives in my pool will be marked as "unavailable". The fix is to export and then immediately import the pool & magically all drives appear.

Ultimately this may need to be closed as "unable to reproduce", but there is definitely a real problem somewhere.

dswartz commented 10 years ago

It is old, but I do still have the issue. I restart the server pretty infrequently, but most times when I do, at least one or two drives in my pool will be marked as "unavailable". The fix is to export and then immediately import the pool & magically all drives appear.

Ultimately this may need to be closed as "unable to reproduce", but there is definitely a real problem somewhere.

I don't know if it's the issue I used to have, but I was told that some HBAs enumerate drives slowly, and if ZFS loads before udev has created all the device links, you are screwed. Incomplete pools, missing zvols, unshared datasets, etc., were all symptoms of this for me.
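
If that's what's happening here, a crude way to check is to wait for the expected by-id links before letting anything import the pool, something like this (a sketch using the WWNs from the original report):

 for d in wwn-0x5000c5004e8f4dd4 wwn-0x5000c5004e8c845e wwn-0x5000c5004e8eaea7 wwn-0x5000c5004e8c76b3; do
     while [ ! -e "/dev/disk/by-id/$d" ]; do sleep 1; done    # block until udev has created the link
 done
 zpool import -d /dev/disk/by-id keg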

behlendorf commented 8 years ago

Closing. Several patches have been merged which should have improved this.