openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/

(master) zpool export -> panic & ONLINE cache vdevs not being used #195

Closed: rottegift closed this issue 10 years ago

rottegift commented 10 years ago

I noticed that the cache vdevs were ONLINE but nothing was being stored in them, so I exported two pools (one of which had only the root dataset mounted, the other of which had been imported with -N and had nothing mounted). Result:

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Mon Jun 16 18:48:16 2014
panic(cpu 2 caller 0xffffff7f8884d34f): "VERIFY3(" "0" " " "==" " " "dsl_prop_unregister(ds, zfs_prop_to_name(ZFS_PROP_CHECKSUM), checksum_changed_cb, os)" ") " "failed (" "0" " " "==" " " "91" ")\n"@dmu_objset.c:610
Backtrace (CPU 2), Frame : Return Address
0xffffff81fc403950 : 0xffffff8008222fa9 mach_kernel : _panic + 0xc9
0xffffff81fc4039d0 : 0xffffff7f8884d34f net.lundman.zfs : _dmu_objset_evict + 0xef
0xffffff81fc403a80 : 0xffffff7f88866b3a net.lundman.zfs : _dsl_dataset_evict + 0x4a
0xffffff81fc403ab0 : 0xffffff7f8883dac0 net.lundman.zfs : _dbuf_evict_user + 0x80
0xffffff81fc403ad0 : 0xffffff7f8883f156 net.lundman.zfs : _dbuf_rele_and_unlock + 0xf6
0xffffff81fc403b10 : 0xffffff7f8883c300 net.lundman.zfs : _dbuf_rele + 0x30
0xffffff81fc403b30 : 0xffffff7f8884a6a8 net.lundman.zfs : _dmu_buf_rele + 0x28
0xffffff81fc403b60 : 0xffffff7f88866c41 net.lundman.zfs : _dsl_dataset_rele + 0x21
0xffffff81fc403b80 : 0xffffff7f8884cead net.lundman.zfs : _dmu_objset_rele + 0x2d
0xffffff81fc403bb0 : 0xffffff7f888f119c net.lundman.zfs : _zfs_ioc_snapshot_list_next + 0x1ec
0xffffff81fc403c30 : 0xffffff7f888ecd23 net.lundman.zfs : _zfsdev_ioctl + 0x643
0xffffff81fc403cf0 : 0xffffff800840d63f mach_kernel : _spec_ioctl + 0x11f
0xffffff81fc403d40 : 0xffffff80083fe000 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff81fc403dc0 : 0xffffff80083f3e51 mach_kernel : _utf8_normalizestr + 0x971
0xffffff81fc403e10 : 0xffffff80085c1303 mach_kernel : _fo_ioctl + 0x43
0xffffff81fc403e40 : 0xffffff80085f2c66 mach_kernel : _ioctl + 0x466
0xffffff81fc403f50 : 0xffffff8008640653 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff81fc403fb0 : 0xffffff80082f3c56 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[C718A424-859D-3AFF-85CE-2222C6A5EA82]@0xffffff7f8882b000->0xffffff7f88a37fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f887ec000
            dependency: net.lundman.spl(1.0.0)[38C86468-875F-383E-942A-456BE7165109]@0xffffff7f8881a000
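
(For clarity, roughly the sequence described above, as a sketch; poolA and poolB are placeholder names, since the actual pools aren't named in the report:)

$ zpool import -N poolA    # placeholder: the pool imported with -N, nothing mounted
$ zpool import poolB       # placeholder: the pool with only the root dataset mounted
$ zpool export poolA
$ zpool export poolB       # one of these exports triggered the panic above
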
rottegift commented 10 years ago

Bah, I'm seeing this a lot.

The cache vdevs are found at manual import but no l2arc data is put into them even under use.

$ zpool import -N ssdpool
$ zpool iostat -v
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
ssdpool      171G  86.9G     35     52  94.1K   180K
  mirror     171G  86.9G     35     52  94.1K   180K
    disk16      -      -     26     32   194K   340K
    disk9       -      -     23     33   156K   340K
logs            -      -      -      -      -      -
  mirror     168K   936M      0      0      0      0
    disk17      -      -      7      1  81.0K  54.3K
    disk6       -      -      7      1  81.0K  54.3K
cache           -      -      -      -      -      -
  disk32s5      0  16.2G      0      0  8.11K  1.43K
  disk33s5      0  16.2G      0      0  8.12K  1.43K
----------  -----  -----  -----  -----  -----  -----
$ zpool status -v
  pool: ssdpool
 state: ONLINE
  scan: scrub canceled on Sun Jun 15 12:02:41 2014
config:

    NAME        STATE     READ WRITE CKSUM
    ssdpool     ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        disk16  ONLINE       0     0     0
        disk9   ONLINE       0     0     0
    logs
      mirror-1  ONLINE       0     0     0
        disk17  ONLINE       0     0     0
        disk6   ONLINE       0     0     0
    cache
      disk32s5  ONLINE       0     0     0
      disk33s5  ONLINE       0     0     0

errors: No known data errors
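
(A way to cross-check whether the L2ARC is actually being fed, beyond zpool iostat, is the ARC kstats. Assuming this O3X build exposes them via sysctl under kstat.zfs.misc.arcstats, the l2_* counters should be non-zero once the cache vdevs see use:)

$ sysctl kstat.zfs.misc.arcstats | grep -E 'l2_(size|hits|writes_done)'
# under normal operation l2_size grows and l2_writes_done is non-zero;
# here they presumably stay at zero while the cache vdevs sit ONLINE but unused
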
rottegift commented 10 years ago

The ONLINE-but-unused cache vdevs seem to occur when the pool was not cleanly exported and when an /etc/zfs/zpool.cache exists at all.

  1. auto imported pool is FAULTED (or DEGRADED) due to device renumbering
  2. zpool export pool / zpool import [-N] pool -> cache vdevs ONLINE but nothing stored
  3. any subsequent pool imported manually gets stuck with the same result (ONLINE cache vdevs, no actual l2arc storage)

This does not seem to happen when there is no zpool.cache. It also does not always happen, especially after clean exports.

Normally one would see a few hundred KiB or more in at least one of the cache vdevs even with zpool import -N pool.
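
(A minimal sketch of the workaround implied here, using the pool name and cache-file path from above: take zpool.cache out of the picture before importing, so the pool is found by scanning devices rather than from stale cache-file entries.)

$ zpool export ssdpool
$ mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bak
$ zpool import -N ssdpool
$ zpool iostat -v ssdpool    # cache vdevs should start showing allocations under load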

[master, tip of tree today]

rottegift commented 10 years ago

Panic on export, tip of tree today:

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jun 17 11:19:05 2014
panic(cpu 2 caller 0xffffff8021cdbf5e): Kernel trap at 0xffffff7fa2338f92, type 13=general protection, registers:
CR0: 0x000000008001003b, CR2: 0x000000010b12b000, CR3: 0x00000000241d0000, CR4: 0x00000000001606e0
RAX: 0x0000000000000000, RBX: 0xffffff8021c36e50, RCX: 0x65686e69246e6f69, RDX: 0xffffff8213ce2448
RSP: 0xffffff821119bb60, RBP: 0xffffff821119bbc0, RSI: 0x0000000000000001, RDI: 0xffffff7fa23c0d88
R8:  0xffffff822cba0035, R9:  0x0000000000000035, R10: 0x0000000000000037, R11: 0xffffff821119bb00
R12: 0xffffff804e549280, R13: 0xffffff804cf62140, R14: 0x0000000000000000, R15: 0xffffff804cf62140
RFL: 0x0000000000010246, RIP: 0xffffff7fa2338f92, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x000000010b12b000, Error code: 0x0000000000000000, Fault CPU: 0x2

Backtrace (CPU 2), Frame : Return Address
0xffffff8202075df0 : 0xffffff8021c22fa9 mach_kernel : _panic + 0xc9
0xffffff8202075e70 : 0xffffff8021cdbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff8202076040 : 0xffffff8021cf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff8202076060 : 0xffffff7fa2338f92 net.lundman.zfs : _avl_remove + 0xa2
0xffffff821119bbc0 : 0xffffff7fa22b6285 net.lundman.zfs : _unique_remove + 0x65
0xffffff821119bc00 : 0xffffff7fa2266a65 net.lundman.zfs : _dsl_dataset_evict + 0x25
0xffffff821119bc30 : 0xffffff7fa223da10 net.lundman.zfs : _dbuf_evict_user + 0x80
0xffffff821119bc50 : 0xffffff7fa223f0a6 net.lundman.zfs : _dbuf_rele_and_unlock + 0xf6
0xffffff821119bc90 : 0xffffff7fa223c250 net.lundman.zfs : _dbuf_rele + 0x30
0xffffff821119bcb0 : 0xffffff7fa224a5f8 net.lundman.zfs : _dmu_buf_rele + 0x28
0xffffff821119bce0 : 0xffffff7fa2266b91 net.lundman.zfs : _dsl_dataset_rele + 0x21
0xffffff821119bd00 : 0xffffff7fa2280b2a net.lundman.zfs : _dsl_scan_visitds + 0x49a
0xffffff821119bd70 : 0xffffff7fa2280031 net.lundman.zfs : _dsl_scan_visit + 0x461
0xffffff821119bdd0 : 0xffffff7fa227f1f2 net.lundman.zfs : _dsl_scan_sync + 0x5d2
0xffffff821119be30 : 0xffffff7fa22a4192 net.lundman.zfs : _spa_sync + 0x4b2
0xffffff821119bee0 : 0xffffff7fa22b4786 net.lundman.zfs : _txg_sync_thread + 0x3e6
0xffffff821119bfb0 : 0xffffff8021cd7127 mach_kernel : _call_continuation + 0x17
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[10E9E953-EA4B-3364-9581-F5E75E2DF966]@0xffffff7fa222b000->0xffffff7fa2437fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7fa21ec000
            dependency: net.lundman.spl(1.0.0)[63543165-E11D-384D-8456-F485EA93FC7E]@0xffffff7fa221a000
ilovezfs commented 10 years ago

I'd be curious if any of these issues you've been reporting resolve if you get rid of the cache devices.

rottegift commented 10 years ago

Perhaps, but I would also be giving up pool performance; the number of IOPS soaked up by the l2arc becomes large.

ilovezfs commented 10 years ago

Well, yes, production and debugging aren't the same.

rottegift commented 10 years ago

OK, that was your point; I'll play for a bit without the cache vdevs.

rottegift commented 10 years ago

Snag 0: zpool remove of the cache devs has to wait on a resilver; one of the ungraceful reboots was apparently unusually ungraceful. I can just remove the physical devices themselves easily enough, of course.
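
(Once the resilver finishes, the removal itself should just be the usual command; the device names below are the ssdpool cache vdevs from the iostat output earlier and are only illustrative.)

$ zpool status ssdpool                     # wait for the resilver to complete
$ zpool remove ssdpool disk32s5 disk33s5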

Snag 1: disk32s2 is in Donkey (imported), and "zpool import" says Trinity hates that (the same device shows up in Trinity's config below as UNAVAIL with corrupted data).

[master, zpool.cache existed at zfs subsystem start]

$ zpool status -v Donkey
  pool: Donkey
 state: ONLINE
  scan: resilvered 4.33M in 0h0m with 0 errors on Tue Jun 17 09:43:50 2014
config:

    NAME          STATE     READ WRITE CKSUM
    Donkey        ONLINE       0     0     0
      mirror-0    ONLINE       0     0     0
        disk32s2  ONLINE       0     0     0
        disk33s2  ONLINE       0     0     0
        disk31s2  ONLINE       0     0     0
    logs
      mirror-1    ONLINE       0     0     0
        disk9s1   ONLINE       0     0     0
        disk15s1  ONLINE       0     0     0

errors: No known data errors
   pool: Trinity
     id: 5933279091430968458
  state: DEGRADED
 status: One or more devices are missing from the system.
 action: The pool can be imported despite missing or damaged devices.  The
    fault tolerance of the pool may be compromised if imported.
   see: http://zfsonlinux.org/msg/ZFS-8000-2Q
 config:

    Trinity       DEGRADED
      mirror-0    ONLINE
        disk23    ONLINE
        disk25s2  ONLINE
      mirror-1    ONLINE
        disk20s2  ONLINE
        disk24s2  ONLINE
      mirror-2    DEGRADED
        disk22s2  ONLINE
        disk32s2  UNAVAIL  corrupted data
      mirror-3    ONLINE
        disk21s2  ONLINE
        disk19s2  ONLINE
    cache
      disk28s4
      disk27s4
    logs
      mirror-4    ONLINE
        disk14s1  ONLINE
        disk6s1   ONLINE
rottegift commented 10 years ago

("zpool import" when there is no zpool.cache at startup shows no conflicts, and importing the pools DTRT.)

ilovezfs commented 10 years ago

@rottegift So no panic now since removing the cache vdevs?

rottegift commented 10 years ago

Nope. I have done a fair number of exports (which triggered them before) while dealing with the other matter, but have not seen the panics above.

rottegift commented 10 years ago

The vdev-iokit branch seems to be immune to both of these problems, FYI.
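
(For anyone wanting to try the same thing: vdev-iokit is a branch of this repo. A rough sketch of getting the source; the usual O3X kext build and install steps are omitted here:)

$ git clone https://github.com/openzfsonosx/zfs.git
$ cd zfs
$ git checkout vdev-iokit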

rottegift commented 10 years ago

Closing to focus on the cache vdev problem.