openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/

vdev_iokit branch #178

Closed: lundman closed this issue 4 years ago

lundman commented 10 years ago

Thanks for trying out the iokit replacement. It does indeed appear to function well, and gets an almost identical benchmark score to master.

https://github.com/openzfsonosx/zfs/blob/vdev-iokit/module/zfs/vdev_iokit.c#L88

Better watch out with those comments: you just took out the clearing on line 93. But since it's the free function, it doesn't matter ;)

https://github.com/openzfsonosx/zfs/blob/vdev-iokit/module/zfs/vdev_iokit.c#L340 https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/fs/zfs/vdev_disk.c#L416

Is the plan to extend it to attempt open_by_guid as well? I see you implemented something in vdev_iokit_util.cpp; is adding that ability part of your roadmap? If it can handle the device /dev/diskX moving, that would make it quite attractive.

:)
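For context, a minimal sketch of the /dev/diskX renumbering problem that open-by-guid would address (pool and device names here are invented, not from this thread):

# create a pool on whatever BSD names the disks have right now
sudo zpool create demo mirror disk2 disk3
sudo zpool export demo

# unplug/replug, change ports, or boot with other devices attached first,
# and the same media can come back as disk5 and disk6; a path-based open
# then looks for the stale /dev/disk2 and /dev/disk3 nodes, while a
# guid-based open walks the attached IOMedia objects, reads each vdev
# label, and matches the pool/vdev GUIDs regardless of current numbering
sudo zpool import demo    # userland import rescans devices, so it still works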

rottegift commented 10 years ago

By comparison, booting with no /etc/zfs/zpool.cache and then manually importing these three pools works normally (all vdevs found, all zfs datasets that are automountable get mounted).

evansus commented 10 years ago

https://gist.github.com/rottegift/ff93ec481246461a791a#comment-1246319

evansus commented 10 years ago

@rottegift Actually I believe there is a bug causing this after all :)

I scripted creating ten 128MB zvols and created a pool with this layout:

zpool create -o ashift=12 -o failmode=continue -O compression=lz4 \
-O canmount=off -O mountpoint=none tank \
mirror disk2 disk3 \
log mirror disk4 disk5 \
cache disk6 disk7 \
spare disk8 disk9

And here's the result of zpool status immediately after:

  pool: tank
 state: ONLINE
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://zfsonlinux.org/msg/ZFS-8000-2Q
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            disk2   ONLINE       0     0     0
            disk3   ONLINE       0     0     0
        logs
          mirror-1  ONLINE       0     0     0
            disk4   ONLINE       0     0     0
            disk5   ONLINE       0     0     0
        cache
          disk6     UNAVAIL      0     0     0  cannot open
          disk7     UNAVAIL      0     0     0  cannot open
        spares
          disk8     UNAVAIL   cannot open
          disk9     UNAVAIL   cannot open

errors: No known data errors

Check out https://gist.github.com/evansus/e4391d2d955d1ca22098 for the full script with substitutions. The function zvol_to_disk translates pool/hotspare1 to disk2, etc.
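For reference, a minimal sketch of the kind of setup the script performs (the real script is in the gist above; the zvol symlink path used by this zvol_to_disk is an assumption):

# create 10 small zvols on an existing pool to use as fake disks
for i in $(seq 1 10); do
    sudo zfs create -V 128m pool/hotspare$i
done

# translate a zvol name to its current /dev/diskX node via the symlink
# that zed creates (path assumed)
zvol_to_disk() {
    basename "$(readlink "/var/run/zfs/zvol/dsk/$1")"
}

zvol_to_disk pool/hotspare1    # e.g. prints disk2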

evansus commented 10 years ago

I'm implementing a couple of fixes and decided to test this hotspares.sh script on master. It stalled/deadlocked during zpool create on master, but on vdev-iokit it partially succeeds, as above.

spindump excerpt: https://gist.github.com/evansus/f3881cfe2cf404832420

rottegift commented 10 years ago

With ToT (a59b8e052c69908744fe6d421e106acf199a89fd) I got a quick panic:

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Sun Jun 15 23:24:20 2014
panic(cpu 1 caller 0xffffff8006edbf5e): Kernel trap at 0xffffff7f8753dfe0, type 14=page fault, registers:
CR0: 0x000000008001003b, CR2: 0x0000000000000000, CR3: 0x000000002bd6b008, CR4: 0x00000000001606e0
RAX: 0x0000000000000010, RBX: 0xffffff7f87404820, RCX: 0x000000000003eb74, RDX: 0xffffff81f2bd30e8
RSP: 0xffffff81f2bd30b0, RBP: 0xffffff81f2bd30e0, RSI: 0x0000000000000000, RDI: 0xffffff81f635cab8
R8:  0x0000000000000001, R9:  0xffffff81fcd301a0, R10: 0x00000000000000b3, R11: 0x0000000000000020
R12: 0xffffff80278b1f00, R13: 0xffffff80278b1f00, R14: 0xffffff81fb96ee08, R15: 0xffffff802cd775c0
RFL: 0x0000000000010282, RIP: 0xffffff7f8753dfe0, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x0000000000000000, Error code: 0x0000000000000000, Fault CPU: 0x1

Backtrace (CPU 1), Frame : Return Address
0xffffff81f2bd2d40 : 0xffffff8006e22fa9 mach_kernel : _panic + 0xc9
0xffffff81f2bd2dc0 : 0xffffff8006edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff81f2bd2f90 : 0xffffff8006ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff81f2bd2fb0 : 0xffffff7f8753dfe0 net.lundman.zfs : _nv_mem_zalloc + 0x20
0xffffff81f2bd30e0 : 0xffffff7f87542474 net.lundman.zfs : _nvp_buf_alloc + 0x44
0xffffff81f2bd3130 : 0xffffff7f87542298 net.lundman.zfs : _nvs_decode_pairs + 0x78
0xffffff81f2bd3170 : 0xffffff7f87542106 net.lundman.zfs : _nvs_operation + 0xc6
0xffffff81f2bd31b0 : 0xffffff7f87543ccb net.lundman.zfs : _nvs_embedded + 0xcb
0xffffff81f2bd31f0 : 0xffffff7f8754318d net.lundman.zfs : _nvs_xdr_nvp_op + 0x1fd
0xffffff81f2bd3270 : 0xffffff7f875422ca net.lundman.zfs : _nvs_decode_pairs + 0xaa
0xffffff81f2bd32b0 : 0xffffff7f87542106 net.lundman.zfs : _nvs_operation + 0xc6
0xffffff81f2bd32f0 : 0xffffff7f87541f15 net.lundman.zfs : _nvs_xdr + 0x85
0xffffff81f2bd3360 : 0xffffff7f87541a9b net.lundman.zfs : _nvlist_common + 0x1eb
0xffffff81f2bd33f0 : 0xffffff7f87541db9 net.lundman.zfs : _nvlist_xunpack + 0x79
0xffffff81f2bd3440 : 0xffffff7f87541d2d net.lundman.zfs : _nvlist_unpack + 0x4d
0xffffff81f2bd3490 : 0xffffff7f874bffbd net.lundman.zfs : _vdev_iokit_read_label + 0x1ed
0xffffff81f2bd3510 : 0xffffff7f874c09a7 net.lundman.zfs : _vdev_iokit_find_by_guid + 0x107
0xffffff81f2bd3580 : 0xffffff7f874c1134 net.lundman.zfs : _vdev_iokit_open_by_guid + 0x24
0xffffff81f2bd35b0 : 0xffffff7f874bf803 net.lundman.zfs : _vdev_iokit_open + 0x383
0xffffff81f2bd3630 : 0xffffff7f874b8d7b net.lundman.zfs : _vdev_open + 0x12b
0xffffff81f2bd36b0 : 0xffffff7f874b8b1b net.lundman.zfs : _vdev_open_children + 0x5b
0xffffff81f2bd36e0 : 0xffffff7f874c522f net.lundman.zfs : _vdev_mirror_open + 0x5f
0xffffff81f2bd3750 : 0xffffff7f874b8d7b net.lundman.zfs : _vdev_open + 0x12b
0xffffff81f2bd37d0 : 0xffffff7f874b8b1b net.lundman.zfs : _vdev_open_children + 0x5b
0xffffff81f2bd3800 : 0xffffff7f874cd1bf net.lundman.zfs : _vdev_root_open + 0x5f
0xffffff81f2bd3850 : 0xffffff7f874b8d7b net.lundman.zfs : _vdev_open + 0x12b
0xffffff81f2bd38d0 : 0xffffff7f874a5b8b net.lundman.zfs : _spa_load_impl + 0x1ab
0xffffff81f2bd3a10 : 0xffffff7f874a010b net.lundman.zfs : _spa_load + 0x1fb
0xffffff81f2bd3a80 : 0xffffff7f8749f7d8 net.lundman.zfs : _spa_load_best + 0x98
0xffffff81f2bd3af0 : 0xffffff7f8749bc0a net.lundman.zfs : _spa_open_common + 0x19a
0xffffff81f2bd3b80 : 0xffffff7f8749c014 net.lundman.zfs : _spa_get_stats + 0x64
0xffffff81f2bd3c00 : 0xffffff7f874f2457 net.lundman.zfs : _zfs_ioc_pool_stats + 0x37
0xffffff81f2bd3c30 : 0xffffff7f874ef163 net.lundman.zfs : _zfsdev_ioctl + 0x643
0xffffff81f2bd3cf0 : 0xffffff800700d63f mach_kernel : _spec_ioctl + 0x11f
0xffffff81f2bd3d40 : 0xffffff8006ffe000 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff81f2bd3dc0 : 0xffffff8006ff3e51 mach_kernel : _utf8_normalizestr + 0x971
0xffffff81f2bd3e10 : 0xffffff80071c1303 mach_kernel : _fo_ioctl + 0x43
0xffffff81f2bd3e40 : 0xffffff80071f2c66 mach_kernel : _ioctl + 0x466
0xffffff81f2bd3f50 : 0xffffff8007240653 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff81f2bd3fb0 : 0xffffff8006ef3c56 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[699A434F-195A-3E8C-A8B7-51F237BEB091]@0xffffff7f8742b000->0xffffff7f8763cfff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f873ec000
            dependency: net.lundman.spl(1.0.0)[38C86468-875F-383E-942A-456BE7165109]@0xffffff7f8741a000

evansus commented 10 years ago

With 0c6e5cc431e7e314037e3862e06916b790467d70, issues with cache and spares should be resolved. Waiting to get feedback from @rottegift. Did some cstyle cleanups in b0fad51d1f7b0705093a54ddbafd648d73912c94 and 1565e74aa4cf1351901338c8a05027e0f38b9d1f as well.

evansus commented 10 years ago

Testing using hotspares.sh -k

Tested cachefile import and manual import: with all devices available, with all devices renumbered, and with one cache device missing. I haven't experienced any issues so far.
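
A rough manual equivalent of those checks, as a sketch (the pool is the one from the earlier hotspares layout; hiding a cache device by renaming its backing zvol is an assumption, not necessarily what hotspares.sh -k does):

# renumbered devices: export, let the backing devices re-attach in a
# different order (or reboot), then confirm everything is found again
sudo zpool export tank
sudo zpool import tank
zpool status -v tank

# one cache device missing: hide a backing zvol before importing and
# check that only that cache vdev comes up UNAVAIL
sudo zpool export tank
sudo zfs rename pool/hotspare5 pool/hotspare5.hidden    # assumed to back a cache device
sudo zpool import tank
zpool status -v tank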

rottegift commented 10 years ago

First boot, only ssdpool in zpool.cache after graceful shutdown.

The pool does not autoimport on its own, as before (and that's different from master, which auto-imports before any zpool/zfs command is issued).

I started with "zpool list" - it quickly imported ssdpool with all the vdevs (including the cache); the cache vdevs immediately got serviced by the feed thread. The canmount=yes datasets were not mounted when the "zpool list" triggered the auto import; I had to issue a "zfs mount -a".

So far pretty good. I'm going to try with the other pools in zpool.cache with graceful shutdowns and panics and then throw some heavier load at it.

rottegift commented 10 years ago
  1. import the three usual suspect pools
  2. clean restart
  3. get to Terminal prompt, no auto-import (even after a couple of minutes)
  4. at Terminal prompt "zfs list"
  5. See all the expected canmount=on datasets after a few moments.
  6. Check everything's imported and running properly. It is. Cool.

At this point I'm already tempted to replace "/usr/local/libexec/zfs/launchd.d/zpool-autoimport.s" with something not much more complicated than "zfs list || complain" in org.openzfsonosx.zpool-autoimport.plist, and load -wF that.
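
A minimal sketch of that simpler wrapper, assuming the existing org.openzfsonosx.zpool-autoimport.plist is just pointed at it (the wrapper contents, log tag, and plist location are assumptions, not from this thread):

#!/bin/sh
# "zfs list || complain": any zfs/zpool command loads the kext, which in
# turn imports the pools from the cachefile; log a complaint if that fails
zfs list > /dev/null 2>&1 || \
    logger -t zpool-autoimport "zfs list failed; pools not imported"

Something like "sudo launchctl load -wF /Library/LaunchDaemons/org.openzfsonosx.zpool-autoimport.plist" would then pick it up at boot.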

Next up: deliberate panics of a couple types and ugly shutdowns and regular use. If anything goes wrong, I'll follow up here. However, I think this is already a vast improvement over previous days and over master.

Correction: step 5 was not really the case. Sorry.

evansus commented 10 years ago

Glad to hear it's working well!

Strange that it won't autoimport / automount. I'm looking over the comparison between master and vdev-iokit: https://github.com/openzfsonosx/zfs/compare/vdev-iokit?expand=1

I'm not seeing anything obvious that would interfere with autoimport. It's possible that the IOMedia devices aren't ready at the time the kext loads. That shouldn't be an issue, but in your setup there are quite a lot of disks to bring online. We could add/enable some debug printing to see whether the autoimport is being called but is perhaps unable to succeed during that phase of booting.

I did experience that symptom when:

Actually though, it sounds to me like the kext isn't even loaded after the boot is complete - manually running zpool or zfs tries to load the kext, which in turn will autoimport the pools from the cachefile. Could you try a fresh reboot followed by kextstat | grep lund, before running zpool or zfs?

rottegift commented 10 years ago
  1. https://github.com/openzfsonosx/zfs/issues/197 is my current favourite low-privilege panic attack. It unfailingly results in a l-o-n-g grey screen and often an fsck_hfs on the startup volume.
  2. System comes up after panic, do a "zfs list", see the listing of all the datasets.
  3. But no datasets are mounted. (5 above was a mistake).
  4. "zfs mount -a" DTRT. "zpool {status,iostat} -v" shows everything's OK.

So I'd say this works.

rottegift commented 10 years ago

"Actually though, it sounds to me like the kext isn't even loaded after the boot is complete"

No, I don't think so, but I'll do the kextstat in a moment.

WebKit's in there because of the "zed" in "initialized", but it gives you an idea. It came up because it was open when I caused the panic. The first zed message was triggered by the "zfs list".

Jun 17 22:35:21 cla kernel[0]: ZFS: Loaded module v0.6.3-1, ZFS pool version 5000, ZFS filesystem version 5
Jun 17 22:35:21 cla kernel[0]: ZFS filesystem version: 5
Jun 17 22:35:21 cla kernel[0]: ZFS: hostid set to 59e8b54b from UUID '001E74DF-37A5-5E8A-8591-CC0948845E3E'
Jun 17 22:35:23 cla.use.net SmartwareServiceApp[202]: file names: ((.+\.)((vmwarevm)|(vmdk)|(vmem)|(mem)|(vhd)|(hdd)))|(((\.DS_Store)|(\.localized)|(\.Trash)))
Jun 17 22:35:25 cla.use.net apsd[191]: Unrecognized leaf certificate
Jun 17 22:35:34 cla.use.net SafariForWebKitDevelopment[568]: WebKit r170053 initialized.
Jun 17 22:36:03 cla.use.net zed[1038]: eid=1 class=statechange 
Jun 17 22:36:04 cla.use.net zed[1040]: eid=2 class=statechange 
Jun 17 22:36:05 cla.use.net zed[1047]: eid=3 class=statechange 
Jun 17 22:36:05 cla.use.net zed[1053]: eid=4 class=statechange 
Jun 17 22:36:06 cla.use.net zed[1056]: eid=5 class=statechange 
Jun 17 22:36:06 cla.use.net zed[1058]: eid=6 class=statechange 
Jun 17 22:36:06 cla.use.net zed[1064]: eid=7 class=statechange 
Jun 17 22:36:06 cla.use.net zed[1066]: eid=8 class=statechange 
Jun 17 22:36:07 cla.use.net zed[1068]: eid=9 class=statechange 
Jun 17 22:36:07 cla.use.net zed[1070]: eid=10 class=statechange 
Jun 17 22:36:08 cla.use.net zed[1076]: eid=11 class=statechange 
Jun 17 22:36:08 cla.use.net zed[1078]: eid=12 class=statechange 
Jun 17 22:36:08 cla.use.net zed[1080]: eid=13 class=statechange 
Jun 17 22:36:13 cla.use.net zed[1108]: eid=14 class=zvol.create pool=Donkey
...
rottegift commented 10 years ago

Rebooted, did:

$ sudo kextstat | grep lund
Password:
   98    1 0xffffff7f80819000 0x21c      0x21c      net.lundman.kernel.dependencies (10.0.0)
   99    1 0xffffff7f8081a000 0x11000    0x11000    net.lundman.spl (1.0.0) <98 7 5 4 3 1>
  100    0 0xffffff7f8082b000 0x212000   0x212000   net.lundman.zfs (1.0.0) <99 15 7 5 4 3 1>

Then I did "zfs mount -a", which hung after getting through several zed events:

Jun 17 23:12:41 cla.use.net zed[1126]: eid=1 class=statechange 
Jun 17 23:12:41 cla.use.net zed[1128]: eid=2 class=statechange 
Jun 17 23:12:42 cla.use.net zed[1134]: eid=3 class=statechange 
Jun 17 23:12:50 cla.use.net zed[1175]: eid=4 class=statechange 
Jun 17 23:12:50 cla.use.net zed[1181]: eid=5 class=statechange 
Jun 17 23:12:51 cla.use.net zed[1183]: eid=6 class=statechange 
Jun 17 23:12:51 cla.use.net zed[1185]: eid=7 class=statechange 
Jun 17 23:12:51 cla.use.net zed[1187]: eid=8 class=statechange 
Jun 17 23:12:51 cla.use.net zed[1196]: eid=9 class=statechange 
Jun 17 23:12:52 cla.use.net zed[1199]: eid=10 class=statechange 
Jun 17 23:12:52 cla.use.net zed[1202]: eid=11 class=statechange 
Jun 17 23:12:52 cla.use.net zed[1208]: eid=12 class=vdev.open_failed pool=Donkey
Jun 17 23:12:53 cla.use.net zed[1211]: eid=13 class=vdev.open_failed pool=Donkey
Jun 17 23:12:57 cla.use.net zed[1239]: eid=14 class=zvol.create pool=Donkey
Jun 17 23:12:57 cla.use.net zed[1247]: eid=14 class=zvol.create pool=Donkey/TM symlinked disk33
Jun 17 23:12:57 cla.use.net zed[1257]: eid=15 class=zvol.create pool=Donkey
Jun 17 23:12:57 cla.use.net zed[1268]: eid=15 class=zvol.create pool=Donkey/Caching symlinked disk34
Jun 17 23:12:57 cla.use.net zed[1278]: eid=16 class=zvol.create pool=Donkey
Jun 17 23:12:57 cla.use.net zed[1293]: eid=16 class=zvol.create pool=Donkey/TMMIS symlinked disk35
Jun 17 23:12:58 cla.use.net zed[1300]: eid=17 class=statechange 
Jun 17 23:12:59 cla.use.net zed[1343]: eid=18 class=statechange 
Jun 17 23:12:59 cla.use.net zed[1345]: eid=19 class=statechange 
Jun 17 23:12:59 cla.use.net zed[1350]: eid=20 class=statechange 
Jun 17 23:12:59 cla.use.net zed[1355]: eid=21 class=statechange 
Jun 17 23:13:00 cla.use.net zed[1362]: eid=22 class=statechange 
Jun 17 23:13:01 cla.use.net zed[1369]: eid=23 class=statechange 
Jun 17 23:13:01 cla.use.net zed[1374]: eid=24 class=statechange 
Jun 17 23:13:01 cla.use.net zed[1376]: eid=25 class=statechange 
Jun 17 23:13:02 cla.use.net zed[1384]: eid=26 class=statechange 
Jun 17 23:13:03 cla.use.net zed[1391]: eid=27 class=statechange 
Jun 17 23:13:03 cla.use.net zed[1394]: eid=28 class=statechange 
Jun 17 23:13:03 cla.use.net zed[1396]: eid=29 class=statechange 
Jun 17 23:13:03 cla.use.net zed[1398]: eid=30 class=statechange 
Jun 17 23:13:03 cla.use.net zed[1400]: eid=31 class=statechange 
Jun 17 23:13:04 cla.use.net zed[1407]: eid=32 class=statechange 
Jun 17 23:13:04 cla.use.net zed[1409]: eid=33 class=statechange 
Jun 17 23:13:04 cla.use.net zed[1411]: eid=34 class=statechange 
Jun 17 23:13:04 cla.use.net zed[1413]: eid=35 class=statechange 
Jun 17 23:13:04 cla.use.net zed[1418]: eid=36 class=statechange 

One zvol is mounted

/dev/disk33s2 on /Volumes/DONKEYTM (hfs, local, journaled)

but not the other two, and no zfs datasets at all.

spindump:

https://gist.github.com/rottegift/668a6efd270deec52c90

rottegift commented 10 years ago
  1. sudo kextstat | grep lund [works as expected]
  2. zpool status -v [works as now expected - everything's there]
  3. zfs mount [returns nothing]
  4. zfs mount -a

BOOM.

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jun 17 23:27:18 2014
panic(cpu 0 caller 0xffffff802b8dbf5e): Kernel trap at 0xffffff7fabf2a9f4, type 14=page fault, registers:
CR0: 0x000000008001003b, CR2: 0xfffffffffffffff0, CR3: 0x000000002ddd5000, CR4: 0x00000000001606e0
RAX: 0xfffffffffffffff0, RBX: 0xffffff821ae46938, RCX: 0xffffff821af1b870, RDX: 0xfffffffffffffff0
RSP: 0xffffff822fc5bda0, RBP: 0xffffff822fc5bdd0, RSI: 0xffffff824acbb948, RDI: 0xffffff821af1b970
R8:  0x00000000ffff0000, R9:  0x00000000ffff0000, R10: 0xffffff82161e9080, R11: 0xffffff802be89ce8
R12: 0xffffff821ae46978, R13: 0x0000000000009ce9, R14: 0xffffff821ae46958, R15: 0xffffff82323afbc8
RFL: 0x0000000000010286, RIP: 0xffffff7fabf2a9f4, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0xfffffffffffffff0, Error code: 0x0000000000000000, Fault CPU: 0x0

Backtrace (CPU 0), Frame : Return Address
0xffffff822fc5ba30 : 0xffffff802b822fa9 mach_kernel : _panic + 0xc9
0xffffff822fc5bab0 : 0xffffff802b8dbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff822fc5bc80 : 0xffffff802b8f3456 mach_kernel : _return_from_trap + 0xe6
0xffffff822fc5bca0 : 0xffffff7fabf2a9f4 net.lundman.zfs : _zio_walk_parents + 0x94
0xffffff822fc5bdd0 : 0xffffff7fabf31673 net.lundman.zfs : _zio_done + 0x1073
0xffffff822fc5bef0 : 0xffffff7fabf2cb8a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff822fc5bf30 : 0xffffff7fabf2ca55 net.lundman.zfs : _zio_execute + 0x15
0xffffff822fc5bf50 : 0xffffff7fabe1e217 net.lundman.spl : _taskq_thread + 0xc7
0xffffff822fc5bfb0 : 0xffffff802b8d7127 mach_kernel : _call_continuation + 0x17
      Kernel Extensions in backtrace:
         net.lundman.spl(1.0)[63543165-E11D-384D-8456-F485EA93FC7E]@0xffffff7fabe1a000->0xffffff7fabe2afff
         net.lundman.zfs(1.0)[0C7B3320-A055-3936-B86A-8C6E3EC2CB6D]@0xffffff7fabe2b000->0xffffff7fac03cfff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7fabdec000
            dependency: net.lundman.spl(1.0.0)[63543165-E11D-384D-8456-F485EA93FC7E]@0xffffff7fabe1a000

rottegift commented 10 years ago

Then:

[no kextstat]

  1. zpool status -v [OK]
  2. zfs mount [returns nothing]
  3. zfs mount -a

and everything's fine.

rottegift commented 10 years ago

Panic under load. Yikes!

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Tue Jun 17 23:49:12 2014
panic(cpu 6 caller 0xffffff802badbf5e): Kernel trap at 0xffffff7fac01c9f8, type 14=page fault, registers:
CR0: 0x0000000080010033, CR2: 0x0000000100000008, CR3: 0x0000000222bb5159, CR4: 0x00000000001606e0
RAX: 0x0000000000000000, RBX: 0xffffff805ca3c4b0, RCX: 0x0000000100000000, RDX: 0xffffff835bfd5d68
RSP: 0xffffff821dafb8f0, RBP: 0xffffff821dafb8f0, RSI: 0xffffff835bfd5fc8, RDI: 0xffffff835bfd5d58
R8:  0x0000000000000000, R9:  0xffffff802c001910, R10: 0x00000000000003ff, R11: 0xffffffffffffffff
R12: 0xffffff821dafbd50, R13: 0xffffff821dafbd98, R14: 0xffffff8069361e80, R15: 0xffffff821dafbde8
RFL: 0x0000000000010286, RIP: 0xffffff7fac01c9f8, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x0000000100000008, Error code: 0x0000000000000002, Fault CPU: 0x6

Backtrace (CPU 6), Frame : Return Address
0xffffff821dafb580 : 0xffffff802ba22fa9 mach_kernel : _panic + 0xc9
0xffffff821dafb600 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafb7d0 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafb7f0 : 0xffffff7fac01c9f8 net.lundman.spl : Backtrace (CPU 6), Frame : Return Address
0xffffff821dafada0 : 0xffffff802ba2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff821dafadd0 : 0xffffff802ba22f2f mach_kernel : _panic + 0x4f
0xffffff821dafae50 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafb020 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafb040 : 0xffffff802ba2fcf1 mach_kernel : ___doprnt + 0xed1
0xffffff821dafb260 : 0xffffff802ba30276 mach_kernel : _kdb_printf + 0x66
0xffffff821dafb2d0 : 0xffffff802bae0a49 mach_kernel : _machine_boot_info + 0x1b9
0xffffff821dafb330 : 0xffffff802bae074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff821dafb540 : 0xffffff802bae0294 mach_kernel : _Debugger + 0xa4
0xffffff821dafb580 : 0xffffff802ba22fa9 mach_kernel : _panic + 0xc9
0xffffff821dafb600 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafb7d0 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafb7f0 : 0xffffff7fac01c9f8 net.lundman.spl : Backtrace (CPU 6), Frame : Return Address
0xffffff821dafa5c0 : 0xffffff802ba2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff821dafa5f0 : 0xffffff802ba22f2f mach_kernel : _panic + 0x4f
0xffffff821dafa670 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafa840 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafa860 : 0xffffff802ba2fcf1 mach_kernel : ___doprnt + 0xed1
0xffffff821dafaa80 : 0xffffff802ba30276 mach_kernel : _kdb_printf + 0x66
0xffffff821dafaaf0 : 0xffffff802bae0a49 mach_kernel : _machine_boot_info + 0x1b9
0xffffff821dafab50 : 0xffffff802bae074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff821dafad60 : 0xffffff802bae0294 mach_kernel : _Debugger + 0xa4
0xffffff821dafada0 : 0xffffff802ba2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff821dafadd0 : 0xffffff802ba22f2f mach_kernel : _panic + 0x4f
0xffffff821dafae50 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafb020 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafb040 : 0xffffff802ba2fcf1 mach_kernel : ___doprnt + 0xed1
0xffffff821dafb260 : 0xffffff802ba30276 mach_kernel : _kdb_printf + 0x66
0xffffff821dafb2d0 : 0xffffff802bae0a49 mach_kernel : _machine_boot_info + 0x1b9
0xffffff821dafb330 : 0xffffff802bae074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff821dafb540 : 0xffffff802bae0294 mach_kernel : _Debugger + 0xa4
0xffffff821dafb580 : 0xffffff802ba22fa9 mach_kernel : _panic + 0xc9
0xffffff821dafb600 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafb7d0 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafb7f0 : 0xffffff7fac01c9f8 net.lundman.spl : Backtrace (CPU 6), Frame : Return Address
0xffffff821daf9de0 : 0xffffff802ba2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff821daf9e10 : 0xffffff802ba22f2f mach_kernel : _panic + 0x4f
0xffffff821daf9e90 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafa060 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafa080 : 0xffffff802ba2fcf1 mach_kernel : ___doprnt + 0xed1
0xffffff821dafa2a0 : 0xffffff802ba30276 mach_kernel : _kdb_printf + 0x66
0xffffff821dafa310 : 0xffffff802bae0a49 mach_kernel : _machine_boot_info + 0x1b9
0xffffff821dafa370 : 0xffffff802bae074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff821dafa580 : 0xffffff802bae0294 mach_kernel : _Debugger + 0xa4
0xffffff821dafa5c0 : 0xffffff802ba2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff821dafa5f0 : 0xffffff802ba22f2f mach_kernel : _panic + 0x4f
0xffffff821dafa670 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafa840 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafa860 : 0xffffff802ba2fcf1 mach_kernel : ___doprnt + 0xed1
0xffffff821dafaa80 : 0xffffff802ba30276 mach_kernel : _kdb_printf + 0x66
0xffffff821dafaaf0 : 0xffffff802bae0a49 mach_kernel : _machine_boot_info + 0x1b9
0xffffff821dafab50 : 0xffffff802bae074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff821dafad60 : 0xffffff802bae0294 mach_kernel : _Debugger + 0xa4
0xffffff821dafada0 : 0xffffff802ba2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff821dafadd0 : 0xffffff802ba22f2f mach_kernel : _panic + 0x4f
0xffffff821dafae50 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafb020 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafb040 : 0xffffff802ba2fcf1 mach_kernel : ___doprnt + 0xed1
0xffffff821dafb260 : 0xffffff802ba30276 mach_kernel : _kdb_printf + 0x66
0xffffff821dafb2d0 : 0xffffff802bae0a49 mach_kernel : _machine_boot_info + 0x1b9
0xffffff821dafb330 : 0xffffff802bae074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff821dafb540 : 0xffffff802bae0294 mach_kernel : _Debugger + 0xa4
0xffffff821dafb580 : 0xffffff802ba22fa9 mach_kernel : _panic + 0xc9
0xffffff821dafb600 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafb7d0 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafb7f0 : 0xffffff7fac01c9f8 net.lundman.spl : Backtrace (CPU 6), Frame : Return Address
0xffffff821daf9600 : 0xffffff802ba2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff821daf9630 : 0xffffff802ba22f2f mach_kernel : _panic + 0x4f
0xffffff821daf96b0 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821daf9880 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821daf98a0 : 0xffffff802ba2fcf1 mach_kernel : ___doprnt + 0xed1
0xffffff821daf9ac0 : 0xffffff802ba30276 mach_kernel : _kdb_printf + 0x66
0xffffff821daf9b30 : 0xffffff802bae0a49 mach_kernel : _machine_boot_info + 0x1b9
0xffffff821daf9b90 : 0xffffff802bae074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff821daf9da0 : 0xffffff802bae0294 mach_kernel : _Debugger + 0xa4
0xffffff821daf9de0 : 0xffffff802ba2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff821daf9e10 : 0xffffff802ba22f2f mach_kernel : _panic + 0x4f
0xffffff821daf9e90 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafa060 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafa080 : 0xffffff802ba2fcf1 mach_kernel : ___doprnt + 0xed1
0xffffff821dafa2a0 : 0xffffff802ba30276 mach_kernel : _kdb_printf + 0x66
0xffffff821dafa310 : 0xffffff802bae0a49 mach_kernel : _machine_boot_info + 0x1b9
0xffffff821dafa370 : 0xffffff802bae074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff821dafa580 : 0xffffff802bae0294 mach_kernel : _Debugger + 0xa4
0xffffff821dafa5c0 : 0xffffff802ba2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff821dafa5f0 : 0xffffff802ba22f2f mach_kernel : _panic + 0x4f
0xffffff821dafa670 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafa840 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafa860 : 0xffffff802ba2fcf1 mach_kernel : ___doprnt + 0xed1
0xffffff821dafaa80 : 0xffffff802ba30276 mach_kernel : _kdb_printf + 0x66
0xffffff821dafaaf0 : 0xffffff802bae0a49 mach_kernel : _machine_boot_info + 0x1b9
0xffffff821dafab50 : 0xffffff802bae074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff821dafad60 : 0xffffff802bae0294 mach_kernel : _Debugger + 0xa4
0xffffff821dafada0 : 0xffffff802ba2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff821dafadd0 : 0xffffff802ba22f2f mach_kernel : _panic + 0x4f
0xffffff821dafae50 : 0xffffff802badbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff821dafb020 : 0xffffff802baf3456 mach_kernel : _return_from_trap + 0xe6
0xffffff821dafb040 : 0xffffff802ba2fcf1 mach_kernel : ___doprnt + 0xed1
0xffffff821dafb260 : 0xffffff802ba30276 mach_kernel : _kdb_printf + 0x66
0xffffff821dafb2d0 : 0xffffff802bae0a49 mach
Model: Macmini6,2, BootROM MM61.0106.B03, 4 processors, Intel Core i7, 2.6 GHz, 16 GB, SMC 2.8f0

evansus commented 10 years ago

Hm, that's a nondescript backtrace... What all do you run to put the pool under load?

rottegift commented 10 years ago

Yeah, something went badly wrong there.

Ok, sources of load - this box is meant to be brutalized:

ssdpool contains a home directory, macports /opt, a squid aufs dataset, and some storage for a couple of VMs. Load comes from things like: Chrome Canary and WebKit Nightly firing up with scores of windows and hundreds of tabs resulting in enormous memory pressure and lots of small file traffic on their own caches and the squid cache; macports port sync and building things like the latest clang and llvm; automatic snapshotting and snapshot aging-out; and a periodic incremental zfs send. There are a few things that can trigger a lot more small-object activity (Mail.app indexing, and so forth). Occasional running of omnios or smartos guests (and sub-vms in that) can also be relied upon to add to memory pressure and read-modify-write activity in the VMWare Fusion .vmem file.

The other two pools see more occasional use but can be intensive too - mostly it's sequential but there is a substantial amount of less sequential read-modify-write work that gets done on them from time to time. The pool with the three-way mirror holds some other VM-related files, but is more torture tested through a pair of zvols which serve as time machine targets -- one for the local system and one that is served up using the OS X Server time machine mechanism (which mostly results in sparse bundles inside the zvol).

I've turned on dedup on some reasonable datasets, and that just adds to the small-object random traffic; it's fairly noticeable when time machine backups pour in from the network.

The system gets other non-zfs load too; it sends time machine backups to two other systems via afp, and for fun it runs a swupdate server whose data storage is an nfsv4 mount.

Some of this is weakly mutexable (or deferrable or outright avoidable) if I want to have the system up and running without problems to test something specific, but on the other hand I can create an enormous workload for an hour or so by starting most of the above soon after system startup. That tends to be when most odd problems get noticed (like https://github.com/openzfsonosx/zfs/issues/179 which happens when the boxes offering up the NFS and AFP storage are being saturated with traffic).

P.S.: For really fun load, which I'll return to when the dust settles in this branch and depending on ZoL's work on integrating the recent Delphix/Illumos change set: you can melt down a Mac by creating a zvol holding a jhfs+ time machine (native, not sparsebundle) on a fast source pool, then zfs send | recv it into a dataset that inherits dedup=sha256,compression=lz4,checksum=sha256 and lives in a pool that is slow to write to (e.g. an n-way mirror where n > 2) but has lots of l2arc headroom for the DDT. Even as a silly experiment that you'll abort and throw away, it'll make you pine for mdb or other ways to dynamically tweak the new zfs_dirty_data_max and zfs_vdev_* variables in zfs/vdev_queue.c, for example, and in short order it'll make you pine for the ability to do anything at all without fifteen-minute waits on system responsiveness.
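
A minimal sketch of that recipe (pool and dataset names are invented; the property combination is the one named above, and the jhfs+/Time Machine setup on the source zvol is left out):

# fast source pool holds a zvol that has been formatted as journaled HFS+
# and used as a local Time Machine target for a while
sudo zfs create -V 500g fastpool/tmsource

# target dataset carries the expensive combination, inherited by anything
# received under it
sudo zfs create -o dedup=sha256 -o compression=lz4 -o checksum=sha256 slowpool/dedup

# every block of the zvol now gets sha256'd, lz4'd and DDT-checked on a
# pool that is slow to write but has plenty of L2ARC headroom for the DDT
sudo zfs snapshot fastpool/tmsource@melt
sudo zfs send fastpool/tmsource@melt | sudo zfs recv slowpool/dedup/tmcopy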

P.P.S.: not panicking or locking up under that load is a great achievement not shared by OmniOS r151008; r151010 is worse than O3X at importing a pool that has a large freeing property arising from a restart while the system is dealing with zfs destroys on highly-deduped snapshots. I'll revisit that again when the dust settles on less totally insane ways to make a system busy and/or crashy, partly because Illumos doesn't use resources very well when importing a dirty pool.

evansus commented 10 years ago

Interesting - I'll try to replicate some of these tasks in my testing VMs, and on a separate partition on metal.

I already have several large zvols that I can hammer this with, but I don't have dedup enabled anywhere so that would be an interesting test. Especially the send/recv you mentioned.

I'm also running swupdate server on zvol, hosted & local Time Machine on zvols, etc. That's mostly on a 'stable' box running vdev-iokit from around May 26th. I'll test on that machine shortly.

Another Mac is booting from a 256gb zvol, so that has arc memory pressure + lots of small read/writes, but also hasn't been updated since around the same time. That iokit-boot branch is based on vdev-iokit, with zvol-unmap and some other tweaks merged in.

rottegift commented 10 years ago

I've had a couple of lockups that correlate with heavy l2arc reads, including an ugly double panic. Unfortunately none of them has yielded any debugging information. Master does not appear to do it, though, nor did vdev-iokit from a few days earlier.

evansus commented 10 years ago

I updated my laptop and Mac Mini to use the latest vdev-iokit without any issues. However, until today both machines had only a single data vdev.

I've added a cache vdev and log vdev to the Mini's pool. Still haven't replicated any panics or deadlocks. I wonder if having multiple cache vdevs would 'help' to replicate this.

I've done some incremental zfs send, local Time Machine from HFS to ZVOL+HFS, and remote TM from ZVOL+HFS to an AFP server. I definitely need to apply more strenuous tests though. I'm thinking dedup will be a better test, especially a zfs recv that inherits dedup, like you mentioned.

What flags do you build spl and zfs with? Here's what I'm using:

cd <devdir/spl>
./autogen.sh
./configure CC=clang CXX=clang++ OBJCXX=clang++ CFLAGS=-g
make 2>&1 | tee make.log
sudo make install DESTDIR=/stage-vdev 2>&1 | tee stage.log
sudo make install

cd <devdir/zfs>
./autogen.sh
./configure CC=clang CXX=clang++ OBJCXX=clang++ CFLAGS=-g --with-spl=<path/to/spl>
make 2>&1 | tee make.log
sudo make install DESTDIR=/stage-vdev 2>&1 | tee stage.log
sudo make install

Typically I'll check the make.log and stage.log, then rsync from the stage dir to the target system - this time around I opted for sudo make install without DESTDIR.

rottegift commented 10 years ago

I'm lazy and reasonably trusting of you committers, so I just move the spl & zfs kexts aside, then run the latest zfsadm -z vdev-iokit; cd Developer/spl; make install; cd ../zfs; make install; shutdown -r time.

vdev-iokit has been pretty solid; uptimes of 12+ hours under fair load. It's not too surprising that a much more lightly loaded system would have even longer MTBPs.

More cache vdevs are worth a try; file- or zvol-backed ones are slow but "free" :-)

rottegift commented 10 years ago

I can semi-reliably cause a double panic or a total lock-up (including unresponsive network interfaces) by doing a "diskutil verifyvolume DONKEYTM" with one or three cache vdevs, especially if I do "sysctl zfs.l2arc_feed_min_ms=100 zfs.l2arc_norw=1 zfs.l2arc_noprefetch=0". It happens faster if the cache vdev ashift is 9 instead of 12 or 13.

With the only cache device being a USB3 pen drive that trades off a really good blinking-light implementation for relatively weak performance, a total hang leaves the r/w access light stuck on. Substituting a Patriot Memory drive (no blinking light at all, but a well-performing device) has so far avoided a crash.
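
Pulling the reproducer together as a sketch (pool, volume, and tunable values are the ones mentioned above; the whole point is that it can panic the box, so run it accordingly):

# make the L2ARC feed thread aggressive
sudo sysctl zfs.l2arc_feed_min_ms=100 zfs.l2arc_norw=1 zfs.l2arc_noprefetch=0

# hammer the HFS+ volume that lives on the Donkey/TM zvol while other
# writes (e.g. a network Time Machine backup) go into the same pool
diskutil verifyvolume DONKEYTM

# watch the cache vdev traffic while it runs
zpool iostat -v Donkey 5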

rottegift commented 10 years ago

Oh, and DONKEYTM is a local time machine destination.

It's not fully deduplicated, just the first ca. 30GiB and the data that has been added since ~9 Jun. The pool is a three-way mirror and it would have taken forever to zfs recv the whole zvol with dedup on.

NAME       PROPERTY               VALUE                  SOURCE
Donkey/TM  type                   volume                 -
Donkey/TM  creation               Mon Jun  9 10:45 2014  -
Donkey/TM  used                   213G                   -
Donkey/TM  available              171G                   -
Donkey/TM  referenced             198G                   -
Donkey/TM  compressratio          1.50x                  -
Donkey/TM  reservation            none                   default
Donkey/TM  volsize                360G                   local
Donkey/TM  volblocksize           8K                     -
Donkey/TM  checksum               sha256                 received
Donkey/TM  compression            lz4                    inherited from Donkey
Donkey/TM  readonly               off                    default
Donkey/TM  copies                 1                      default
Donkey/TM  refreservation         none                   default
Donkey/TM  primarycache           all                    default
Donkey/TM  secondarycache         all                    default
Donkey/TM  usedbysnapshots        15.5G                  -
Donkey/TM  usedbydataset          198G                   -
Donkey/TM  usedbychildren         0                      -
Donkey/TM  usedbyrefreservation   0                      -
Donkey/TM  logbias                latency                default
Donkey/TM  dedup                  sha256                 local
Donkey/TM  mlslabel               none                   default
Donkey/TM  sync                   standard               default
Donkey/TM  refcompressratio       1.45x                  -
Donkey/TM  written                2.32G                  -
Donkey/TM  logicalused            318G                   -
Donkey/TM  logicalreferenced      286G                   -
Donkey/TM  snapdev                hidden                 default

Also, this is really nice to see. L2ARC is useful.

                 capacity     operations    bandwidth
pool          alloc   free   read  write   read  write
------------  -----  -----  -----  -----  -----  -----
Donkey        1.62T   201G    549     19  1.17M  80.2K
  mirror      1.62T   201G    549     19  1.17M  80.2K
    disk19s2      -      -    112      8   465K  80.6K
    disk20s2      -      -    119      7   554K  80.6K
    disk22s2      -      -    125      8   608K  80.6K
logs              -      -      -      -      -      -
  mirror       280K   944M      0      0      0      0
    disk16s1      -      -      0      0      0      0
    disk11s1      -      -      0      0      0      0
cache             -      -      -      -      -      -
  disk29      2.35G   115G  1.17K      4  2.66M   514K

rottegift commented 10 years ago

OK, this points to a difference between zfs.l2arc_feed_min_ms=200 and zfs.l2arc_feed_min_ms=100; the latter quickly provokes a panic or total lock-up under heavy load.

Heavy load here is a diskutil verifyvolume on a time machine target zvol at the same time a remote time machine is writing into another zvol (and also some other minor I/O) on the same pool.

I haven't gotten around to physically removing the cache vdev device while it's in use, but that would help distinguish between "device goes away under load, causes crash" and "device is still there, load causes crash".
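
A software approximation of that experiment, as a sketch (the device name is the cache vdev from the iostat output above; removing and re-adding a cache vdev does not touch pool data):

# while the verifyvolume / Time Machine load is running:
sudo zpool remove Donkey disk29     # cache vdev goes away under load
sleep 60
sudo zpool add Donkey cache disk29  # bring it back
zpool status -v Donkey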

rottegift commented 10 years ago

The system hung in the night and, after a forced reboot, panicked during auto-import (two pools were up, plus the zvols on one of them, according to the zed syslogs):

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Mon Jun 23 11:17:21 2014
panic(cpu 2 caller 0xffffff8027adbf5e): Kernel trap at 0xffffff7fa8075e91, type 13=general protection, registers:
CR0: 0x0000000080010033, CR2: 0x00007fd4daf40000, CR3: 0x00000000804ad09e, CR4: 0x00000000001606e0
RAX: 0x979195dc0f010f04, RBX: 0xffffff824596bf10, RCX: 0x979195dc0f010f04, RDX: 0xffffff82175b9e38
RSP: 0xffffff824596ba20, RBP: 0xffffff824596ba50, RSI: 0xffffff82175b9e28, RDI: 0xffffff82175b9e28
R8:  0x0000000000000000, R9:  0xffffff8028001910, R10: 0x00000000000003ff, R11: 0xffffffffffffffff
R12: 0x0000000000100003, R13: 0xffffff804fbfd0f0, R14: 0xffffff824596be70, R15: 0x0000000000000048
RFL: 0x0000000000010282, RIP: 0xffffff7fa8075e91, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x00007fd4daf40000, Error code: 0x0000000000000000, Fault CPU: 0x2

Backtrace (CPU 2), Frame : Return Address
0xffffff8207fdddf0 : 0xffffff8027a22fa9 mach_kernel : _panic + 0xc9
0xffffff8207fdde70 : 0xffffff8027adbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff8207fde040 : 0xffffff8027af3456 mach_kernel : _return_from_trap + 0xe6
0xffffff8207fde060 : 0xffffff7fa8075e91 net.lundman.zfs : _dsl_dir_tempreserve_clear + 0x61
0xffffff824596ba50 : 0xffffff7fa805c6af net.lundman.zfs : _dmu_tx_commit + 0x10f
0xffffff824596ba80 : 0xffffff7fa8108110 net.lundman.zfs : _zfs_write + 0x1120
0xffffff824596bd00 : 0xffffff7fa8116c4e net.lundman.zfs : _zfs_vnop_write + 0xae
0xffffff824596bd60 : 0xffffff8027bfde51 mach_kernel : _VNOP_WRITE + 0xe1
0xffffff824596bde0 : 0xffffff8027bf3be3 mach_kernel : _utf8_normalizestr + 0x703
0xffffff824596be50 : 0xffffff8027df22f1 mach_kernel : _write_nocancel + 0x1b1
0xffffff824596bef0 : 0xffffff8027df21dd mach_kernel : _write_nocancel + 0x9d
0xffffff824596bf50 : 0xffffff8027e40653 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff824596bfb0 : 0xffffff8027af3c56 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[0C7B3320-A055-3936-B86A-8C6E3EC2CB6D]@0xffffff7fa802b000->0xffffff7fa823cfff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7fa7fec000
            dependency: net.lundman.spl(1.0.0)[63543165-E11D-384D-8456-F485EA93FC7E]@0xffffff7fa801a000

rottegift commented 10 years ago

After returning to this branch from crashy master (https://github.com/openzfsonosx/zfs/issues/202, which will be updated in a second), this panic.

(Messy. Something to do with a zvol, I guess), because of:

net.lundman.zfs : _vdev_iokit_io_intr + 0x2c
com.apple.iokit.IOStorageFamily : __ZN20IOBlockStorageDriver24prepareRequesN

Sadly the rest of the stack got eaten.

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Mon Jun 23 14:22:42 2014
panic(cpu 0 caller 0xffffff8023edbf5e): Kernel trap at 0xffffff7fa5bb26dc, type 14=page fault, registers:
CR0: 0x000000008001003b, CR2: 0x0000000000000000, CR3: 0x00000000261ae000, CR4: 0x00000000001606e0
RAX: 0xffffff7fa5bb26b0, RBX: 0xffffff834f7028d8, RCX: 0x0000000000000800, RDX: 0x0000000000000000
RSP: 0xffffff820ed0ba20, RBP: 0xffffff820ed0ba40, RSI: 0xffffff839c1d1650, RDI: 0x0000000000000000
R8:  0x0000000000000800, R9:  0x0000000000000000, R10: 0xffffff8024470800, R11: 0xffffff7fa4792994
R12: 0xffffff839c1d1650, R13: 0x0000000000000002, R14: 0x0000000000000800, R15: 0x0000000000000000
RFL: 0x0000000000010286, RIP: 0xffffff7fa5bb26dc, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x0000000000000000, Error code: 0x0000000000000000, Fault CPU: 0x0

Backtrace (CPU 0), Frame : Return Address
0xffffff820ed0b6b0 : 0xffffff8023e22fa9 mach_kernel : _panic + 0xc9
0xffffff820ed0b730 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0b900 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0b920 : 0xffffff7fa5bb26dc net.lundman.zfs : _vdev_iokit_io_intr + 0x2c
0xffffff820ed0ba40 : 0xffffff7fa43f6ab2 com.apple.iokit.IOStorageFamily : __ZN20IOBlockStorageDriver24prepareRequestCompletionEPvS0_iy + 0xc0
0xffffff820ed0ba90 : 0xffffff7fa46e12ca Backtrace (CPU 0), Frame : Return Address
0xffffff820ed0b070 : 0xffffff8023e2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff820ed0b0a0 : 0xffffff8023e22f2f mach_kernel : _panic + 0x4f
0xffffff820ed0b120 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0b2f0 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0b310 : 0xffffff8023ee09f0 mach_kernel : _machine_boot_info + 0x160
0xffffff820ed0b460 : 0xffffff8023ee074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff820ed0b670 : 0xffffff8023ee0294 mach_kernel : _Debugger + 0xa4
0xffffff820ed0b6b0 : 0xffffff8023e22fa9 mach_kernel : _panic + 0xc9
0xffffff820ed0b730 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0b900 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0b920 : 0xffffff7fa5bb26dc net.lundman.zfs : _vdev_iokit_io_intr + 0x2c
0xffffff820ed0ba40 : 0xffffff7fa43f6ab2 com.apple.iokit.IOStorageFamily : __ZN20IOBlockStorageDriver24prepareRequestCompletionEPvS0_iy + 0xc0
0xffffff820ed0ba90 : 0xffffff7fa46e12ca Backtrace (CPU 0), Frame : Return Address
0xffffff820ed0aa30 : 0xffffff8023e2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff820ed0aa60 : 0xffffff8023e22f2f mach_kernel : _panic + 0x4f
0xffffff820ed0aae0 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0acb0 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0acd0 : 0xffffff8023ee09f0 mach_kernel : _machine_boot_info + 0x160
0xffffff820ed0ae20 : 0xffffff8023ee074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff820ed0b030 : 0xffffff8023ee0294 mach_kernel : _Debugger + 0xa4
0xffffff820ed0b070 : 0xffffff8023e2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff820ed0b0a0 : 0xffffff8023e22f2f mach_kernel : _panic + 0x4f
0xffffff820ed0b120 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0b2f0 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0b310 : 0xffffff8023ee09f0 mach_kernel : _machine_boot_info + 0x160
0xffffff820ed0b460 : 0xffffff8023ee074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff820ed0b670 : 0xffffff8023ee0294 mach_kernel : _Debugger + 0xa4
0xffffff820ed0b6b0 : 0xffffff8023e22fa9 mach_kernel : _panic + 0xc9
0xffffff820ed0b730 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0b900 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0b920 : 0xffffff7fa5bb26dc net.lundman.zfs : _vdev_iokit_io_intr + 0x2c
0xffffff820ed0ba40 : 0xffffff7fa43f6ab2 com.apple.iokit.IOStorageFamily : __ZN20IOBlockStorageDriver24prepareRequestCompletionEPvS0_iy + 0xc0
0xffffff820ed0ba90 : 0xffffff7fa46e12ca Backtrace (CPU 0), Frame : Return Address
0xffffff820ed0a3f0 : 0xffffff8023e2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff820ed0a420 : 0xffffff8023e22f2f mach_kernel : _panic + 0x4f
0xffffff820ed0a4a0 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0a670 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0a690 : 0xffffff8023ee09f0 mach_kernel : _machine_boot_info + 0x160
0xffffff820ed0a7e0 : 0xffffff8023ee074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff820ed0a9f0 : 0xffffff8023ee0294 mach_kernel : _Debugger + 0xa4
0xffffff820ed0aa30 : 0xffffff8023e2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff820ed0aa60 : 0xffffff8023e22f2f mach_kernel : _panic + 0x4f
0xffffff820ed0aae0 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0acb0 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0acd0 : 0xffffff8023ee09f0 mach_kernel : _machine_boot_info + 0x160
0xffffff820ed0ae20 : 0xffffff8023ee074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff820ed0b030 : 0xffffff8023ee0294 mach_kernel : _Debugger + 0xa4
0xffffff820ed0b070 : 0xffffff8023e2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff820ed0b0a0 : 0xffffff8023e22f2f mach_kernel : _panic + 0x4f
0xffffff820ed0b120 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0b2f0 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0b310 : 0xffffff8023ee09f0 mach_kernel : _machine_boot_info + 0x160
0xffffff820ed0b460 : 0xffffff8023ee074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff820ed0b670 : 0xffffff8023ee0294 mach_kernel : _Debugger + 0xa4
0xffffff820ed0b6b0 : 0xffffff8023e22fa9 mach_kernel : _panic + 0xc9
0xffffff820ed0b730 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0b900 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0b920 : 0xffffff7fa5bb26dc net.lundman.zfs : _vdev_iokit_io_intr + 0x2c
0xffffff820ed0ba40 : 0xffffff7fa43f6ab2 com.apple.iokit.IOStorageFamily : __ZN20IOBlockStorageDriver24prepareRequestCompletionEPvS0_iy + 0xc0
0xffffff820ed0ba90 : 0xffffff7fa46e12ca Backtrace (CPU 0), Frame : Return Address
0xffffff820ed09db0 : 0xffffff8023e2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff820ed09de0 : 0xffffff8023e22f2f mach_kernel : _panic + 0x4f
0xffffff820ed09e60 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0a030 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0a050 : 0xffffff8023ee09f0 mach_kernel : _machine_boot_info + 0x160
0xffffff820ed0a1a0 : 0xffffff8023ee074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff820ed0a3b0 : 0xffffff8023ee0294 mach_kernel : _Debugger + 0xa4
0xffffff820ed0a3f0 : 0xffffff8023e2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff820ed0a420 : 0xffffff8023e22f2f mach_kernel : _panic + 0x4f
0xffffff820ed0a4a0 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0a670 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0a690 : 0xffffff8023ee09f0 mach_kernel : _machine_boot_info + 0x160
0xffffff820ed0a7e0 : 0xffffff8023ee074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff820ed0a9f0 : 0xffffff8023ee0294 mach_kernel : _Debugger + 0xa4
0xffffff820ed0aa30 : 0xffffff8023e2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff820ed0aa60 : 0xffffff8023e22f2f mach_kernel : _panic + 0x4f
0xffffff820ed0aae0 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0acb0 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0acd0 : 0xffffff8023ee09f0 mach_kernel : _machine_boot_info + 0x160
0xffffff820ed0ae20 : 0xffffff8023ee074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff820ed0b030 : 0xffffff8023ee0294 mach_kernel : _Debugger + 0xa4
0xffffff820ed0b070 : 0xffffff8023e2320d mach_kernel : __consume_panic_args + 0x19d
0xffffff820ed0b0a0 : 0xffffff8023e22f2f mach_kernel : _panic + 0x4f
0xffffff820ed0b120 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0b2f0 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0b310 : 0xffffff8023ee09f0 mach_kernel : _machine_boot_info + 0x160
0xffffff820ed0b460 : 0xffffff8023ee074f mach_kernel : _panic_i386_backtrace + 0x31f
0xffffff820ed0b670 : 0xffffff8023ee0294 mach_kernel : _Debugger + 0xa4
0xffffff820ed0b6b0 : 0xffffff8023e22fa9 mach_kernel : _panic + 0xc9
0xffffff820ed0b730 : 0xffffff8023edbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820ed0b900 : 0xffffff8023ef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820ed0b920 : 0xffffff7fa5bb26dc net.lundman.zfs : _vdev_iokit_io_intr + 0x2c
0xffffff820ed0ba40 : 0xffffff7fa43f6ab2 com.apple.iokit.IOStorageFamily : __ZN20IOBlockStorageDriver24prepareReques
Model: Macmini6,2, BootROM MM61.0106.B03, 4 processors, Intel Core i7, 2.6 GHz, 16 GB, SMC 2.8f0

rottegift commented 10 years ago

vdev-iokit branch, tip of tree today:

This may be a coincidence of timing:

One of the mirror vdevs in a pool came up degraded because one disk was missing.

I began an export on the pool, and as it was taking its time I power-cycled the disk that zfs was not seeing, wondering if that would provoke zfs into seeing the disk. The panic struck very soon afterwards.

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Wed Jun 25 18:27:42 2014
panic(cpu 2 caller 0xffffff802f4dbf5e): Kernel trap at 0xffffff7fb13bba5c, type 13=general protection, registers:
CR0: 0x0000000080010033, CR2: 0x00007fa050517000, CR3: 0x0000000173e990b5, CR4: 0x00000000001606e0
RAX: 0xffffff7fb13bba30, RBX: 0xffffff8251a6bd58, RCX: 0x7261646e6f636573, RDX: 0x0000000000000000
RSP: 0xffffff8251a6b978, RBP: 0xffffff8251a6b9a0, RSI: 0x002a10245c616adb, RDI: 0x7261646e6f636573
R8:  0xffffff7fb136c060, R9:  0x0000000000000220, R10: 0xffffff8251a6b5a0, R11: 0xffffff8251a6b700
R12: 0x000000002200016f, R13: 0xffffff80642300f0, R14: 0x000000002200016f, R15: 0xffffff8057bcfa80
RFL: 0x0000000000010296, RIP: 0xffffff7fb13bba5c, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x00007fa050517000, Error code: 0x0000000000000000, Fault CPU: 0x2

Backtrace (CPU 2), Frame : Return Address
0xffffff820f9cddf0 : 0xffffff802f422fa9 mach_kernel : _panic + 0xc9
0xffffff820f9cde70 : 0xffffff802f4dbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff820f9ce040 : 0xffffff802f4f3456 mach_kernel : _return_from_trap + 0xe6
0xffffff820f9ce060 : 0xffffff7fb13bba5c net.lundman.zfs : _unique_compare + 0x2c
0xffffff8251a6b9a0 : 0xffffff7fb1440e39 net.lundman.zfs : _avl_find + 0x69
0xffffff8251a6b9f0 : 0xffffff7fb13bbb94 net.lundman.zfs : _unique_insert + 0x94
0xffffff8251a6ba20 : 0xffffff7fb136c004 net.lundman.zfs : _dsl_dataset_hold_obj + 0x584
0xffffff8251a6bad0 : 0xffffff7fb136c252 net.lundman.zfs : _dsl_dataset_hold + 0x92
0xffffff8251a6bb40 : 0xffffff7fb13521e8 net.lundman.zfs : _dmu_objset_hold + 0x58
0xffffff8251a6bb90 : 0xffffff7fb13f8c30 net.lundman.zfs : _zfs_ioc_objset_stats + 0x20
0xffffff8251a6bbc0 : 0xffffff7fb13f8f8e net.lundman.zfs : _zfs_ioc_dataset_list_next + 0x19e
0xffffff8251a6bc20 : 0xffffff7fb13f4ca4 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff8251a6bcf0 : 0xffffff802f60d63f mach_kernel : _spec_ioctl + 0x11f
0xffffff8251a6bd40 : 0xffffff802f5fe000 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff8251a6bdc0 : 0xffffff802f5f3e51 mach_kernel : _utf8_normalizestr + 0x971
0xffffff8251a6be10 : 0xffffff802f7c1303 mach_kernel : _fo_ioctl + 0x43
0xffffff8251a6be40 : 0xffffff802f7f2c66 mach_kernel : _ioctl + 0x466
0xffffff8251a6bf50 : 0xffffff802f840653 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff8251a6bfb0 : 0xffffff802f4f3c56 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[AD254B6A-46EA-38FC-9F71-9EAE64EC74D7]@0xffffff7fb1331000->0xffffff7fb1541fff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7faf9ec000
            dependency: net.lundman.spl(1.0.0)[626E72AC-DE97-3631-ADC2-0849D468CF64]@0xffffff7fafa1a000

BSD process name corresponding to current thread: zfs
Boot args: -v keepsyms=y darkwake=0

evansus commented 10 years ago

Seems odd that you're experiencing a panic in unique_insert -> avl_find -> unique_compare. @lundman, perhaps you could take a look? I noticed in the comments on #202 that we're suspecting possible memory corruption. I suppose it could be clobbered memory, but it could also be a hardware issue, no?

@rottegift You've been experiencing panics on the master and vdev-iokit branches that I can't replicate... are all these panics happening on the same Mac? Perhaps a few passes of memtest are in order? Or Apple Hardware Test (pre-June 2013 Macs) / Apple Diagnostics (June 2013 and later).

rottegift commented 10 years ago

"@rottegift You've been experiencing panics on the master and vdev-iokit branches that I can't replicate... are all these panics happening on the same Mac? Perhaps a few passes of memtest are in order?"

No, it's on a pair of machines. I did that on one of the minis, and will do it on the other too in due course.

evansus commented 10 years ago

Interesting, a RAM issue is unlikely in that case. On another note, have you tested with cpus=1 in boot.plist (or just set in nvram)?

rottegift commented 10 years ago

"have you tested with cpus=1"

No, not at all. If the panics and/or hangs recur, I will.

rottegift commented 10 years ago

Looking around, I believe some builds I did were done with the wrong clang (MacPorts' latest clang 3.4 built from source rather than Xcode's clang). I've now quintuple-checked everything on that front and will hope for no recurrence.

rottegift commented 10 years ago

Hmmm, nope. Another one (two scrubs were in progress).

I'll try the cpus=1 boot arg now.

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Thu Jun 26 11:54:33 2014
panic(cpu 4 caller 0xffffff802cedbf5e): Kernel trap at 0xffffff7fad41cbea, type 14=page fault, registers:
CR0: 0x000000008001003b, CR2: 0x0000000000000340, CR3: 0x000000002f3d5000, CR4: 0x00000000001606e0
RAX: 0x0000000000000340, RBX: 0x0000000000000340, RCX: 0xffffff82d5747d18, RDX: 0xffffff82d5761ad0
RSP: 0xffffff823061bc30, RBP: 0xffffff823061bc40, RSI: 0xffffff82e6265e48, RDI: 0x0000000000000340
R8:  0x000000000000003f, R9:  0x0000000000000000, R10: 0xffffff802d470800, R11: 0x0000000000000000
R12: 0xffffff821b7a07e8, R13: 0x000000000001fa74, R14: 0xffffff821b7a07c8, R15: 0xffffff8232d0b5b8
RFL: 0x0000000000010202, RIP: 0xffffff7fad41cbea, CS:  0x0000000000000008, SS:  0x0000000000000000
Fault CR2: 0x0000000000000340, Error code: 0x0000000000000000, Fault CPU: 0x4

Backtrace (CPU 4), Frame : Return Address
0xffffff823061b8c0 : 0xffffff802ce22fa9 mach_kernel : _panic + 0xc9
0xffffff823061b940 : 0xffffff802cedbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff823061bb10 : 0xffffff802cef3456 mach_kernel : _return_from_trap + 0xe6
0xffffff823061bb30 : 0xffffff7fad41cbea net.lundman.spl : _spl_mutex_enter + 0xa
0xffffff823061bc40 : 0xffffff7fad4c5efc net.lundman.zfs : _vdev_mirror_scrub_done + 0x6c
0xffffff823061bc70 : 0xffffff7fad5311f9 net.lundman.zfs : _zio_done + 0xff9
0xffffff823061bd90 : 0xffffff7fad52c78a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff823061bdd0 : 0xffffff7fad5313d2 net.lundman.zfs : _zio_done + 0x11d2
0xffffff823061bef0 : 0xffffff7fad52c78a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff823061bf30 : 0xffffff7fad52c655 net.lundman.zfs : _zio_execute + 0x15
0xffffff823061bf50 : 0xffffff7fad41e217 net.lundman.spl : _taskq_thread + 0xc7
0xffffff823061bfb0 : 0xffffff802ced7127 mach_kernel : _call_continuation + 0x17
      Kernel Extensions in backtrace:
         net.lundman.spl(1.0)[205406D0-4396-3572-B257-19B5A81B1084]@0xffffff7fad41a000->0xffffff7fad42afff
         net.lundman.zfs(1.0)[CA9C82FD-1FA0-3927-A8B9-5DFB3141B3FD]@0xffffff7fad42b000->0xffffff7fad63cfff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7fad3ec000
            dependency: net.lundman.spl(1.0.0)[205406D0-4396-3572-B257-19B5A81B1084]@0xffffff7fad41a000

BSD process name corresponding to current thread: kernel_task
Boot args: -v keepsyms=y darkwake=0
rottegift commented 10 years ago

Biting the bullet and doing extended testing in Apple Hardware Test.

memtest all 2 was OK in single-user mode.

rottegift commented 10 years ago

"Test results: No trouble found" "Total Time Testing: 1 hour 9 mins 42 secs"

So I think we can rule out hardware issues.

ilovezfs commented 10 years ago

@rottegift Is this limited to vdev_iokit?

rottegift commented 10 years ago

@ilovezfs: no.

I'm now trying with cpus=1 as suggested above. It's very slow. What will come first, panic or death-by-old-age? :-)

Assuming this does crash, the next step is to build a totally pristine new boot drive with nothing but zfs on it (ideally the latest FTP build) and try to replicate the conditions in which crashes happen (mostly lots of small writes onto slow media, especially into zvols).

rottegift commented 10 years ago

Hm, I don't suppose you already know a bare-minimum set of LaunchDaemons needed to make zfs work (pool import/export, dataset mount/unmount, etc.) in single-user mode?
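
(My guess, once the kexts are installed, is that the bare minimum is something like the following, with the LaunchDaemons only needed for automatic import at boot; the kext paths are an assumption based on a default install, so treat it as a sketch:)

# in single-user mode, after making the root filesystem writable
/sbin/mount -uw /
/sbin/kextload /Library/Extensions/spl.kext
/sbin/kextload /Library/Extensions/zfs.kext
zpool import -a && zfs mount -a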

ilovezfs commented 10 years ago

I'd just do sudo make install.

I wonder if this is zed/zpool.cache related.

https://github.com/openzfsonosx/zfs/blob/e34a376027751ee1e44553fcec5a5c1a611283d3/module/zfs/spa_config.c#L70

Maybe try building with zfs_autoimport_disable = 1, and do not install the plists.
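
i.e., something along these lines before building (a sketch; it assumes the tunable is still a plain int initialized to 0 at the line linked above):

$ grep -n zfs_autoimport_disable module/zfs/spa_config.c
# change the initializer in that declaration from 0 to 1, then:
$ ./configure && make && sudo make install
# ...and skip whatever step you normally use to install/load the launchd plists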

rottegift commented 10 years ago

I'll try zfs_autoimport_disable = 1 (and master) after running for a bit with cpus=1.

The same vdev-iokit source code is running on another Mac (the one I'm typing on now), which runs 10.8 under a vastly different load pattern and has had significantly longer uptimes in recent days. Autoimport is working just fine on this machine.

On the crashier machine (the subject of the hardware tests above), vdev-iokit happily autoimports and runs for hours.

So I'm not sure it will make much difference, but I'll try it just to eliminate the lines of thought you're suggesting.

rottegift commented 10 years ago

Well, cpus=1 is nice for showing off how good Mac OS X (and zfs) are at multithreading and multiprocessing.

$ sysctl vm.loadavg
vm.loadavg: { 336.91 322.35 320.98 }

Those sha256 checksums sure keep a single processor busy.

rottegift commented 10 years ago

@evansus I think the only thing cpus=1 is telling me is that even a light load on this machine is more than a one-CPU system can cope with.

All it has done is: imported ssdpool (which is doing a scrub, so lots of sha256), imported Trinity (ditto), and fired up the UI with Safari, Terminal, Activity Monitor, Console, and a couple of other apps; and it has been waiting about ten minutes for "$ sudo -s" to even display a Password: prompt. I'm going to attempt a graceful shutdown when it gives me the sudo shell, since ssh-ing in is hopeless and there is no useful way to inspect the workings of the system.

evansus commented 10 years ago

Interesting - looks like the vdev_mirror_scrub_done panic is somewhat repeatable, maybe only at the end of a scrub.

I wonder if that could be reproduced by creating a small test pool and running zpool scrub on it, first without cpus=1, then with it if the panic reproduces (and of course on master vs. vdev-iokit). Otherwise, yes, scrubbing the large pools with cpus=1 would take an eternity to complete. :)

I still haven't replicated that panic, but incidentally I don't think I've been using checksum=sha256 on test pools. Typically I use something close to zpool create -o ashift=12 -o failmode=continue -O canmount=off -O mountpoint=none -O atime=off, which defaults to checksum=fletcher4. I'm modifying my test scripts now.
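
Something like this is what I have in mind (file vdevs, the pool name, and the sizes are assumptions for the sketch, not taken from rottegift's real setup):

$ mkfile 128m /var/tmp/sd0 /var/tmp/sd1
$ sudo zpool create -o ashift=12 -O checksum=sha256 -O compression=lz4 \
    scrubtest mirror /var/tmp/sd0 /var/tmp/sd1
$ for i in 1 2 3 4 5; do
    sudo zpool scrub scrubtest
    while zpool status scrubtest | grep -q 'scrub in progress'; do sleep 5; done
  done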

But I take it there haven't been any panics while running single-CPU, which might indicate a synchronization issue. It's tricky, though: only reproducing one of the specific panics while using cpus=1 would confirm that it is not a multi-threading issue.

rottegift commented 10 years ago

Yes, I understand the idea of eliminating a whole range of synchronization and contention problems by going to one CPU; however, the system hasn't even managed to mount all three of the pools it normally does, let alone reach a point at which I can do one of the tasks that I think are most likely to correlate with panics.

Yes, checksum=sha256 on practically everything, and compression=lz4 on everything.

A typical create for me is

zpool create -O normalization=formD -O checksum=sha256 -O casesensitivity=insensitive -o ashift=12 homepool mirror disk3s2 disk27s2 log mirror disk25s1 disk4s1 cache disk7s2 disk24

I even do that sort of thing on non-rpool pools on illumos

2013-12-25.04:25:00 zpool create -f -O normalization=formD -O casesensitivity=insensitive -O compression=lz4 -O checksum=sha256 Newmis raidz c9t0d0s1 c9t9d0s1 c9t8d0s1 cache c9t11d0s0

for example.

evansus commented 10 years ago

Actually, I take that back: the laptop's and the mini's pools are checksum=sha256; only the test pools created in a VM had the default checksum.

Also, I did experience a panic when attempting a scrub on my laptop, but it is a unique scenario: the panic log shows vdev_hold attempting to call vdev_op_hold, which is NULL (vdev_iokit_hold and vdev_iokit_rele are unimplemented). vdev_hold/rele are only called in the case of a root pool, and this laptop is a bit 'unique' in that way ;) https://gist.github.com/evansus/910f208cb6c2cfd9bcab

That issue is fixable, but unrelated. I need to test this on my Mac Mini, after updating it to ToT vdev-iokit, and in VMs with my updated hotspares script.

rottegift commented 10 years ago

@ilovezfs: "zfs_autoimport_disable = 1"

I'm going to try your latest commits in master for a bit without this, and then with it. I'll raise any problems in another issue, and leave this one only for vdev_iokit for now (unless you prefer it to all be here until we figure out what's causing these panics and hangs).

rottegift commented 10 years ago

A couple more while dealing with the aftermath of the comments attached at the end of https://github.com/openzfsonosx/zfs/commit/95ff8054f02b464c2d077af760aee0ffbd28ab5c

These two were in vdev-iokit as I was replacing devices that had gotten stomped with bad labels.

The first one was perhaps triggered by a zfs send sourced from one of the DEGRADED pools.

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Fri Jun 27 00:01:18 2014
panic(cpu 0 caller 0xffffff7f85d6dda5): "VERIFY(" "nvlist_add_nvlist(nv, propname, propval) == 0" ") failed\n"@dsl_prop.c:961
Backtrace (CPU 0), Frame : Return Address
0xffffff81ff03b660 : 0xffffff8004022fa9 mach_kernel : _panic + 0xc9
0xffffff81ff03b6e0 : 0xffffff7f85d6dda5 net.lundman.zfs : _dsl_prop_get_all_impl + 0x535
0xffffff81ff03b9e0 : 0xffffff7f85d6d3f2 net.lundman.zfs : _dsl_prop_get_all_ds + 0xf2
0xffffff81ff03bb40 : 0xffffff7f85d6d2f5 net.lundman.zfs : _dsl_prop_get_all + 0x25
0xffffff81ff03bb60 : 0xffffff7f85de732d net.lundman.zfs : _zfs_ioc_objset_stats_impl + 0x4d
0xffffff81ff03bba0 : 0xffffff7f85de413b net.lundman.zfs : _zfs_ioc_snapshot_list_next + 0x1ab
0xffffff81ff03bc20 : 0xffffff7f85ddfc64 net.lundman.zfs : _zfsdev_ioctl + 0x664
0xffffff81ff03bcf0 : 0xffffff800420d63f mach_kernel : _spec_ioctl + 0x11f
0xffffff81ff03bd40 : 0xffffff80041fe000 mach_kernel : _VNOP_IOCTL + 0x150
0xffffff81ff03bdc0 : 0xffffff80041f3e51 mach_kernel : _utf8_normalizestr + 0x971
0xffffff81ff03be10 : 0xffffff80043c1303 mach_kernel : _fo_ioctl + 0x43
0xffffff81ff03be40 : 0xffffff80043f2c66 mach_kernel : _ioctl + 0x466
0xffffff81ff03bf50 : 0xffffff8004440653 mach_kernel : _unix_syscall64 + 0x1f3
0xffffff81ff03bfb0 : 0xffffff80040f3c56 mach_kernel : _hndl_unix_scall64 + 0x16
      Kernel Extensions in backtrace:
         net.lundman.zfs(1.0)[CA9C82FD-1FA0-3927-A8B9-5DFB3141B3FD]@0xffffff7f85d1c000->0xffffff7f85f2cfff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f84604000
            dependency: net.lundman.spl(1.0.0)[205406D0-4396-3572-B257-19B5A81B1084]@0xffffff7f845f3000

BSD process name corresponding to current thread: zfs
Boot args: -v keepsyms=y darkwake=0

and

Anonymous UUID:       EA3E4DC2-8F4D-9BF6-7D16-4BB6CA19A914

Fri Jun 27 00:35:00 2014
panic(cpu 0 caller 0xffffff80036dbf5e): Kernel trap at 0x0000000000000400, type 14=page fault, registers:
CR0: 0x000000008001003b, CR2: 0x0000000000000400, CR3: 0x0000000005ba2000, CR4: 0x00000000001606e0
RAX: 0x0000000000000400, RBX: 0xffffff81f25d6d58, RCX: 0xffffff8234326328, RDX: 0xffffff81f90c8d78
RSP: 0xffffff8217b73958, RBP: 0xffffff8217b739b0, RSI: 0xffffff823d2f5ad8, RDI: 0xffffff8210500ba0
R8:  0x0000000000000000, R9:  0xffffff8003c01910, R10: 0x00000000000003ff, R11: 0xffffffffffffffff
R12: 0xffffff81f25d6d98, R13: 0x000000000003cd02, R14: 0xffffff81f25d6d78, R15: 0xffffff81f91aa148
RFL: 0x0000000000010206, RIP: 0x0000000000000400, CS:  0x0000000000000008, SS:  0x0000000000000010
Fault CR2: 0x0000000000000400, Error code: 0x0000000000000010, Fault CPU: 0x0

Backtrace (CPU 0), Frame : Return Address
0xffffff8217b735e0 : 0xffffff8003622fa9 mach_kernel : _panic + 0xc9
0xffffff8217b73660 : 0xffffff80036dbf5e mach_kernel : _kernel_trap + 0x7fe
0xffffff8217b73830 : 0xffffff80036f3456 mach_kernel : _return_from_trap + 0xe6
0xffffff8217b73850 : 0x400 
0xffffff8217b739b0 : 0xffffff7f83d311f9 net.lundman.zfs : _zio_done + 0xff9
0xffffff8217b73ad0 : 0xffffff7f83d2c78a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff8217b73b10 : 0xffffff7f83d313d2 net.lundman.zfs : _zio_done + 0x11d2
0xffffff8217b73c30 : 0xffffff7f83d2c78a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff8217b73c70 : 0xffffff7f83d313d2 net.lundman.zfs : _zio_done + 0x11d2
0xffffff8217b73d90 : 0xffffff7f83d2c78a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff8217b73dd0 : 0xffffff7f83d313d2 net.lundman.zfs : _zio_done + 0x11d2
0xffffff8217b73ef0 : 0xffffff7f83d2c78a net.lundman.zfs : ___zio_execute + 0x12a
0xffffff8217b73f30 : 0xffffff7f83d2c655 net.lundman.zfs : _zio_execute + 0x15
0xffffff8217b73f50 : 0xffffff7f83c1e217 net.lundman.spl : _taskq_thread + 0xc7
0xffffff8217b73fb0 : 0xffffff80036d7127 mach_kernel : _call_continuation + 0x17
      Kernel Extensions in backtrace:
         net.lundman.spl(1.0)[205406D0-4396-3572-B257-19B5A81B1084]@0xffffff7f83c1a000->0xffffff7f83c2afff
         net.lundman.zfs(1.0)[CA9C82FD-1FA0-3927-A8B9-5DFB3141B3FD]@0xffffff7f83c2b000->0xffffff7f83e3cfff
            dependency: com.apple.iokit.IOStorageFamily(1.9)[9B09B065-7F11-3241-B194-B72E5C23548B]@0xffffff7f83bec000
            dependency: net.lundman.spl(1.0.0)[205406D0-4396-3572-B257-19B5A81B1084]@0xffffff7f83c1a000

BSD process name corresponding to current thread: kernel_task
Boot args: -v keepsyms=y darkwake=0
lundman commented 10 years ago
0xffffff8217b73850 : 0x400 
0xffffff8217b739b0 : 0xffffff7f83d311f9 net.lundman.zfs : _zio_done + 0xff9

So zio_done jumps to 0x400, but zio has no function pointers that it calls, only symbols. That certainly seems to support memory corruption. Whether it comes from a code bug or not is the question. Are these weird panics only happening on vdev-iokit?

rottegift commented 10 years ago

I got a panic in master earlier but unfortunately it did not leave a crashdump. :-(