openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/

Machine hangs/panics? on AFP #250

Closed jollyjinx closed 9 years ago

jollyjinx commented 9 years ago

I've moved from ZEVO to O3X. I'm using ZFS on a Mac mini 2012 (10.8), with 2x256 internal SSDs (caches) and 5x 2TB WD Passport drives (4x USB 3.0, 1x FW800) in a raidz2 configuration. I have an HDMI screen dongle attached to the machine (so screen sharing works fast), as no real screen is attached. I haven't had problems with the machine for the last two years, and I'm doing a scrub every two weeks.

I use the machine to record Sat>IP television streams, so I have a lot of TV shows on it. After switching from ZEVO to O3X I have now had three kernel panics with HEAD. I can reproduce the kernel panic by doing the following:

I mount the ZFS dataset via AFP on a MacBook (both machines connected via gigabit), then I run this at a zsh command line:

KRO>cd /Volumes/media/Media/TV/Tatort/
KRO>for i in */*.mp4
for> do
for> dd if=$i of=/dev/null bs=10m
for> done

So it's just reading the gigabyte-sized episodes via dd over AFP. After about 40 to 70 GBytes, dd gets an input/output error, and following that come "Socket is not connected" errors:

774222764 bytes transferred in 10.133685 secs (76400910 bytes/sec)
111+1 records in
111+1 records out
1171012259 bytes transferred in 14.534614 secs (80567138 bytes/sec)
  dd: Lenz/1987.01.25-Lenz-Die Macht des Schicksals.mp4: Input/output error
240+0 records in
240+0 records out
2516582400 bytes transferred in 220.603747 secs (11407705 bytes/sec)
dd: Lutz/1978.01.01-Lutz-Rot – rot – tot.mp4: Socket is not connected
dd: Lürsen/2013.06.09-Luersen-Er wird toeten.mp4: Socket is not connected
dd: Lürsen/2014.05.18-Luersen-Alle meine Jungs.mp4: Socket is not connected
dd: Markowitz/1991.07.07-Markowitz-Tini.mp4: Socket is not connected
dd: Markowitz/1993.04.25-Markowitz-Tod einer alten Frau.mp4: Socket is not connected

I was able to reproduce that error three times in a row, and the machines were not doing anything else at the time, so I think it's clear that there is a problem. I can't see what happened, as I'm no longer able to connect to the machine: a ping works, but ssh does not. Open ssh connections are dead at that point as well, and I can't find any kernel panic logs.

I'm switching back to ZEVO for now. Luckily the crash happened just before I wanted to upgrade the pool; it's still version 28, so I have a way back.
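
(For reference, a quick way to confirm the on-disk pool version before deciding to upgrade; the pool name "ocean" is taken from the zpool status output further down this thread:)

# prints the pool's on-disk version, e.g. 28, so you know the way back still exists
sudo zpool get version ocean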

I did the third try with the latest commit b7b42569a57e12a02c91bd88d1d9f4b8135f4282.

lundman commented 9 years ago

I have been trying to replicate this trouble all morning, with AFP between two OS X VMs, ZFS running on 10.10.1. Unfortunately, so far ZFS has behaved rather well.

It would be interesting to know if you can trigger this deadlock from ZFS itself, without AFP.

It does sound like a deadlock of some kind, which was fixed with 8887fc0f07832ae2535de636c542e0b24f574e54 (and 5950324b7c0a80ff9e96af1bfd26ef7b40669137). But you do say you are up to date, so presumably you are still triggering something.
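
For example, a quick way to check whether a local checkout already contains those two commits (plain git, run in the zfs source tree):

# exits 0 and prints a message only if the commit is an ancestor of the current HEAD
git merge-base --is-ancestor 8887fc0f07832ae2535de636c542e0b24f574e54 HEAD && echo "8887fc0 present"
git merge-base --is-ancestor 5950324b7c0a80ff9e96af1bfd26ef7b40669137 HEAD && echo "5950324 present"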

Have you tried leaving an SSH connection open the whole time? Often deadlocks take out vnodes, so logging in will fail (too many files to open), but a shell that is already open can still be used "lightly", for example to run spindump.

Another option is to cause a panic and connect lldb to dump all threads.

Still, I will leave the test machines to continue the file reading, perhaps something will trigger eventually.

lundman commented 9 years ago

Actually, IRC brought up an interesting point: could you check whether you have

ZFS: unlinked drain completed (0).

in your logs and, in particular, what the number at the end is. It's possible the weekend's code, plus a panic, left you with a bad unlinked drain, which will strike "at some time after" import. If that is the case, you want to set sysctl -w kstat.zfs.darwin.tunable.skip_unlinked_drain=1 before import to stop the crash.
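
In other words, something along these lines before the import (a sketch; the pool name "ocean" comes from the zpool status output later in this thread):

sudo sysctl -w kstat.zfs.darwin.tunable.skip_unlinked_drain=1   # must be set before the pool is imported
sudo zpool import ocean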

brendonhumphrey commented 9 years ago

I too have had a go at reproducing your issue.

I used the same ksh/for/dd setup as you to transfer 89 GB of .m4v files (encoded DVDs) from a Mac Pro running 10.10 (dataset on a single SATA HDD) to an iMac via GigE using AFP.

Three iterations completed without issue, with transfer speeds in the 75-104 MB/s range.

jollyjinx commented 9 years ago

I have not found the unlinked drain message in the logs. I had an ssh connection open to the machine, and it hung along with the machine.

I'm in the process of reinstalling O3X to test whether I can reproduce it again. If I can, I will use the sysctl to see if that helps.

Update: I had used Console.app to search for 'zfs', and in Console.app I did not see the unlinked drain logs. I have now done a 'cat /var/log/system.log' and do see the ZFS unlinked drain log messages.

I had:

32  "ZFS: unlinked drain completed (0)" messages
1   "ZFS: unlinked drain completed (1)" messages
1   "ZFS: unlinked drain completed (2)" messages
1   "ZFS: unlinked drain completed (3)" messages

I just installed the current versions: zfs (b7b42569a57e12a02c91bd88d1d9f4b8135f4282), spl (e2053b8a9f8d615ebaff9f0624bc3934cbb4d74c).

The first 'zpool import' failed to import one of the disks in the pool, and I could not bring the disk online:

mediamac:~ root# zpool status
  pool: ocean
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
    invalid.  Sufficient replicas exist for the pool to continue
    functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: resilvered 2.84T in 17h52m with 0 errors on Tue Nov 18 09:11:50 2014
config:

    NAME                     STATE     READ WRITE CKSUM
    ocean                    DEGRADED     0     0     0
      raidz2-0               DEGRADED     0     0     0
        disk3s2              ONLINE       0     0     0
        disk6s2              ONLINE       0     0     0
        disk5s2              ONLINE       0     0     0
        disk4s2              ONLINE       0     0     0
        4123860029131947953  FAULTED      0     0     0  was /dev/disk4s2
    logs
      mirror-1               ONLINE       0     0     0
        disk1s4              ONLINE       0     0     0
        disk0s4              ONLINE       0     0     0
    cache
      disk0s5                ONLINE       0     0     0
      disk1s5                ONLINE       0     0     0

errors: No known data errors
mediamac:~ root# zpool online ocean 4123860029131947953
warning: device '4123860029131947953' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present

but an export/import then imported all disks correctly.
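
(Roughly this sequence; a sketch of the export/import cycle rather than the exact terminal session:)

sudo zpool export ocean
sudo zpool import ocean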

So then I tried the dd again, and sure enough the machine hung. I had an SSH session open and another ssh doing a 'tail -f system.log' (so it would not lose its file descriptor), but when the machine hung nothing was shown in the tail, nor were the ssh sessions responsive.

I then tried to do the dd locally, and it hung the machine even quicker.

After the machine hung, I rebooted and did an export, then set sysctl -w kstat.zfs.darwin.tunable.skip_unlinked_drain=1, imported, and tried again. It hung.

ilovezfs commented 9 years ago

@jollyjinx Do you mind running

pkgutil --pkgs | grep zfs | xargs -I "#" pkgutil --info "#" ; syslog -k Sender kernel | grep "Loaded"

Thanks

jollyjinx commented 9 years ago
MED>pkgutil --pkgs | grep zfs | xargs -I "#" pkgutil --info "#" ; syslog -k Sender kernel | grep "Loaded" 
package-id: com.getgreenbytes.community.zfs.pkg
version: 1.1.1
volume: /
location: System/Library/Filesystems
install-time: 1416306439
Nov 17 15:02:11 mediamac kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 17 15:05:52 localhost kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 17 15:05:52 localhost kernel[0] <Debug>: ZFS: Loaded module v0.6.3-1576_g8887fc0, ZFS pool version 5000, ZFS filesystem version 5
Nov 17 15:12:27 localhost kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 17 15:12:27 localhost kernel[0] <Debug>: ZFS: Loaded module v0.6.3-1576_g8887fc0, ZFS pool version 5000, ZFS filesystem version 5
Nov 18 09:32:27 localhost kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 18 09:32:27 localhost kernel[0] <Debug>: ZFS: Loaded module v0.6.3-1576_g8887fc0, ZFS pool version 5000, ZFS filesystem version 5
Nov 18 10:10:40 localhost kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 18 10:10:40 localhost kernel[0] <Debug>: ZFS: Loaded module v0.6.3-1576_g8887fc0, ZFS pool version 5000, ZFS filesystem version 5
Nov 18 10:25:30 localhost kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 18 10:25:30 localhost kernel[0] <Debug>: ZFS: Loaded module v0.6.3-1580_gb7b4256, ZFS pool version 5000, ZFS filesystem version 5
Nov 18 11:00:09 localhost kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 18 11:00:09 localhost kernel[0] <Debug>: ZFS: Loaded module v0.6.3-1580_gb7b4256, ZFS pool version 5000, ZFS filesystem version 5
Nov 19 09:00:42 mediamac kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 19 09:00:42 mediamac kernel[0] <Debug>: ZFS: Loaded module v0.6.3-1580_gb7b4256, ZFS pool version 5000, ZFS filesystem version 5
Nov 19 09:03:12 localhost kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 19 09:03:12 localhost kernel[0] <Debug>: ZFS: Loaded module v0.6.3-1580_gb7b4256, ZFS pool version 5000, ZFS filesystem version 5
Nov 19 09:41:54 localhost kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 19 09:41:54 localhost kernel[0] <Debug>: ZFS: Loaded module v0.6.3-1580_gb7b4256, ZFS pool version 5000, ZFS filesystem version 5
Nov 19 09:49:48 localhost kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 19 09:49:48 localhost kernel[0] <Debug>: ZFS: Loaded module v0.6.3-1580_gb7b4256, ZFS pool version 5000, ZFS filesystem version 5
Nov 19 10:00:45 localhost kernel[0] <Debug>: SPL: Loaded module v0.6.3-1, (ncpu 8, memsize 17179869184, pages 4194304)
Nov 19 10:00:45 localhost kernel[0] <Debug>: ZFS: Loaded module v0.6.3-1580_gb7b4256, ZFS pool version 5000, ZFS filesystem version 5
jollyjinx commented 9 years ago

With help from @ilovezfs I installed version 1.3.0, and now my tests ran through. I'm now trying to figure out which commit works and which does not.

Update:

version from 2014.11.12 zfs commit a2b15d4dca6e5a696b9ea62f3ce052d99659d0f1 spl commit b8f3c23db598503e887ca0de6ebc7d1f358a79d6 does not work

version 2014.10.31 zfs commit ab864219efc390b33ed940b94a6b3abb46b684d3 spl commit 24a427353107f72e64ece200d4f1cd3d90a852c2 does not work

testing zfs commit 4e27df670c46a2479cd0b6d4737fa5dc6ad54dbd spl commit b80fd5982b7aa78329ab58a9efb17fb252043637 does not build

version 2014.10.30 zfs commit 4e27df670c46a2479cd0b6d4737fa5dc6ad54dbd spl commit a085265f52797060d5cc4b1f4d0fa1f0248b540d seems to work

version 2014.10.30 zfs commit 97cb0748c61eca0093b4ab71780fecaf0a80893c spl commit a085265f52797060d5cc4b1f4d0fa1f0248b540d seems to work

version 2014.10.28 zfs commit 5cca3c00639c8cb89c34a17ab2026e9fbf6a346f spl commit cc978ee798b333846d7449ed306685db4532441a seems to work

version 2014.10.27 zfs commit c4c44d6943e2206791025afa21a7a26c81ac24b2 spl commit 60ebe066bbbc587900cc6f8d3b5720a31486b945 seems to work

version from 2014.10.24 zfs 3bb43a397a5d0719d8826b1bcff3b7c076fd9108 spl 57dbdc36f2fb96199a2381118f2786b400365eb5 seems to work
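
Each row above corresponds to checking out a matching spl/zfs commit pair, rebuilding, and rerunning the dd test; roughly like this (a sketch with hypothetical clone paths, here using the 2014.10.31 pair):

cd ~/src/spl && git checkout 24a427353107f72e64ece200d4f1cd3d90a852c2
cd ~/src/zfs && git checkout ab864219efc390b33ed940b94a6b3abb46b684d3
# rebuild and reinstall both kexts as usual, reboot, then rerun the dd loop to see whether it hangs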

lundman commented 9 years ago

https://github.com/openzfsonosx/zfs/commit/d68441ea8a7d

jollyjinx commented 9 years ago

zfs=d68441ea8a7db80f68551d9d5bc3079c9f19055e spl=5cfc8d9177154291e7ef107bf74e6f23322f2fde failed

zfs=5312846af533f81c6791fd71a10f18167f55ddc0 spl=5cfc8d9177154291e7ef107bf74e6f23322f2fde failed

zfs=4853eb956ab2bd6d9cea7b17996347d1e661de3b spl=5cfc8d9177154291e7ef107bf74e6f23322f2fde failed

zfs=4e27df670c46a2479cd0b6d4737fa5dc6ad54dbd spl=5cfc8d9177154291e7ef107bf74e6f23322f2fde seems to work

jollyjinx commented 9 years ago

Out of curiosity I installed 10.10.1 on the machine (I created a disk image backup beforehand, of course), and there issue 250 does not appear. So it seems to be a 10.8 bug.

There, zfs=d68441ea8a7db80f68551d9d5bc3079c9f19055e spl=5cfc8d9177154291e7ef107bf74e6f23322f2fde seems to work.

jollyjinx commented 9 years ago

I have installed a clean 10.8 with the build lundman sent me this morning (to see if it was a compiler issue), but the problem still exists. So it seems not to be a compiler issue.

lundman commented 9 years ago

Just to collect the findings so far: it does not need AFP, and it needs to be 10.8. An NMI showed the panic was due to:

#7  0xffffff802c6b8549 in kernel_trap (state=0xffffff8042e3ba30, 
            lo_spp=0xffffff8042e3ba10) at 
            /SourceCache/xnu/xnu-2050.48.12/osfmk/i386/trap.c:705
#8  0xffffff802c6cebfd in trap_from_kernel () at 
            pmap_internal.h:1142
#9  0xffffff802c6b2519 in _bcopy () at pmap_internal.h:1142
#10 0xffffff802c971203 in uiomove64 (c_cp=<value temporarily 
            unavailable, due to optimizations>, n=<value temporarily 
            unavailable, due to optimizations>, uio=0xffffff8042e3ba20) at 
            /SourceCache/xnu/xnu-2050.48.12/bsd/kern/kern_subr.c:170
#11 0xffffff7fae37781f in dmu_read_uio (os=0xffffff824f7d1980, 
            object=183035, uio=0xffffff8042e3be70, size=393216) at dmu.c:1287
#12 0xffffff7fae43f722 in zfs_read (vp=0xffffff8054645c18, 
            uio=0xffffff8042e3be70, ioflag=0, cr=0xffffff804d42b560, 
            ct=0xffffff8042e3bf10) at zfs_vnops.c:762
#13 0xffffff7fae4503c8 in zfs_vnop_read (ap=0xffffff8042e3bdc0) at 
            zfs_vnops_osx.c:266
#14 0xffffff802c718f5d in VNOP_READ (vp=0xffffff8054645c18, 
            uio=0xffffff8042e3be70, ioflag=1048576, ctx=<value temporarily 
            unavailable, due to optimizations>) at 
            /SourceCache/xnu/xnu-2050.48.12/bsd/vfs/kpi_vfs.c:3688
#15 0xffffff802c70ef39 in vn_read (fp=0xffffff804e7800a8, 
            uio=<value temporarily unavailable, due to optimizations>, flags=0, 
            ctx=<value temporarily unavailable, due to optimizations>) at 
            /SourceCache/xnu/xnu-2050.48.12/bsd/vfs/vfs_vnops.c:928
#16 0xffffff802c97e9fe in dofileread (ctx=0xffffff8042e3bf10, 
            fp=0xffffff804e7800a8, bufp=4453113856, nbyte=10485760, 
            offset=<value temporarily unavailable, due to optimizations>, 
            flags=0) at 
            /SourceCache/xnu/xnu-2050.48.12/bsd/kern/sys_generic.c:371
#17 0xffffff802c97e822 in read_nocancel (p=0xffffff8050374460, 
            uap=0xffffff8050a0d9d4, retval=0xffffff8050a062a8) at 
            /SourceCache/xnu/xnu-2050.48.12/bsd/kern/sys_generic.c:216
#18 0xffffff802c9e97ba in unix_syscall64 (state=0xffffff8050a0d9d0) 
            at /SourceCache/xnu/xnu-2050.48.12/bsd/dev/i386/systemcalls.c:384
(gdb) p *uio
$3 = {
  uio_iovs = {
    kiovp = 0xffffff8042e3bea8,
    uiovp = 0xffffff8042e3bea8
  },
  uio_iovcnt = 1,
  uio_offset = 500826112,
  uio_segflg = UIO_USERSPACE64,
  uio_rw = UIO_READ,
  uio_resid_64 = 2490368,
  uio_size = 72,
  uio_max_iovs = 1,
  uio_flags = 1
}
size=393216
(gdb) p nbytes
$4 = 1048576

(gdb) p n
$5 = 3145728

(gdb) p zfs_read_chunk_size
$6 = 1048576
jollyjinx commented 9 years ago

It's a 16 GB machine. After setting the ARC max size to 8 GB the machine hung as well, but after setting it to a 2 GB ARC max size the machine ran through the test multiple times.
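
(A sketch of capping the ARC at 2 GB; the exact tunable name is an assumption based on the kstat.zfs.darwin.tunable namespace used for skip_unlinked_drain earlier in this thread:)

sudo sysctl -w kstat.zfs.darwin.tunable.zfs_arc_max=2147483648   # assumed tunable name; 2147483648 bytes = 2 GB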

jollyjinx commented 9 years ago

I upgraded the pool to see if that has an effect. So on 10.8 with 1.3.1rc1 and the upgraded pool it hangs, even though it's using less L2ARC. Then I disabled the ZIL and L2ARC; it still hangs, but later.
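
(How the ZIL and L2ARC were disabled isn't spelled out; two common ways, sketched with the device names from the zpool status output above:)

sudo zpool remove ocean mirror-1            # detach the mirrored log (ZIL) devices
sudo zpool remove ocean disk0s5 disk1s5     # detach the L2ARC cache devices
# or, per dataset: zfs set sync=disabled / zfs set secondarycache=none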

lundman commented 9 years ago

I believe we fought this valiantly and slayed it, took selfies, etc. Please reopen if it gets resurrected.