openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/
Other
824 stars 72 forks source link

zfs_util crashes on restart #776

Open jawbroken opened 4 years ago

jawbroken commented 4 years ago

I have OpenZFS_on_OS_X_1.9.4 2020-02-29 installed on macOS Catalina 10.15.6. I'm getting crashes in zfs_util on every restart. They always seem to come in pairs, as follows.

Example pair:

zfs_util_2020-09-22-232650_fileserver.crash

Process:               zfs_util [622]
Path:                  /Library/Filesystems/zfs.fs/Contents/Resources/zfs_util
Identifier:            zfs_util
Version:               0
Code Type:             X86-64 (Native)
Parent Process:        diskarbitrationd [195]
Responsible:           diskarbitrationd [195]
User ID:               0

Date/Time:             2020-09-22 23:26:50.074 +1000
OS Version:            Mac OS X 10.15.6 (19G2021)
Report Version:        12
Anonymous UUID:        77D92FC9-1CD8-9252-E068-3F13655407BE

Time Awake Since Boot: 55 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Segmentation fault: 11
Termination Reason:    Namespace SIGNAL, Code 0xb
Terminating Process:   exc handler [622]

VM Regions Near 0:
--> 
    __TEXT                 000000010c369000-000000010c53e000 [ 1876K] r-x/r-x SM=COW  /Library/Filesystems/zfs.fs/Contents/Resources/zfs_util

Application Specific Information:
dyld2 mode

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_platform.dylib        0x00007fff69164e52 _platform_strlen + 18
1   libsystem_platform.dylib        0x00007fff69165718 _platform_strlcpy + 28
2   libsystem_c.dylib               0x00007fff6903b279 __strlcpy_chk + 30
3   zfs_util                        0x000000010c36b403 main + 2139
4   libdyld.dylib                   0x00007fff68f6ecc9 start + 1

Thread 0 crashed with X86 Thread State (64-bit):
  rax: 0x0000000000000000  rbx: 0x0000000000000080  rcx: 0x0000000000000000  rdx: 0x0000000000000000
  rdi: 0x0000000000000000  rsi: 0x0000000000000000  rbp: 0x00007ffee3895b30  rsp: 0x00007ffee3895b30
   r8: 0x000000000000065d   r9: 0x0000000000000022  r10: 0x00007ffee3896d90  r11: 0x00007fff6903b25b
  r12: 0x0000000000000080  r13: 0x0000000000000006  r14: 0x00007fa90740a340  r15: 0x0000000000000000
  rip: 0x00007fff69164e52  rfl: 0x0000000000010246  cr2: 0x0000000000000000

Logical CPU:     1
Error Code:      0x00000004 (no mapping for user data read)
Trap Number:     14

zfs_util_2020-09-22-232739_tbfileserver.crash

Process:               zfs_util [821]
Path:                  /Library/Filesystems/zfs.fs/Contents/Resources/zfs_util
Identifier:            zfs_util
Version:               0
Code Type:             X86-64 (Native)
Parent Process:        diskarbitrationd [195]
Responsible:           diskarbitrationd [195]
User ID:               0

Date/Time:             2020-09-22 23:27:14.596 +1000
OS Version:            Mac OS X 10.15.6 (19G2021)
Report Version:        12
Anonymous UUID:        77D92FC9-1CD8-9252-E068-3F13655407BE

Time Awake Since Boot: 80 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x0000000000000000
Exception Note:        EXC_CORPSE_NOTIFY

Termination Signal:    Segmentation fault: 11
Termination Reason:    Namespace SIGNAL, Code 0xb
Terminating Process:   exc handler [821]

VM Regions Near 0:
--> 
    __TEXT                 0000000106275000-000000010644a000 [ 1876K] r-x/r-x SM=COW  /Library/Filesystems/zfs.fs/Contents/Resources/zfs_util

Application Specific Information:
dyld2 mode

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_platform.dylib        0x00007fff69164e52 _platform_strlen + 18
1   libsystem_platform.dylib        0x00007fff69165718 _platform_strlcpy + 28
2   libsystem_c.dylib               0x00007fff6903b279 __strlcpy_chk + 30
3   zfs_util                        0x0000000106277403 main + 2139
4   libdyld.dylib                   0x00007fff68f6ecc9 start + 1

Thread 0 crashed with X86 Thread State (64-bit):
  rax: 0x0000000000000000  rbx: 0x0000000000000080  rcx: 0x0000000000000000  rdx: 0x0000000000000000
  rdi: 0x0000000000000000  rsi: 0x0000000000000000  rbp: 0x00007ffee9989b30  rsp: 0x00007ffee9989b30
   r8: 0x0000000000000043   r9: 0x0000000000000024  r10: 0x00007ffee998ad90  r11: 0x00007fff6903b25b
  r12: 0x0000000000000080  r13: 0x0000000000000007  r14: 0x00007f8a64504080  r15: 0x0000000000000000
  rip: 0x00007fff69164e52  rfl: 0x0000000000010246  cr2: 0x0000000000000000

Logical CPU:     2
Error Code:      0x00000004 (no mapping for user data read)
Trap Number:     14

I only see two uses of strlcpy in zfsutil, here and here. The latter is perhaps more supicious, being introduced in this commit from April 2019 and mentioning diskarbitrationd, etc.

lundman commented 4 years ago

Seems related to:

# lldb /Library/Filesystems/zfs.fs/Contents/Resources/zfs_util

(lldb) image lookup --verbose --address main+2139
      Address: zfs_util[0x0000000100002403] (zfs_util.__TEXT.__text + 4419)
      Summary: zfs_util`main + 2139 [inlined] zfs_probe_iokit + 40 at zfs_util.c:733
               zfs_util`main + 2099 at zfs_util.c:733
       Module: file = "/Library/Filesystems/zfs.fs/Contents/Resources/zfs_util", arch = "x86_64"
  CompileUnit: id = {0x00000000}, file = "/Users/lundman/Developer/catalina/zfs/cmd/zfs_util/zfs_util.c", language = "c99"
     Function: id = {0x100000aec}, name = "main", range = [0x0000000100001ba8-0x0000000100002458)
     FuncType: id = {0x100000aec}, byte-size = 0, decl = zfs_util.c:609, compiler_type = "int (int, char **)"
       Blocks: id = {0x100000aec}, range = [0x100001ba8-0x100002458)
               id = {0x100000c86}, ranges = [0x100001e75-0x100001e96)[0x100001fe5-0x10000200f)[0x100002110-0x1000021f9)[0x1000023db-0x100002438), name = "zfs_probe_iokit", decl = zfs_util.c:418
    LineEntry: [0x0000000100002403-0x000000010000240b): /Users/lundman/Developer/catalina/zfs/cmd/zfs_util/zfs_util.c:475:3
       Symbol: id = {0x000009e8}, range = [0x0000000100001ba8-0x0000000100002484), name="main"
     Variable: id = {0x100000c92}, name = "devpath", type = "const char *", location = , decl = zfs_util.c:418
     Variable: id = {0x100000c97}, name = "args", type = "probe_args_t *", location = , decl = zfs_util.c:418
     Variable: id = {0x100000c9c}, name = "result", type = "int", location = DW_OP_consts(0x-1) , decl = zfs_util.c:424
     Variable: id = {0x100000ca5}, name = "cfstr", type = "CFStringRef", location = r15, decl = zfs_util.c:423
     Variable: id = {0x100000cae}, name = "service", type = "io_service_t", location = r14, decl = zfs_util.c:422
     Variable: id = {0x100000b26}, name = "blockdevice", type = "char [1024]", location = DW_OP_fbreg(-3200), decl = zfs_util.c:612
     Variable: id = {0x100000b36}, name = "rawdevice", type = "char [1024]", location = DW_OP_fbreg(-2176), decl = zfs_util.c:613
     Variable: id = {0x100000b46}, name = "volname", type = "char [1024]", location = DW_OP_fbreg(-4368), decl = zfs_util.c:626
     Variable: id = {0x100000b56}, name = "pool_name", type = "char *", location = rbx, decl = zfs_util.c:627
     Variable: id = {0x100000b86}, name = "what", type = "char", location = , decl = zfs_util.c:614
     Variable: id = {0x100000bb2}, name = "i", type = "int", location = r13, decl = zfs_util.c:623
     Variable: id = {0x100000bc2}, name = "sb", type = "stat", location = [rbp-3344], decl = zfs_util.c:618
     Variable: id = {0x100000bf2}, name = "len", type = "int", location = DW_OP_consts(0x256) , decl = zfs_util.c:623
     Variable: id = {0x100000c12}, name = "attr", type = "attrlist", location = , decl = zfs_util.c:624
     Variable: id = {0x100000c1e}, name = "nameBuf", type = "attrNameBuf", location = , decl = zfs_util.c:625
                cfstr = IORegistryEntryCreateCFProperty(service,
                        CFSTR("ZFS Dataset"), kCFAllocatorDefault, 0);
                if (cfstr != NULL) {
                        (void) strlcpy(args->pool_name,
                                CFStringGetCStringPtr(cfstr,
                                        kCFStringEncodingMacRoman),
                                sizeof (io_name_t));

Makes me wonder what string is in your "ZFS Dataset". Ie, can you run: # ioreg -al and search for "ZFS Dataset". My test pool194 has:

   <key>ZFS Dataset</key>
                            <string>pool194</string>
jawbroken commented 4 years ago

I have two pools, so I get two results that look benign:

<key>ZFS Dataset</key>
<string>lstorage</string>
…
<key>ZFS Dataset</key>
<string>tbstorage</string>
rejsmont commented 4 years ago

I am experiencing exactly the same crashes (same symbols), and additionally, random reboots upon high load when the ZFS pool is imported. I did not find any kernel panic dumps related to these reboots. Not sure if it's related, if you think it's not, I can submit a separate issue.

# uname -a
Darwin SFLab-MacBookPro.local 19.6.0 Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64 x86_64

# zfs --version
zfs-1.9.3-0
zfs-kmod-1.9.3-0

Cheers!

lundman commented 4 years ago

I tightened up that code and hopefully it will do better. It will be in the v2.0 version, but you could extract the zfs_util and give it a try.

jawbroken commented 4 years ago

I am experiencing exactly the same crashes (same symbols), and additionally, random reboots upon high load when the ZFS pool is imported. I did not find any kernel panic dumps related to these reboots. Not sure if it's related, if you think it's not, I can submit a separate issue.

I was experiencing similar random reboots which is what led me to look into and report this issue, but I'm not sure the two things are related (I didn't initially realise the zfs_util crashes were happening after restart). I was also seeing mds crashes, but I'm not sure whether that's relevant either. So I have disabled all mds activity using mdutil, which also meant having to unmount my HFS+ Time Machine drive (you can't disable mds on a Time Machine drive), and haven't had a crash in the last 22 days. When I find time I'll start re-enabling things and remounting that drive and try to isolate the issue, because it's unclear if it's ZFS related at all.

Edit: Though this part of the 1.9.4 changelog made me suspect that the mds crashes could theoretically be related: “Reserved inode numbers avoided (Catalina, spotlight might need to reindex)”

eejd commented 3 years ago

I have been experiencing what appears to be the same kernel panic caused by what appears to be the same zfs_util bad memory access. Do you need more logs? Should we try using the 2.0 RC already or is there a branch with this fix backported?

jawbroken commented 3 years ago

Are you sure the kernel panics are being caused by zfs_util? I originally had similar suspicions until i realised the zfs_util crashes were happening after reboot, not before, and therefore appear to be relatively benign. See “Time Awake Since Boot: 55 seconds” in my above logs.