gmzang / maczfs

Automatically exported from code.google.com/p/maczfs

double-fault panic (zfs_reclaim) (applies to deprecated build) #80

Closed (GoogleCodeExporter closed this issue 8 years ago)

GoogleCodeExporter commented 8 years ago
It's not entirely clear what triggers this. I've seen this panic in I/O from a 
wide variety of applications, not many of which seem all that intensive. It 
happened very frequently for a while, then stopped for about a month. It's 
happened several times in the last few days, so I reckoned it was time to file  
a bug report.

Here are a few recent panic analyses:

0xffffff8000204b99 <panic+608>: mov    0x41e6b5(%rip),%esi        # 0xffffff8000623254 <panic_is_inited>
0xffffff80002cf17f <panic_64+350>:      add    $0xc8,%rsp
0xffffff80002e29ef <hndl_double_fault+15>:      mov    %r12,%rsp
0xffffff7f8185765c <dbuf_hold+27>:      add    %al,(%rax)
0xffffff7f81862134 <dnode_hold_impl+137>:       jne    0xffffff7f81862126 <dnode_hold_impl+123>
0xffffff7f818597d6 <dmu_buf_hold+52>:   adc    %al,-0x74f18b0a(%rbp)
0xffffff7f81882b51 <zap_lockdir+78>:    test   %ah,(%rax,%rdi,4)
0xffffff7f81882f0f <zap_cursor_retrieve+104>:   lea    -0x40(%rbp),%edx
0xffffff7f81886314 <zfs_rmnode+503>:    add    %eax,(%rax)
0xffffff7f81893483 <zfs_zinactive+100>: push   %r12
0xffffff7f8188dd1c <zfs_vnop_reclaim+145>:      mov    0x28(%rdi),%rdi
0xffffff800023d35b <VNOP_RECLAIM+44>:   leaveq 
0xffffff80002379d7 <vclean+384>:        test   %eax,%eax
0xffffff8000237b59 <vnode_reclaim_internal+197>:        movzwl 0x70(%r12),%eax
0xffffff80002f9e54 <vnode_create+854>:  cmpq   $0xdeadb,0x20(%r12)
0xffffff7f81893cbe <zfs_attach_vnode+352>:      std    
0xffffff7f81894006 <zfs_zget_internal+722>:     mov    $0x8c,%bl
0xffffff7f81886352 <zfs_rmnode+565>:    (bad)  
0xffffff7f81893483 <zfs_zinactive+100>: push   %r12
0xffffff7f8188dd1c <zfs_vnop_reclaim+145>:      mov    0x28(%rdi),%rdi
0xffffff800023d35b <VNOP_RECLAIM+44>:   leaveq 
0xffffff80002379d7 <vclean+384>:        test   %eax,%eax
0xffffff8000237b59 <vnode_reclaim_internal+197>:        movzwl 0x70(%r12),%eax
0xffffff80002f9e54 <vnode_create+854>:  cmpq   $0xdeadb,0x20(%r12)
0xffffff7f81893cbe <zfs_attach_vnode+352>:      std    
0xffffff7f81894006 <zfs_zget_internal+722>:     mov    $0x8c,%bl
0xffffff7f81886352 <zfs_rmnode+565>:    (bad)  
0xffffff7f81893483 <zfs_zinactive+100>: push   %r12
0xffffff7f8188dd1c <zfs_vnop_reclaim+145>:      mov    0x28(%rdi),%rdi
0xffffff800023d35b <VNOP_RECLAIM+44>:   leaveq 
0xffffff80002379d7 <vclean+384>:        test   %eax,%eax
0xffffff8000237b59 <vnode_reclaim_internal+197>:        movzwl 0x70(%r12),%eax

0xffffff8000204b99 <panic+608>: mov    0x41e6b5(%rip),%esi        # 0xffffff8000623254 <panic_is_inited>
0xffffff80002cf17f <panic_64+350>:      add    $0xc8,%rsp
0xffffff80002e29ef <hndl_double_fault+15>:      mov    %r12,%rsp
0xffffff8000210c4a <memory_object_lock_request+222>:    movzwl 0x74(%rbx),%eax
0xffffff800025043c <ubc_msync_internal+157>:    test   %eax,%eax
0xffffff8000237958 <vclean+257>:        mov    %r14d,%eax
0xffffff8000237b59 <vnode_reclaim_internal+197>:        movzwl 0x70(%r12),%eax
0xffffff80002f9e54 <vnode_create+854>:  cmpq   $0xdeadb,0x20(%r12)
0xffffff7f80a0dcbe <zfs_attach_vnode+352>:      std    
0xffffff7f80a0e006 <zfs_zget_internal+722>:     mov    $0x8c,%bl
0xffffff7f80a00352 <zfs_rmnode+565>:    (bad)  
0xffffff7f80a0d483 <zfs_zinactive+100>: push   %r12
0xffffff7f80a07d1c <zfs_vnop_reclaim+145>:      mov    0x28(%rdi),%rdi
0xffffff800023d35b <VNOP_RECLAIM+44>:   leaveq 
0xffffff80002379d7 <vclean+384>:        test   %eax,%eax
0xffffff8000237b59 <vnode_reclaim_internal+197>:        movzwl 0x70(%r12),%eax
0xffffff8000237e54 <vnode_put_locked+161>:      mov    %rbx,%rdi
0xffffff8000237e8c <vnode_put+33>:      mov    %eax,%r12d
0xffffff8000234aeb <vnode_update_identity+801>: mov    0x1d0(%rbx),%rdi
0xffffff8000237a00 <vnode_lock>:        mov    %r12,%rdi
0xffffff8000237b59 <vnode_reclaim_internal+197>:        movzwl 0x70(%r12),%eax
0xffffff80002f9e54 <vnode_create+854>:  cmpq   $0xdeadb,0x20(%r12)
0xffffff7f80a0dcbe <zfs_attach_vnode+352>:      std    
0xffffff7f80a0e006 <zfs_zget_internal+722>:     mov    $0x8c,%bl
0xffffff7f80a00352 <zfs_rmnode+565>:    (bad)  
0xffffff7f80a0d483 <zfs_zinactive+100>: push   %r12
0xffffff7f80a07d1c <zfs_vnop_reclaim+145>:      mov    0x28(%rdi),%rdi
0xffffff800023d35b <VNOP_RECLAIM+44>:   leaveq 
0xffffff80002379d7 <vclean+384>:        test   %eax,%eax
0xffffff8000237b59 <vnode_reclaim_internal+197>:        movzwl 0x70(%r12),%eax
0xffffff8000237e54 <vnode_put_locked+161>:      mov    %rbx,%rdi
0xffffff8000237e8c <vnode_put+33>:      mov    %eax,%r12d

What version of the product are you using? On what operating system?
I'm using 74.0.1 on 10.6.5.

Please provide any additional information below.

Original issue reported on code.google.com by buffer.g...@gmail.com on 6 Feb 2011 at 2:36

GoogleCodeExporter commented 8 years ago
I'm on the road without access to the original logs, but after doing some 
preliminary research on this, my hypothesis is that this is a kernel stack 
overflow caused by recursion in the vnode_reclaim code. I haven't been able to 
confirm what the reclaim is for, but I'm assuming it's a form of garbage 
collection or cache cleaning in a filesystem's UBC. I'll check in with the list 
on this.
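For reference, reading the frames in the traces above bottom-up, the repeating 
cycle appears to be zfs_vnop_reclaim -> zfs_zinactive -> zfs_rmnode -> 
zfs_zget_internal -> zfs_attach_vnode -> vnode_create -> vnode_reclaim_internal 
-> vclean -> VNOP_RECLAIM -> zfs_vnop_reclaim again. The following is only a 
self-contained user-space model of that cycle; the function bodies are invented 
stand-ins, not the xnu or MacZFS sources, and only the call order is taken from 
the panic traces. It just shows how each pass adds another set of frames to a 
fixed-size stack.

/* Hypothetical model of the recursion suggested by the backtraces above.
 * Every body below is a stand-in; a depth counter takes the place of the
 * fixed-size kernel stack. */
#include <stdio.h>

#define FAKE_STACK_BUDGET 64            /* stand-in for the kernel stack limit */

static int depth;

static void zfs_vnop_reclaim(void);     /* forward declaration closes the loop */

static void vclean(void)                 { zfs_vnop_reclaim(); }       /* via VNOP_RECLAIM          */
static void vnode_reclaim_internal(void) { vclean(); }
static void vnode_create(void)           { vnode_reclaim_internal(); } /* reclaims a vnode to reuse */
static void zfs_attach_vnode(void)       { vnode_create(); }           /* needs a fresh vnode       */
static void zfs_zget_internal(void)      { zfs_attach_vnode(); }
static void zfs_rmnode(void)             { zfs_zget_internal(); }
static void zfs_zinactive(void)          { zfs_rmnode(); }

static void zfs_vnop_reclaim(void)
{
    if (++depth > FAKE_STACK_BUDGET) {  /* where the real kernel double-faults */
        printf("stack budget exhausted after %d passes through the cycle\n",
               FAKE_STACK_BUDGET);
        return;
    }
    zfs_zinactive();
}

int main(void)
{
    zfs_vnop_reclaim();
    return 0;
}

If this is what is happening in the kernel, a fix would have to break or bound 
the cycle itself rather than rely on a deeper stack.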

Original comment by buffer.g...@gmail.com on 16 Feb 2011 at 8:11

GoogleCodeExporter commented 8 years ago
The ZFS reclaim (or, more generally, reclaim in OSX vnode parlance) is the 
cleanup of any FS-specific resources associated with the OSX vnode. It 
typically happens after a file is deleted. Whereas the upstream Solaris code 
cleans up these resources as part of the delete itself, in OSX the cleanup is 
split into two parts: cleaning up the ZFS data and then cleaning up the vnode.

I know I need to look into this some more because it's the same issue with the 
MacZFS 77 code. I'm also not sure whether we might be better off ditching the 
zfs_reclaim stuff and just having that code in the zfs_delete node.

You might find that, whilst this doesn't always happen under high load, it does 
happen when many deletions are taking place. Creating or updating files is less 
likely to trigger this panic.
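
To make the split concrete, here is a heavily simplified, self-contained model 
of the two paths described above. Every name in it is a stand-in rather than a 
real Solaris, xnu, or MacZFS routine.

/* Illustrative-only model of the cleanup split described above. */
#include <stdio.h>
#include <stdlib.h>

typedef struct znode { int id; } znode_t;            /* FS-specific state     */
typedef struct vnode { znode_t *fs_data; } vnode_t;  /* the OSX vnode wrapper */

static void teardown_fs_data(znode_t *zp)            /* stand-in helper */
{
    printf("freeing ZFS data for node %d\n", zp->id);
    free(zp);
}

/* Upstream Solaris style: the delete itself tears down the FS resources. */
static void delete_solaris_style(znode_t *zp)
{
    teardown_fs_data(zp);
}

/* OSX style: the delete defers the work, and the VFS later drives it in two
 * parts: first calling back into the FS (VNOP_RECLAIM -> zfs_vnop_reclaim)
 * to free the ZFS data, then cleaning up the vnode itself. */
static void reclaim_osx_style(vnode_t *vp)
{
    teardown_fs_data(vp->fs_data);   /* part 1: the FS-specific cleanup */
    printf("freeing vnode\n");       /* part 2: done by the vnode layer */
    free(vp);
}

int main(void)
{
    znode_t *a = malloc(sizeof *a);
    a->id = 1;
    delete_solaris_style(a);

    vnode_t *v = malloc(sizeof *v);
    v->fs_data = malloc(sizeof *v->fs_data);
    v->fs_data->id = 2;
    reclaim_osx_style(v);
    return 0;
}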

Original comment by alex.ble...@gmail.com on 16 Feb 2011 at 9:32

GoogleCodeExporter commented 8 years ago
I'm finding that this doesn't seem to happen in connection with file deletions, 
with the notable exception of crashes caused by htclean (the cache-cleaning 
process for Apache running as a proxy server). I can't be sure, since the panic 
doesn't tell us which system call was made by the application identified as its 
source, but some of these applications don't appear to be doing anything more 
than scanning directories or mount points to initialise an installer (installd) 
or present a file selection dialogue.

Apple's suggestion for resolving these kinds of issues is to replace recursive 
algorithms with iterative ones. I've only scanned the code thus far and have no 
clear idea whether this is practical. I'd love to increase the stack size as a 
workaround, but as far as I can work out this is non-trivial, since the sysctl 
kern.stack_size setting is read-only. There's not a lot of documentation on 
this variable, but changing it appears to require recompiling the kernel.
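
As a quick check of the current limit, kern.stack_size can at least be read 
from user space. A minimal sketch, assuming the sysctl exposes a plain integer 
value; actually raising the limit would, as noted above, still require a 
kernel rebuild.

/* Minimal user-space sketch: read the (read-only) kern.stack_size sysctl. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void)
{
    unsigned long long stack_size = 0;   /* roomy enough for a 32- or 64-bit value */
    size_t len = sizeof(stack_size);

    if (sysctlbyname("kern.stack_size", &stack_size, &len, NULL, 0) != 0) {
        perror("sysctlbyname(kern.stack_size)");
        return EXIT_FAILURE;
    }
    printf("kern.stack_size = %llu bytes (reported size: %zu bytes)\n",
           stack_size, len);
    return EXIT_SUCCESS;
}

Running sysctl kern.stack_size from a shell should show the same number.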

Original comment by buffer.g...@gmail.com on 2 Mar 2011 at 8:46

GoogleCodeExporter commented 8 years ago

Original comment by alex.ble...@gmail.com on 19 Mar 2011 at 8:31