openzfsonosx / zfs

OpenZFS on OS X
https://openzfsonosx.org/

race condition / deadlock showing up at dnode_special_close #137

Closed ilovezfs closed 5 years ago

ilovezfs commented 10 years ago
 ssh o3x.org zfs send -R o3xpool@1396051829 | sudo zfs recv -F o3xpool

http://bpaste.net/show/196813/

dch commented 10 years ago

FWIW I do these most days, send from OSX to both OSX & FreeBSD servers via ssh. The receiving datasets are all readonly though.

e.g. zfs send -vR tub/shared/projects@20140331 | ssh wintermute sudo zfs recv -Fuv zroot/zfs/shared/projects and similar.

lundman commented 10 years ago

The panic is actually caused by the code that detects the old "dnode: ARC release bug triggered: %p (%lld)-- sorry" condition: it forcefully decrements the hung dnode's hold counter until it reaches 0 and continues, which then causes a panic elsewhere.
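
(For context, a purely hypothetical reconstruction of the shape of that old workaround; the actual O3X code may have differed:)

    /*
     * Hypothetical sketch only, not the actual O3X source: the workaround
     * noticed a dnode whose hold count never dropped, logged the "ARC
     * release bug" message, and then forced the count down so that
     * dnode_special_close() could proceed.
     */
    while (refcount_count(&dn->dn_holds) > 0) {
            printf("dnode: ARC release bug triggered: %p (%lld)-- sorry\n",
                dn, (longlong_t)refcount_count(&dn->dn_holds));
            (void) refcount_remove(&dn->dn_holds, NULL);    /* force it down */
    }

Once the count reaches 0 the dnode is torn down while other threads still hold references to it, hence the panic showing up elsewhere.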

Looking at why we are deadlocked, though:

The dnode's dbuf:

(lldb) p *db
(dmu_buf_impl_t) $2 = {
  db = (db_object = 0, db_offset = 0, db_size = 16384, db_data = 0xffffff808e90f098)
  db_objset = 0xffffff80915de848
  db_dnode_handle = 0xffffff80915de868
  db_parent = 0xffffff8090da05f0
  db_hash_next = 0x0000000000000000
  db_blkid = 0
  db_blkptr = 0xffffff808e97b320
  db_level = '\0'
  db_mtx = {
    m_owner = 0xffffff8015424b20
    m_lock = 0xffffff8021ad9170
    m_padding = ([0] = '\0', [1] = '\0', [2] = '\0', [3] = '\0', [4] = '\0', [5] = '\0', [6] = '\0', [7] = '\0', [8] = '\0', [9] = '\0', [10] = '\0', [11] = '\0', [12] = '\0', [13] = '\0')
  }
  db_state = DB_CACHED
  db_holds = (rc_count = 0)
  db_buf = 0xffffff809a53b3e0
  db_changed = (pad = 0, pad2 = 0)
  db_data_pending = 0x0000000000000000
  db_last_dirty = 0x0000000000000000
  db_link = {
    list_next = 0xffffff8090da0698
    list_prev = 0xffffff808ec74ce0
  }
  db_user_ptr = 0xffffff80935c7b38
  db_user_data_ptr_ptr = 0x0000000000000000
  db_evict_func = 0xffffff7f8919f6e0 (zfs`dnode_buf_pageout at dnode.c:977)
  db_immediate_evict = '\0'
  db_freed_in_flight = '\0'
  db_dirtycnt = '\0'
}

Threads from the spindump after we have deadlocked:

                                                 *993 VFS_ROOT + 198 (mach_kernel) [0xffffff80003fa1a6]
                                                   *993 zfs_vfs_root + 63 (zfs) [0xffffff7f80e352ff]
                                                     *993 rrw_enter + 45 (zfs) [0xffffff7f80dc8acd]
                                                       *993 rrw_enter_read + 165 (zfs) [0xffffff7f80dc87f5]
                                                         *993 spl_cv_wait + 51 (spl) [0xffffff7f80d524a3]
                                              *993 vnode_getattr + 119 (mach_kernel) [0xffffff80003fbe67]
                                               *993 zfs_vnop_getattr + 75 (zfs) [0xffffff7f80e44f2b]
                                                 *993 zfs_getattr + 147 (zfs) [0xffffff7f80e3d5f3]
                                                   *993 rrw_enter + 45 (zfs) [0xffffff7f80dc8acd]
                                                     *993 rrw_enter_read + 165 (zfs) [0xffffff7f80dc87f5]
                                                       *993 spl_cv_wait + 51 (spl) [0xffffff7f80d524a3]
                                            *993 vnode_getattr + 119 (mach_kernel) [0xffffff80003fbe67]
                                             *993 zfs_vnop_getattr + 75 (zfs) [0xffffff7f80e44f2b]
                                               *993 zfs_getattr + 147 (zfs) [0xffffff7f80e3d5f3]
                                                 *993 rrw_enter + 45 (zfs) [0xffffff7f80dc8acd]
                                                   *993 rrw_enter_read + 165 (zfs) [0xffffff7f80dc87f5]
                                                     *993 spl_cv_wait + 51 (spl) [0xffffff7f80d524a3]
                                                              *993 VFS_ROOT + 198 (mach_kernel) [0xffffff80003fa1a6]
                                                               *993 zfs_vfs_root + 63 (zfs) [0xffffff7f80e352ff]
                                                                 *993 rrw_enter + 45 (zfs) [0xffffff7f80dc8acd]
                                                                   *993 rrw_enter_read + 165 (zfs) [0xffffff7f80dc87f5]
                                                                     *993 spl_cv_wait + 51 (spl) [0xffffff7f80d524a3]
                                   *993 zfsdev_ioctl + 1260 (zfs) [0xffffff7f80e20d8c]
                                     *993 zfs_ioc_recv + 1470 (zfs) [0xffffff7f80e2389e]
                                       *993 dmu_recv_end + 71 (zfs) [0xffffff7f80d8dc57]
                                         *993 dmu_recv_existing_end + 124 (zfs) [0xffffff7f80d8dd8c]
                                           *993 dsl_sync_task + 567 (zfs) [0xffffff7f80db8e47]
                                             *993 txg_wait_synced + 261 (zfs) [0xffffff7f80dea235]
                                               *993 spl_cv_wait + 51 (spl) [0xffffff7f80d524a3]
   *993 txg_sync_thread + 962 (zfs) [0xffffff7f80de9f12]
     *993 spa_sync + 1101 (zfs) [0xffffff7f80dd9bad]
       *993 dsl_pool_sync + 1039 (zfs) [0xffffff7f80daf9cf]
         *993 dsl_sync_task_sync + 300 (zfs) [0xffffff7f80db90dc]
           *993 dmu_recv_end_sync + 421 (zfs) [0xffffff7f80d8e1d5]
             *993 dsl_dataset_clone_swap_sync_impl + 159 (zfs) [0xffffff7f80da2f7f]
               *993 dmu_objset_evict + 773 (zfs) [0xffffff7f80d85415]
                 *993 dnode_special_close + 62 (zfs) [0xffffff7f80d98c6e]
                   *993 delay_for_interval + 39 (mach_kernel) [0xffffff8000222197]
 *993 call_continuation + 23 (mach_kernel) [0xffffff80002d6ff7]
   *993 vnop_reclaim_thread + 238 (zfs) [0xffffff7f80e4460e]
     *993 rw_enter + 23 (spl) [0xffffff7f80d54127]
       *993 lck_rw_lock_shared_gen + 118 (mach_kernel) [0xffffff80002d5156]
         *993 thread_block_reason + 204 (mach_kernel) [0xffffff8000235d8c]

It is interesting to note that even the reclaim thread is stuck on the rw_enter lock.

The code/locks involved: the top four threads are waiting on

    ZFS_ENTER(zfsvfs);
#define ZFS_ENTER_NOERROR(zfsvfs)                                       \
        rrw_enter(&(zfsvfs)->z_teardown_lock, RW_READER, FTAG)

and the reclaim thread is waiting on

            rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
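
(For reference, the writer side of those two locks is taken while the filesystem is suspended. A condensed sketch of zfsvfs_teardown(), as called from zfs_suspend_fs(), paraphrased from the upstream zfs_vfsops.c of that era; the O3X port may differ in detail:)

    static int
    zfsvfs_teardown(zfsvfs_t *zfsvfs, boolean_t unmounting)
    {
            /*
             * Taken as WRITER, so every ZFS_ENTER() reader above blocks
             * until zfs_resume_fs() drops it again.
             */
            rrw_enter(&zfsvfs->z_teardown_lock, RW_WRITER, FTAG);

            /* ... znode/vnode flushing elided ... */

            /*
             * Also taken as WRITER, which is what the reclaim thread's
             * rw_enter(..., RW_READER) is queued behind.
             */
            rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_WRITER);

            /* ... both locks stay held until zfs_resume_fs() ... */
            return (0);
    }
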
lundman commented 10 years ago

OK, the recv end goes through this block of code:

            error = zfs_suspend_fs(zsb);
            end_err = dmu_recv_end(&drc, zsb);
            if (error == 0)
                error = zfs_resume_fs(zsb, tofs);

zfs_suspend_fs() takes BOTH the rw_lock and the rrw_lock, and zfs_resume_fs() releases them again. However, the thread is stuck in dmu_recv_end() due to

                                       *994 dmu_recv_end + 71 (zfs) [0xffffff7f80d8fc57]
                                         *994 dmu_recv_existing_end + 124 (zfs) [0xffffff7f80d8fd8c]
                                           *994 dsl_sync_task + 567 (zfs) [0xffffff7f80dbae47]
                                             *994 txg_wait_synced + 261 (zfs) [0xffffff7f80dec235]
                                               *994 spl_cv_wait + 51 (spl) [0xffffff7f80d544a3]

waiting for the txg to sync.

But the sync thread is stuck in

   *994 txg_sync_thread + 962 (zfs) [0xffffff7f80debf12]
     *994 spa_sync + 1101 (zfs) [0xffffff7f80ddbbad]
       *994 dsl_pool_sync + 1039 (zfs) [0xffffff7f80db19cf]
         *994 dsl_sync_task_sync + 300 (zfs) [0xffffff7f80dbb0dc]
           *994 dmu_recv_end_sync + 421 (zfs) [0xffffff7f80d901d5]
             *994 dsl_dataset_clone_swap_sync_impl + 159 (zfs) [0xffffff7f80da4f7f]
               *994 dmu_objset_evict + 773 (zfs) [0xffffff7f80d87415]
                 *994 dnode_special_close + 62 (zfs) [0xffffff7f80d9ac6e]
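
(The cycle closes in dnode_special_close(): it busy-waits, via delay(), for the special dnode's hold count to drop to zero, but the remaining holders are the very threads blocked on the teardown locks above. Paraphrased from the upstream dnode.c of that era:)

    void
    dnode_special_close(dnode_handle_t *dnh)
    {
            dnode_t *dn = dnh->dnh_dnode;

            /*
             * Wait for final references to the dnode to clear.  In this
             * deadlock the holders can never release, so the sync thread
             * sits in delay(); that is the delay_for_interval frame in
             * the spindump above.
             */
            while (refcount_count(&dn->dn_holds) > 0)
                    delay(1);

            /* ... destroy the dnode ... */
    }
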
lundman commented 10 years ago

https://github.com/openzfsonosx/zfs/commit/459a09bcab967c6b3b71d62169675ecd1db97fb4

ilovezfs commented 10 years ago

Appears to still be broken. Now it freezes up.

https://www.dropbox.com/s/25csa6h4vb085y2/spindump.txt

ilovezfs commented 10 years ago

Still broken https://www.dropbox.com/s/db24fa8cekie2e1/spindump-1397097200.txt

lundman commented 7 years ago

Looking at the code:

getzfsvfs(const char *dsname, zfsvfs_t **zfvp)
{
...
        *zfvp = dmu_objset_get_user(os);
        if (*zfvp) {
                VFS_HOLD((*zfvp)->z_vfs);

and uses of it, like that of zfs_ioc_recv:

    if (getzfsvfs(tofs, &zfsvfs) == 0) {
        ds = dmu_objset_ds(zfsvfs->z_os);
        error = zfs_suspend_fs(zfsvfs);
        VFS_RELE(zfsvfs->z_vfs);

So it appears that getzfsvfs() grabs the zfsvfs while under the dmu_objset_hold() lock, then pins the VFS by calling VFS_HOLD(). Once the caller is done with the zfsvfs, it calls VFS_RELE().

On O3X, both VFS_HOLD and VFS_RELE are empty defines, as we do not have anything directly equivalent to the IllumOS versions.

Even though the IllumOS comment for VFS_HOLD sounds somewhat benign, it would appear that it stops unmounts from happening. For us, on the other hand, the unmount can happen, z_os is released (and possibly reused by something else), and our zfsvfs gets overwritten with random garbage (in addition to us using zfsvfs after it has been freed).

Our closest equivalent to VFS_HOLD appears to be vfs_busy, but there are some implementation differences.

IllumOS's VFS_HOLD is an atomic operation and can therefore be called repeatedly. A good example is zfs_vfsops.c calling it, then calling zfsctl_create(), which also calls it from gfs.c; that in turn calls getroot and we end up in zfs_znode, which calls it a third time.

On OSX you can only call it once; any nested call will deadlock.

A second issue is that even when calling vfs_busy( , LK_NOWAIT) it only detects when an unmount is in progress and returns immediately. It does not cover mounting, so we cannot call vfs_busy from within the zfs_vfs_mount() context (which is where zfsctl_create() is called, as above).

So we can either add #ifdef APPLE vfs_busy() calls around the functions that call getzfsvfs() (there are 5), or we can attempt to make VFS_HOLD use mount_fsprivate() to get the zfsvfs pointer and keep an atomic hold count in the struct, calling vfs_busy (vfs_unbusy) when the count reaches 1 in VFS_HOLD (0 in VFS_RELE). This needs the additional hack that zfs_vfs_mount() creates the zfsvfs with the count already at 1, since we may not call vfs_busy at that point (nor do we need to; the VFS is already locked in order to call us).
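
(A rough sketch of the second option, purely illustrative: the z_vfs_holds field and the helper functions are invented here, and vfs_fsprivate() stands in for the mount-private accessor mentioned above. This is not committed code:)

    /* Keep an atomic hold count in the zfsvfs, and translate the first hold
     * and the last release into vfs_busy()/vfs_unbusy() on the mount. */
    #define VFS_HOLD(vfsp)  zfsvfs_vfs_hold((zfsvfs_t *)vfs_fsprivate(vfsp))
    #define VFS_RELE(vfsp)  zfsvfs_vfs_rele((zfsvfs_t *)vfs_fsprivate(vfsp))

    static void
    zfsvfs_vfs_hold(zfsvfs_t *zfsvfs)
    {
            /* First hold: take the busy reference so unmount is held off. */
            if (atomic_inc_64_nv(&zfsvfs->z_vfs_holds) == 1)
                    (void) vfs_busy(zfsvfs->z_vfs, LK_NOWAIT);
    }

    static void
    zfsvfs_vfs_rele(zfsvfs_t *zfsvfs)
    {
            /* Last release: allow unmount to proceed again. */
            if (atomic_dec_64_nv(&zfsvfs->z_vfs_holds) == 0)
                    vfs_unbusy(zfsvfs->z_vfs);
    }

Per the additional hack above, zfs_vfs_mount() would create the zfsvfs with the count already at 1, so holds taken during mount never reach the vfs_busy() call.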

JMoVS commented 5 years ago

@lundman Or is this still an issue?