Closed dweeezil closed 9 years ago
This patch helps the most common cases. I can still get other deadlocks, however.
diff --git a/module/zfs/sa.c b/module/zfs/sa.c
index 9063d1d..8c66508 100644
--- a/module/zfs/sa.c
+++ b/module/zfs/sa.c
@@ -1436,7 +1436,13 @@ sa_handle_get(objset_t *objset, uint64_t objid, void *userp,
int
sa_buf_hold(objset_t *objset, uint64_t obj_num, void *tag, dmu_buf_t **db)
{
- return (dmu_bonus_hold(objset, obj_num, tag, db));
+ fstrans_cookie_t cookie;
+ int rval;
+
+ cookie = spl_fstrans_mark();
+ rval = (dmu_bonus_hold(objset, obj_num, tag, db));
+ spl_fstrans_unmark(cookie);
+ return (rval);
}
void
@dweeezil much appreciated! I'll add that patch on top of my stack.
Does everyone have their kernel built with CONFIG_PARAVIRT_SPINLOCKS, and thus paravirt support?
There are also lockups that seem to occur with higher probability when that option is NOT enabled: https://github.com/zfsonlinux/zfs/issues/3091
For example, the RHEL6 kernel is built without CONFIG_PARAVIRT_SPINLOCKS.
The patches in dweeezil/zfs@f04f972 make my test system much more resistant to deadlocks but there are still some remaining.
I'd also like to note that blaming this on the "kmem-rework merge" is not totally accurate. This class of deadlocks is more correctly related to deprecating KM_PUSHPAGE and relying instead on the correct use of PF_FSTRANS. The merge simply gave us the ability to use the PF_FSTRANS flag.
I'm withdrawing my last comment. More information as it becomes available.
In a previous posting, I suggested these deadlocks would also occur prior to the kmem-rework merge. That is not quite the case; however, there is some definite Bad Behavior when running these stress tests pre-kmem-rework merge (SPL at zfsonlinux/spl@47af4b7 and ZFS at d958324f9) on a system with memory limited to 4GiB. Sorry for the confusion (I wasn't running the versions I thought I was, due to a "+" sneaking into the localversion of the kernel under which I wanted to run the tests).
@dweeezil
Nice catch, but I have a stupid question.
What is this mutex ZFS_OBJ_HOLD_ENTER used for? I can't make any sense out of it.
Not exactly sure under which issue or pull request to put this, but:
http://marc.info/?l=linux-kernel&m=142601686103000&w=2 [PATCH 0/9] Sync and VFS scalability improvements
Seems related and could be helpful in countering some lockups, if not directly - then to get a better knowledge on kernel internals ...
I recently ran into a problem where I had millions of inodes that needed to be evicted at unmount time and it soft locked up the box and kept any other file system from doing work. These patches fix this problem by breaking the inode_sb_list_lock into per-sb, and then converting the per sb inode list into a list_lru for better scalability.
Without addressing either @tuxoko's last question or the posting referred to by @kernelOfTruth (both of which I'd like to address ASAP), here's a quick update on this issue: although the patch(es) in https://github.com/dweeezil/zfs/tree/post-kmem-rework-deadlocks have made my test system survive the stress testing much better, there are still some more deadlocks I can reproduce and need to track down (not counting @tuxoko's L2ARC fix). One remaining issue appears to involve zf_rwlock and dmu_zfetch() and friends (noting that my patch has already locked down the dbuf_prefetch() calls in dmu_prefetch()).
I'm more than a little concerned about the manner in which these deadlocks need to be addressed. Previously, as @behlendorf has mentioned, a large number of allocations were under the (old semantics of) KM_PUSHPAGE, but now we're relying on PF_FSTRANS or MUTEX_FSTRANS to prevent re-entering the filesystem inappropriately during reclaim. My concern is whether we can get back to the level of reliability we had in a timely fashion (given that, apparently, there had been a push for a new tag/release).
I don't want to suggest we revert to the old way. In fact, running my stress tests on pre-kmem-rework code results in other Very Bad Things happening (lots of write issue threads spinning, etc.), so we're definitely heading in the right direction.
Finally, I'd like to address something tangentially related to @tuxoko's query as to the purpose of the z_hold_mtx[] array: I've been puzzling over the if (!ZFS_OBJ_HOLD_OWNED(zsb, z_id)) { test put into zfs_zinactive() in d6bd8eaa, the comments for which indicate it was added to address a deadlock. However, the test essentially performs the m_owner check against current, which is exactly what we ASSERT in mutex_enter(). In other words, the test appears to be checking for something which we require to be true in the first place in order to take the lock. This must be an area of difference between Linux and Solaris about which I'm not clear.
First off thanks @dweeezil and @tuxoko for taking the time to look at this issue. Let me try and shed some light on this.
What is this mutex ZFS_OBJ_HOLD_ENTER used for?
Good question, and I wish I had an authoritative answer. This is something which was preserved from the upstream OpenSolaris code, but it's never been 100% clear to me that we actually need it. It's designed to serialize access to a given object id from the VFS, and the comments suggest this is important for dmu_buf_hold(), but it's not at all clear to me why this is needed.
KM_PUSHPAGE but now we're relying on PF_FSTRANS or MUTEX_FSTRANS
This is definitely a step in the right direction since it leverages the proper Linux infrastructure. However, it might take us a little while to run down all these new potential deadlocks which were hidden before. We could consider applying PF_FSTRANS more broadly as a stop-gap measure. This would be roughly equivalent to the previous KM_PUSHPAGE approach, which was broadly applied. One very easy way to do this would be to change the zsb->z_hold_mtx type to MUTEX_FSTRANS.
I've been puzzling over the if (!ZFS_OBJ_HOLD_OWNED(zsb, z_id)) { test put into zfs_zinactive() in d6bd8ea.
This is actually a fix which predates the wide use of KM_PUSHPAGE. What could happen is that a KM_SLEEP allocation could cause us to re-enter zfs_zinactive() while we were already holding the mutex. Flagging all the allocations with KM_PUSHPAGE also resolved this, but the fix was never reverted. Commit 7f3e4662832269b687ff20dafc6a33f8e1d28912 in master should also prevent this, so it should be safe to revert this hunk.
@dweeezil let me propose #3196 for feedback/testing. It definitely takes a big-hammer approach to this issue, and I don't feel it's the best long-term solution. But functionally it's not that far removed from our previous extensive use of KM_PUSHPAGE.
@behlendorf I noticed something a bit concerning in the MUTEX_FSTRANS patch which I didn't catch when I reviewed it. The way it saves/restores flags requires that mutexes be released strictly in the reverse order in which they were acquired.
Consider mutexes A and B, both MUTEX_FSTRANS:
mutex_enter(A):
current->flags = FS_TRANS
A->save_flags = !FS_TRANS
mutex_enter(B):
current->flags = FS_TRANS
B->save_flags = FS_TRANS
mutex_exit(A)
current->flags = !FS_TRANS
//memory allocation here could cause deadlock on B even though it's MUTEX_FSTRANS
mutex_exit(B)
current->flags = FS_TRANS
//we don't hold any mutex here but we are still FS_TRANS
I don't know if ZFS contains this kind of ordering, but I don't think it would be a good idea to have this kind of restriction on the mutex. Also, I don't think this could be solved transparently unless we have some sort of per-thread counter...
@behlendorf Also, in 4ec15b8dcf8038aeb15c7877c50d0fa500b468c6, you use MUTEX_FSTRANS on the dbuf hash_mutexes. But there are no memory allocations in the hash_mutexes context, so it doesn't actually do anything. For deadlocks like #3055, you need to use FSTRANS in the db_mtx context. That's where the memory allocation occurs and causes the lock inversion.
My most recent deadlock is as follows:
Kernel threads:
kswapd0 D 0000000000000001 0 388 2 0x00000000 ffff88041bc7b9f0 0000000000000046 ffff88001d6805f8 ffff88041bc7a010 ffff88041d9e7920 0000000000010900 ffff88041bc7bfd8 0000000000004000 ffff88041bc7bfd8 0000000000010900 ffff88040da73960 ffff88041d9e7920 Call Trace: [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs] [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs] [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs] [<ffffffff81311c28>] schedule+0x5f/0x61 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123 [<ffffffff813109fa>] mutex_lock+0x1e/0x32 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs] [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs] [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs] [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs] [<ffffffff810cd1b4>] evict+0xa2/0x155 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8 [<ffffffff810bb67f>] prune_super+0xe3/0x158 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c [<ffffffff8108bf56>] balance_pgdat+0x3a8/0x7ea [<ffffffff8108c64c>] kswapd+0x2b4/0x2e8 [<ffffffff8103f2c0>] ? wake_up_bit+0x25/0x25 [<ffffffff8108c398>] ? balance_pgdat+0x7ea/0x7ea [<ffffffff8103eecc>] kthread+0x84/0x8c [<ffffffff81314014>] kernel_thread_helper+0x4/0x10 [<ffffffff8103ee48>] ? kthread_freezable_should_stop+0x5d/0x5d [<ffffffff81314010>] ? gs_change+0xb/0xb khugepaged D 0000000000000000 0 493 2 0x00000000 ffff88041bd8b850 0000000000000046 ffff88041204da78 ffff88041bd8a010 ffff88041da8b300 0000000000010900 ffff88041bd8bfd8 0000000000004000 ffff88041bd8bfd8 0000000000010900 ffffffff81613020 ffff88041da8b300 Call Trace: [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs] [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs] [<ffffffffa00f535c>] ? 
dbuf_rele_and_unlock+0x1d3/0x358 [zfs] [<ffffffff81311c28>] schedule+0x5f/0x61 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123 [<ffffffff813109fa>] mutex_lock+0x1e/0x32 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs] [<ffffffff810adce9>] ? unfreeze_partials+0x1eb/0x1eb [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs] [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs] [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs] [<ffffffff810cd1b4>] evict+0xa2/0x155 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8 [<ffffffff810bb67f>] prune_super+0xe3/0x158 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c [<ffffffff8108b9b8>] do_try_to_free_pages+0x318/0x4a4 [<ffffffff8108bbac>] try_to_free_pages+0x68/0x6a [<ffffffff81084cd2>] __alloc_pages_nodemask+0x4d1/0x75e [<ffffffff810b351c>] khugepaged+0x91/0x1051 [<ffffffff8103f2c0>] ? wake_up_bit+0x25/0x25 [<ffffffff810b348b>] ? collect_mm_slot+0xa0/0xa0 [<ffffffff8103eecc>] kthread+0x84/0x8c [<ffffffff81314014>] kernel_thread_helper+0x4/0x10 [<ffffffff8103ee48>] ? kthread_freezable_should_stop+0x5d/0x5d [<ffffffff81314010>] ? gs_change+0xb/0xb arc_adapt D 0000000000000000 0 1593 2 0x00000000 ffff880413701b20 0000000000000046 ffff880413701b30 ffff880413700010 ffff88041da09980 0000000000010900 ffff880413701fd8 0000000000004000 ffff880413701fd8 0000000000010900 ffff88018ced1320 ffff88041da09980 Call Trace: [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs] [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs] [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs] [<ffffffff81311c28>] schedule+0x5f/0x61 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123 [<ffffffff813109fa>] mutex_lock+0x1e/0x32 [<ffffffffa017aa55>] ? 
zfs_znode_dmu_fini+0x18/0x27 [zfs] [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs] [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs] [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs] [<ffffffff810cd1b4>] evict+0xa2/0x155 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8 [<ffffffff810bb67f>] prune_super+0xe3/0x158 [<ffffffffa016fc93>] zfs_sb_prune+0x76/0x9b [zfs] [<ffffffffa018d119>] zpl_prune_sb+0x21/0x24 [zfs] [<ffffffffa00f0a37>] arc_adapt_thread+0x28b/0x497 [zfs] [<ffffffffa018d0f8>] ? zpl_inode_alloc+0x6b/0x6b [zfs] [<ffffffffa00f07ac>] ? arc_adjust+0x105/0x105 [zfs] [<ffffffffa00a4cd8>] ? __thread_create+0x122/0x122 [spl] [<ffffffffa00a4d44>] thread_generic_wrapper+0x6c/0x79 [spl] [<ffffffffa00a4cd8>] ? __thread_create+0x122/0x122 [spl] [<ffffffff8103eecc>] kthread+0x84/0x8c [<ffffffff81314014>] kernel_thread_helper+0x4/0x10 [<ffffffff8103ee48>] ? kthread_freezable_should_stop+0x5d/0x5d [<ffffffff81314010>] ? gs_change+0xb/0xb
User proc(s):
usage.pl D 0000000000000001 0 8726 8169 0x00000000 ffff880206af11c8 0000000000000082 ffff880206af1128 ffff880206af0010 ffff880365a31980 0000000000010900 ffff880206af1fd8 0000000000004000 ffff880206af1fd8 0000000000010900 ffff88040d400000 ffff880365a31980 Call Trace: [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs] [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs] [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs] [<ffffffff812063a5>] ? scsi_dma_map+0x82/0x9d [<ffffffff81311c28>] schedule+0x5f/0x61 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123 [<ffffffff813109fa>] mutex_lock+0x1e/0x32 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs] [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs] [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs] [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs] [<ffffffff810cd1b4>] evict+0xa2/0x155 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8 [<ffffffff810bb67f>] prune_super+0xe3/0x158 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c [<ffffffff8108b9b8>] do_try_to_free_pages+0x318/0x4a4 [<ffffffff810845c9>] ? get_page_from_freelist+0x477/0x4a8 [<ffffffff8108bbac>] try_to_free_pages+0x68/0x6a [<ffffffff81084cd2>] __alloc_pages_nodemask+0x4d1/0x75e [<ffffffffa014c861>] ? vdev_disk_io_start+0x13b/0x156 [zfs] [<ffffffff810af134>] new_slab+0x65/0x1ee [<ffffffff810af5c7>] __slab_alloc.clone.1+0x15c/0x239 [<ffffffffa00a2c91>] ? spl_kmem_alloc+0xfc/0x17d [spl] [<ffffffffa00f68fa>] ? dbuf_read+0x632/0x832 [zfs] [<ffffffffa00a2c91>] ? spl_kmem_alloc+0xfc/0x17d [spl] [<ffffffff810af778>] __kmalloc+0xd4/0x127 [<ffffffffa00a2c91>] spl_kmem_alloc+0xfc/0x17d [spl] [<ffffffffa010fb38>] dnode_hold_impl+0x24f/0x53e [zfs] [<ffffffffa010fe3b>] dnode_hold+0x14/0x16 [zfs] [<ffffffffa00ff716>] dmu_bonus_hold+0x24/0x288 [zfs] [<ffffffff813109ed>] ? 
mutex_lock+0x11/0x32 [<ffffffffa012e343>] sa_buf_hold+0x9/0xb [zfs] [<ffffffffa017d296>] zfs_zget+0xc9/0x3b3 [zfs] [<ffffffffa0157481>] ? zap_unlockdir+0x8f/0xa3 [zfs] [<ffffffffa0161cc9>] zfs_dirent_lock+0x4c6/0x50b [zfs] [<ffffffffa016208b>] zfs_dirlook+0x21b/0x284 [zfs] [<ffffffffa015c15c>] ? zfs_zaccess+0x2b0/0x346 [zfs] [<ffffffffa017915d>] zfs_lookup+0x267/0x2ad [zfs] [<ffffffffa018c4a3>] zpl_lookup+0x5b/0xad [zfs] [<ffffffff810c1d38>] __lookup_hash+0xb9/0xdd [<ffffffff810c1fc5>] do_lookup+0x269/0x2a4 [<ffffffff810c343a>] path_lookupat+0xe9/0x5c5 [<ffffffff810af7f1>] ? kmem_cache_alloc+0x26/0xa0 [<ffffffff810c3934>] do_path_lookup+0x1e/0x54 [<ffffffff810c3ba2>] user_path_at_empty+0x50/0x96 [<ffffffff813109ed>] ? mutex_lock+0x11/0x32 [<ffffffffa01769f9>] ? zfs_getattr_fast+0x175/0x186 [zfs] [<ffffffff810c3bb7>] ? user_path_at_empty+0x65/0x96 [<ffffffff810bc8c0>] ? cp_new_stat+0xf2/0x108 [<ffffffff810c3bf4>] user_path_at+0xc/0xe [<ffffffff810bca50>] vfs_fstatat+0x3d/0x68 [<ffffffff810bcb54>] vfs_stat+0x16/0x18 [<ffffffff810bcb70>] sys_newstat+0x1a/0x38 [<ffffffff81312f62>] system_call_fastpath+0x16/0x1b usage.pl D 0000000000000001 0 9251 8462 0x00000000 ffff88010a7baca8 0000000000000082 ffff88010a7bab98 ffff88010a7ba010 ffff88040da75940 0000000000010900 ffff88010a7bbfd8 0000000000004000 ffff88010a7bbfd8 0000000000010900 ffff880365a33300 ffff88040da75940 Call Trace: [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs] [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs] [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs] [<ffffffff81311c28>] schedule+0x5f/0x61 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123 [<ffffffff813109fa>] mutex_lock+0x1e/0x32 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs] [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs] [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs] [<ffffffff81088ee7>] ? 
truncate_pagecache+0x51/0x59 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs] [<ffffffff810cd1b4>] evict+0xa2/0x155 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8 [<ffffffff810bb67f>] prune_super+0xe3/0x158 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c [<ffffffff8108b9b8>] do_try_to_free_pages+0x318/0x4a4 [<ffffffff8108bbac>] try_to_free_pages+0x68/0x6a [<ffffffff81084cd2>] __alloc_pages_nodemask+0x4d1/0x75e [<ffffffff810af134>] new_slab+0x65/0x1ee [<ffffffff810af5c7>] __slab_alloc.clone.1+0x15c/0x239 [<ffffffffa00a36cf>] ? spl_kmem_cache_alloc+0xb1/0x7ef [spl] [<ffffffffa00a36cf>] ? spl_kmem_cache_alloc+0xb1/0x7ef [spl] [<ffffffff810af818>] kmem_cache_alloc+0x4d/0xa0 [<ffffffffa00a36cf>] spl_kmem_cache_alloc+0xb1/0x7ef [spl] [<ffffffff810af7f1>] ? kmem_cache_alloc+0x26/0xa0 [<ffffffffa00a36cf>] ? spl_kmem_cache_alloc+0xb1/0x7ef [spl] [<ffffffffa0185ccb>] zio_create+0x50/0x2ce [zfs] [<ffffffffa0186d50>] zio_read+0x97/0x99 [zfs] [<ffffffffa00f183a>] ? arc_fini+0x6be/0x6be [zfs] [<ffffffffa00efb7f>] arc_read+0x8f5/0x93a [zfs] [<ffffffffa00a2b14>] ? spl_kmem_zalloc+0xf1/0x172 [spl] [<ffffffffa00f92d2>] dbuf_prefetch+0x2a7/0x2c9 [zfs] [<ffffffffa010c2c4>] dmu_zfetch_dofetch+0xe0/0x122 [zfs] [<ffffffffa010ca42>] dmu_zfetch+0x73c/0xfb9 [zfs] [<ffffffffa00f640c>] dbuf_read+0x144/0x832 [zfs] [<ffffffffa00a2b14>] ? spl_kmem_zalloc+0xf1/0x172 [spl] [<ffffffff810af3b6>] ? kfree+0x1b/0xad [<ffffffffa00a274c>] ? spl_kmem_free+0x35/0x37 [spl] [<ffffffffa010face>] dnode_hold_impl+0x1e5/0x53e [zfs] [<ffffffffa010fe3b>] dnode_hold+0x14/0x16 [zfs] [<ffffffffa00ff716>] dmu_bonus_hold+0x24/0x288 [zfs] [<ffffffff813109ed>] ? mutex_lock+0x11/0x32 [<ffffffffa012e343>] sa_buf_hold+0x9/0xb [zfs] [<ffffffffa017d296>] zfs_zget+0xc9/0x3b3 [zfs] [<ffffffffa0157481>] ? zap_unlockdir+0x8f/0xa3 [zfs] [<ffffffffa0161cc9>] zfs_dirent_lock+0x4c6/0x50b [zfs] [<ffffffffa016208b>] zfs_dirlook+0x21b/0x284 [zfs] [<ffffffffa015c15c>] ? 
zfs_zaccess+0x2b0/0x346 [zfs] [<ffffffffa017915d>] zfs_lookup+0x267/0x2ad [zfs] [<ffffffffa018c4a3>] zpl_lookup+0x5b/0xad [zfs] [<ffffffff810c1d38>] __lookup_hash+0xb9/0xdd [<ffffffff810c1fc5>] do_lookup+0x269/0x2a4 [<ffffffff810c343a>] path_lookupat+0xe9/0x5c5 [<ffffffff810c23b5>] ? getname_flags+0x2b/0x1bc [<ffffffff810af7f1>] ? kmem_cache_alloc+0x26/0xa0 [<ffffffff810c3934>] do_path_lookup+0x1e/0x54 [<ffffffff810c3ba2>] user_path_at_empty+0x50/0x96 [<ffffffff813109ed>] ? mutex_lock+0x11/0x32 [<ffffffffa01769f9>] ? zfs_getattr_fast+0x175/0x186 [zfs] [<ffffffff810bc8c0>] ? cp_new_stat+0xf2/0x108 [<ffffffff810c3bf4>] user_path_at+0xc/0xe [<ffffffff810bca50>] vfs_fstatat+0x3d/0x68 [<ffffffff810bcb54>] vfs_stat+0x16/0x18 [<ffffffff810bcb70>] sys_newstat+0x1a/0x38 [<ffffffff81312f62>] system_call_fastpath+0x16/0x1b
Kernel version 3.2.86
SPL: Loaded module v0.6.3-77_g79a0056 ZFS: Loaded module v0.6.3-265_gb2e5a32, ZFS pool version 5000, ZFS filesystem version 5 pool: whoopass1 state: ONLINE status: Some supported features are not enabled on the pool. The pool can still be used, but some features are unavailable. action: Enable all features using 'zpool upgrade'. Once this is done, the pool may no longer be accessible by software that does not support the features. See zpool-features(5) for details. scan: scrub canceled on Mon Mar 23 09:37:24 2015 config: NAME STATE READ WRITE CKSUM whoopass1 ONLINE 0 0 0 sdc ONLINE 0 0 0 cache SSD-l2arc ONLINE 0 0 0 errors: No known data errors
Events in progress at the time of the hang were primarily a boatload of du-style operations which would (legitimately) send the load average to ~24.
Ugh... posted this in the wrong place a little while ago...
I think I've tracked down another deadlock. This one involves zf_rwlock acquired in dmu_zfetch() and z_hold_mtx acquired in zfs_zget().
One process has this stack:
[<ffffffffa0704c27>] dmu_zfetch+0x327/0x1880 [zfs] (tries to acquire zf_rwlock)
[<ffffffffa06dfa81>] dbuf_read+0x2e1/0x11d0 [zfs]
[<ffffffffa0708c9e>] dnode_hold_impl+0x11e/0xb40 [zfs]
[<ffffffffa07096d9>] dnode_hold+0x19/0x20 [zfs]
[<ffffffffa06ef37c>] dmu_bonus_hold+0x2c/0x450 [zfs]
[<ffffffffa073f2be>] sa_buf_hold+0x2e/0x80 [zfs]
[<ffffffffa07c0549>] zfs_zget+0xb9/0x690 [zfs] (acquires a z_hold_mtx[] entry)
[<ffffffffa079443a>] zfs_dirent_lock+0x6da/0xab0 [zfs]
[<ffffffffa0794891>] zfs_dirlook+0x81/0x430 [zfs]
[<ffffffffa07b205e>] zfs_lookup+0x3de/0x490 [zfs]
[<ffffffffa07db22b>] zpl_lookup+0x6b/0x150 [zfs]
and the other is like this (it's a long one):
[<ffffffffa07c1892>] zfs_zinactive+0x92/0x3d0 [zfs] (tries to acquire the z_hold_mtx[] entry)
[<ffffffffa07b3b42>] zfs_inactive+0x72/0x480 [zfs]
[<ffffffffa07dc033>] zpl_evict_inode+0x43/0x90 [zfs]
[<ffffffff811e4ed0>] evict+0xb0/0x1b0
[<ffffffff811e5009>] dispose_list+0x39/0x50
[<ffffffff811e5f37>] prune_icache_sb+0x47/0x60
[<ffffffff811cc119>] super_cache_scan+0xf9/0x160
[<ffffffff8116a727>] shrink_slab+0x1c7/0x370
[<ffffffff8116d6cd>] do_try_to_free_pages+0x3ed/0x540
Uh oh....
[<ffffffff8116d90c>] try_to_free_pages+0xec/0x180
[<ffffffff81162218>] __alloc_pages_nodemask+0x838/0xb90
[<ffffffff811a116e>] alloc_pages_current+0xfe/0x1c0
[<ffffffff811a9ed5>] new_slab+0x295/0x320
[<ffffffff81738721>] __slab_alloc+0x2f8/0x4a8
[<ffffffff811ac684>] kmem_cache_alloc+0x184/0x1d0
[<ffffffffa0307799>] spl_kmem_cache_alloc+0xe9/0xca0 [spl]
[<ffffffffa07ce724>] zio_create+0x84/0x780 [zfs]
[<ffffffffa07cfe7d>] zio_vdev_child_io+0xdd/0x190 [zfs]
[<ffffffffa0777e57>] vdev_mirror_io_start+0xa7/0x1d0 [zfs]
[<ffffffffa07d3197>] zio_vdev_io_start+0x2d7/0x560 [zfs]
[<ffffffffa07d07df>] zio_nowait+0x10f/0x310 [zfs]
[<ffffffffa06d0673>] arc_read+0x683/0x1270 [zfs]
[<ffffffffa06e17ba>] dbuf_prefetch+0x2da/0x5a0 [zfs]
[<ffffffffa07042c4>] dmu_zfetch_dofetch.isra.4+0x104/0x1c0 [zfs]
[<ffffffffa07050d3>] dmu_zfetch+0x7d3/0x1880 [zfs] (acquires zf_rwlock)
[<ffffffffa06e04f4>] dbuf_read+0xd54/0x11d0 [zfs]
[<ffffffffa0707c68>] dnode_next_offset_level+0x398/0x640 [zfs]
[<ffffffffa070de76>] dnode_next_offset+0x96/0x470 [zfs]
[<ffffffffa06f1d03>] dmu_object_next+0x43/0x60 [zfs]
[<ffffffffa06f1e60>] dmu_object_alloc+0x140/0x270 [zfs]
[<ffffffffa07bdb58>] zfs_mknode+0xf8/0xf10 [zfs]
[<ffffffffa07b87ec>] zfs_create+0x62c/0x9b0 [zfs]
[<ffffffffa07dbae4>] zpl_create+0x94/0x1c0 [zfs]
Deadlock...
I had been zeroing in on dmu_zfetch & friends earlier on in my work on this.
@dweeezil could you open a pull request with the fixes you've already put together for the deadlocks? I won't merge anything until we're all agreed they're the right fixes, but it would be nice to have them all in one place.
@behlendorf I'll do that later this evening. I was able to run my stress test with my latest patch for a while this afternoon with no hangs.
@kpande Thanks. I'll follow-up later today. I'm going to post my WIP branch as a pull request shortly so it can get some more attention and will add a couple more comments there.
I'm going to be moving all my future updates to #3225. I'm still able to create deadlocks, but have to try much much harder now. Follow-up in the referenced pull request.
The vast majority of these deadlocks have been resolved by the patch in #3225. We'll address any remaining issues in individual issues.
As outlined in the kmem-rework merge, and follow-up commits such as 4ec15b8, there are, and may continue to be, additional code paths within ZFS which require inhibiting filesystem re-entry during reclaim.
I've identified several conditions under which a system can be deadlocked rather easily. The easiest instance to reproduce involves the interaction between zfs_zinactive() and zfs_zget() due to the manner in which the z_hold_mtx[] mutex array is used. zfs_zinactive() has the following code to try to prevent deadlocks:

I've produced a patch which flattens the z_hold_mtx[] array from a 256-element hash to a series of 4 consecutive mutexes (also used in a hash-like manner) in order to get better diagnostics from Linux's lock debugging facilities. Here's an example from one of this class of deadlocks:

The first process is holding element 0 while trying to acquire element 2, and the second is holding element 2 while trying to acquire element 0. If the stock ZFS code were being used, the locks would simply show up as "z_hold_mtx[i]". Here's the interesting part of the stack trace from the first process:
The stack trace from the second process is almost identical so I'll not post it. This class of deadlock appears to be caused by the hash table use of the mutex array: there are only 256 mutexes for an arbitrary number of objects.
The most obvious solutions to fix this deadlock are to either use spl_fstrans_mark() in zfs_zget() or to use the new MUTEX_FSTRANS flag on the mutexes. Unfortunately, neither works to completely eliminate this class of deadlocks. Adding the fstrans flag to zfs_zget() allows the same deadlock to occur via a different path; the common path becomes sa_buf_hold(). Again, the second stack is pretty much identical.
Moving the whole array of mutexes under the FSTRANS lock mode still allows a deadlock to occur, but I've not yet identified the paths involved.
The reproducing scripts are available in https://github.com/dweeezil/stress-tests. They're trivial scripts which perform a bunch of simple operations in parallel. Given a fully populated test tree, I've generally only needed to run the create.sh script in order to produce a deadlock. With the cut-down pseudo mutex array under MUTEX_FSTRANS, I generally also need to run the remove-random.sh script concurrently in order to produce a deadlock.

One other note is that I first encountered these deadlocks on a system with either 4 or 8GiB of RAM (I can't remember which). I've been using mem=4g on my kernel boot line to reproduce this. I'll also note that my test system has no swap space. My goal was to exercise the system with highly parallel operations and with lots of reclaim happening.

These scripts are trivially simple, and with stock code on a system in which reclaim is necessary they will generally cause a deadlock in less than a minute, so I do consider this to be a Big Problem.