Closed dweeezil closed 9 years ago
This patch helps the most common cases. I can still get other deadlocks, however.
diff --git a/module/zfs/sa.c b/module/zfs/sa.c
index 9063d1d..8c66508 100644
--- a/module/zfs/sa.c
+++ b/module/zfs/sa.c
@@ -1436,7 +1436,13 @@ sa_handle_get(objset_t *objset, uint64_t objid, void *userp,
int
sa_buf_hold(objset_t *objset, uint64_t obj_num, void *tag, dmu_buf_t **db)
{
- return (dmu_bonus_hold(objset, obj_num, tag, db));
+ fstrans_cookie_t cookie;
+ int rval;
+
+ cookie = spl_fstrans_mark();
+ rval = (dmu_bonus_hold(objset, obj_num, tag, db));
+ spl_fstrans_unmark(cookie);
+ return (rval);
}
void
@dweeezil much appreciated! I'll add that patch on top of my stack.
Does everyone have their kernel built with CONFIG_PARAVIRT_SPINLOCKS, and thus paravirt support?
There are also lockups that seem to occur with higher probability when that option is NOT enabled: https://github.com/zfsonlinux/zfs/issues/3091
For example, the RHEL6 kernel is built without CONFIG_PARAVIRT_SPINLOCKS.
The patches in dweeezil/zfs@f04f972 make my test system much more resistant to deadlocks but there are still some remaining.
I'd also like to note that blaming this on the "kmem-rework merge" is not totally accurate. This class of deadlocks is more correctly related to deprecating KM_PUSHPAGE and relying instead on the correct use of PF_FSTRANS. The merge simply gave us the ability to use the PF_FSTRANS flag.
I'm withdrawing my last comment. More information as it becomes available.
In a previous posting, I suggested these deadlocks would also occur prior to the kmem-rework merge. That is not quite the case; however, there is some definite Bad Behavior when running these stress tests pre-kmem-rework merge (SPL at zfsonlinux/spl@47af4b7 and ZFS at d958324f9) on a system with memory limited to 4GiB. Sorry for the confusion (I wasn't running the versions I thought I was, due to a "+" sneaking into the localversion of the kernel under which I wanted to run the tests).
@dweeezil
Nice catch, but I have a stupid question.
What is this mutex ZFS_OBJ_HOLD_ENTER used for? I can't make any sense out of it.
Not exactly sure under which issue or pull request to put this, but:
http://marc.info/?l=linux-kernel&m=142601686103000&w=2 [PATCH 0/9] Sync and VFS scalability improvements
Seems related and could be helpful in countering some lockups, if not directly - then to get a better knowledge on kernel internals ...
I recently ran into a problem where I had millions of inodes that needed to be evicted at unmount time and it soft locked up the box and kept any other file system from doing work. These patches fix this problem by breaking the inode_sb_list_lock into per-sb, and then converting the per sb inode list into a list_lru for better scalability.
Without addressing either @tuxoko's last question or the posting referred to by @kernelOfTruth (both of which I'd like to address ASAP), here's a quick update on this issue: although the patch(es) in https://github.com/dweeezil/zfs/tree/post-kmem-rework-deadlocks have made my test system survive the stress testing much better, there are still some more deadlocks I can reproduce and need to track down (not counting @tuxoko's L2ARC fix). One remaining issue appears to involve zf_rwlock and dmu_zfetch() and friends (noting that my patch has already locked down the dbuf_prefetch() calls in dmu_prefetch()).
I'm more than a little concerned about the manner in which these deadlocks need to be addressed. Previously, as @behlendorf has mentioned, a large number of allocations were under the (old semantics of) KM_PUSHPAGE, but now we're relying on PF_FSTRANS or MUTEX_FSTRANS to prevent re-entering the filesystem inappropriately during reclaim. My concern is whether we can get back to the level of reliability we had in a timely fashion (given that, apparently, there had been a push for a new tag/release).
I don't want to suggest we revert to the old way. In fact, running my stress tests on pre-kmem-rework code results in other Very Bad Things happening (lots of write issue threads spinning, etc.), so we're definitely heading in the right direction.
Finally, I'd like to address something tangentially related to @tuxoko's query as to the purpose of the z_hold_mtx[] array: I've been puzzling over the if (!ZFS_OBJ_HOLD_OWNED(zsb, z_id)) { test put into zfs_zinactive() in d6bd8eaa, the comments for which indicate it was added to address a deadlock. However, the test essentially performs the m_owner check against current, which is exactly what we ASSERT in mutex_enter(). In other words, the test appears to be checking for something which we require to be true in the first place in order to take the lock. This must be an area of difference between Linux and Solaris about which I'm not clear.
First off thanks @dweeezil and @tuxoko for taking the time to look at this issue. Let me try and shed some light on this.
What is this mutex ZFS_OBJ_HOLD_ENTER used for?
Good question, and I wish I had an authoritative answer. This is something which was preserved from the upstream OpenSolaris code, but it's never been 100% clear to me that we actually need it. It's designed to serialize access to a given object id from the VFS, and the comments suggest this is important for dmu_buf_hold(), but it's not at all clear to me why this is needed.
KM_PUSHPAGE but now we're relying on PF_FSTRANS or MUTEX_FSTRANS
This is definitely a step in the right direction since it leverages the proper Linux infrastructure. However, it might take us a little while to run down all these new potential deadlocks which were hidden before. We could consider applying PF_FSTRANS more broadly as a stop-gap measure. This would be roughly equivalent to the previous KM_PUSHPAGE approach, which was broadly applied. One very easy way to do this would be to change the zsb->z_hold_mtx type to MUTEX_FSTRANS.
I've been puzzling over the if (!ZFS_OBJ_HOLD_OWNED(zsb, z_id)) { test put into zfs_zinactive() in d6bd8ea.
This is actually a fix which predates the wide use of KM_PUSHPAGE. What could happen is that a KM_SLEEP allocation could cause us to re-enter zfs_zinactive() while we were already holding the mutex. Flagging all the allocations with KM_PUSHPAGE also resolved this, but the fix was never reverted. Commit 7f3e4662832269b687ff20dafc6a33f8e1d28912 in master should also prevent this, so it should be safe to revert this hunk.
@dweeezil let me propose #3196 for feedback/testing. It definitely takes a big-hammer approach to this issue, and I don't feel it's the best long-term solution. But functionally it's not that far removed from our previous extensive use of KM_PUSHPAGE.
@behlendorf I noticed something a bit concerning in the MUTEX_FSTRANS patch which I didn't catch when I reviewed it. The way it saves/restores flags requires that mutexes be released strictly in the reverse order in which they were acquired.
Consider mutexes A and B, both MUTEX_FSTRANS:
mutex_enter(A):
current->flags = FS_TRANS
A->save_flags = !FS_TRANS
mutex_enter(B):
current->flags = FS_TRANS
B->save_flags = FS_TRANS
mutex_exit(A)
current->flags = !FS_TRANS
//memory allocation here could cause deadlock on B even though it's MUTEX_FSTRANS
mutex_exit(B)
current->flags = FS_TRANS
//we don't hold any mutex here but we are still FS_TRANS
I don't know if ZFS contains this kind of ordering, but I don't think it would be a good idea to have this kind of restriction on the mutex. Also, I don't think this could be solved transparently unless we have some sort of per-thread counter...
@behlendorf Also, in 4ec15b8dcf8038aeb15c7877c50d0fa500b468c6, you use MUTEX_FSTRANS on the dbuf hash_mutexes. But there are no memory allocations in the hash_mutexes context, so it doesn't actually do anything. For deadlocks like #3055, you need to use FSTRANS in the db_mtx context. That's where the memory allocation occurs and causes the lock inversion.
My most recent deadlock is as follows:
Kernel threads:
kswapd0 D 0000000000000001 0 388 2 0x00000000 ffff88041bc7b9f0 0000000000000046 ffff88001d6805f8 ffff88041bc7a010 ffff88041d9e7920 0000000000010900 ffff88041bc7bfd8 0000000000004000 ffff88041bc7bfd8 0000000000010900 ffff88040da73960 ffff88041d9e7920 Call Trace: [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs] [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs] [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs] [<ffffffff81311c28>] schedule+0x5f/0x61 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123 [<ffffffff813109fa>] mutex_lock+0x1e/0x32 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs] [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs] [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs] [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs] [<ffffffff810cd1b4>] evict+0xa2/0x155 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8 [<ffffffff810bb67f>] prune_super+0xe3/0x158 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c [<ffffffff8108bf56>] balance_pgdat+0x3a8/0x7ea [<ffffffff8108c64c>] kswapd+0x2b4/0x2e8 [<ffffffff8103f2c0>] ? wake_up_bit+0x25/0x25 [<ffffffff8108c398>] ? balance_pgdat+0x7ea/0x7ea [<ffffffff8103eecc>] kthread+0x84/0x8c [<ffffffff81314014>] kernel_thread_helper+0x4/0x10 [<ffffffff8103ee48>] ? kthread_freezable_should_stop+0x5d/0x5d [<ffffffff81314010>] ? gs_change+0xb/0xb khugepaged D 0000000000000000 0 493 2 0x00000000 ffff88041bd8b850 0000000000000046 ffff88041204da78 ffff88041bd8a010 ffff88041da8b300 0000000000010900 ffff88041bd8bfd8 0000000000004000 ffff88041bd8bfd8 0000000000010900 ffffffff81613020 ffff88041da8b300 Call Trace: [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs] [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs] [<ffffffffa00f535c>] ? 
dbuf_rele_and_unlock+0x1d3/0x358 [zfs] [<ffffffff81311c28>] schedule+0x5f/0x61 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123 [<ffffffff813109fa>] mutex_lock+0x1e/0x32 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs] [<ffffffff810adce9>] ? unfreeze_partials+0x1eb/0x1eb [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs] [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs] [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs] [<ffffffff810cd1b4>] evict+0xa2/0x155 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8 [<ffffffff810bb67f>] prune_super+0xe3/0x158 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c [<ffffffff8108b9b8>] do_try_to_free_pages+0x318/0x4a4 [<ffffffff8108bbac>] try_to_free_pages+0x68/0x6a [<ffffffff81084cd2>] __alloc_pages_nodemask+0x4d1/0x75e [<ffffffff810b351c>] khugepaged+0x91/0x1051 [<ffffffff8103f2c0>] ? wake_up_bit+0x25/0x25 [<ffffffff810b348b>] ? collect_mm_slot+0xa0/0xa0 [<ffffffff8103eecc>] kthread+0x84/0x8c [<ffffffff81314014>] kernel_thread_helper+0x4/0x10 [<ffffffff8103ee48>] ? kthread_freezable_should_stop+0x5d/0x5d [<ffffffff81314010>] ? gs_change+0xb/0xb arc_adapt D 0000000000000000 0 1593 2 0x00000000 ffff880413701b20 0000000000000046 ffff880413701b30 ffff880413700010 ffff88041da09980 0000000000010900 ffff880413701fd8 0000000000004000 ffff880413701fd8 0000000000010900 ffff88018ced1320 ffff88041da09980 Call Trace: [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs] [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs] [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs] [<ffffffff81311c28>] schedule+0x5f/0x61 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123 [<ffffffff813109fa>] mutex_lock+0x1e/0x32 [<ffffffffa017aa55>] ? 
zfs_znode_dmu_fini+0x18/0x27 [zfs] [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs] [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs] [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs] [<ffffffff810cd1b4>] evict+0xa2/0x155 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8 [<ffffffff810bb67f>] prune_super+0xe3/0x158 [<ffffffffa016fc93>] zfs_sb_prune+0x76/0x9b [zfs] [<ffffffffa018d119>] zpl_prune_sb+0x21/0x24 [zfs] [<ffffffffa00f0a37>] arc_adapt_thread+0x28b/0x497 [zfs] [<ffffffffa018d0f8>] ? zpl_inode_alloc+0x6b/0x6b [zfs] [<ffffffffa00f07ac>] ? arc_adjust+0x105/0x105 [zfs] [<ffffffffa00a4cd8>] ? __thread_create+0x122/0x122 [spl] [<ffffffffa00a4d44>] thread_generic_wrapper+0x6c/0x79 [spl] [<ffffffffa00a4cd8>] ? __thread_create+0x122/0x122 [spl] [<ffffffff8103eecc>] kthread+0x84/0x8c [<ffffffff81314014>] kernel_thread_helper+0x4/0x10 [<ffffffff8103ee48>] ? kthread_freezable_should_stop+0x5d/0x5d [<ffffffff81314010>] ? gs_change+0xb/0xb
User proc(s):
usage.pl D 0000000000000001 0 8726 8169 0x00000000 ffff880206af11c8 0000000000000082 ffff880206af1128 ffff880206af0010 ffff880365a31980 0000000000010900 ffff880206af1fd8 0000000000004000 ffff880206af1fd8 0000000000010900 ffff88040d400000 ffff880365a31980 Call Trace: [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs] [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs] [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs] [<ffffffff812063a5>] ? scsi_dma_map+0x82/0x9d [<ffffffff81311c28>] schedule+0x5f/0x61 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123 [<ffffffff813109fa>] mutex_lock+0x1e/0x32 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs] [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs] [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs] [<ffffffff81088ee7>] ? truncate_pagecache+0x51/0x59 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs] [<ffffffff810cd1b4>] evict+0xa2/0x155 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8 [<ffffffff810bb67f>] prune_super+0xe3/0x158 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c [<ffffffff8108b9b8>] do_try_to_free_pages+0x318/0x4a4 [<ffffffff810845c9>] ? get_page_from_freelist+0x477/0x4a8 [<ffffffff8108bbac>] try_to_free_pages+0x68/0x6a [<ffffffff81084cd2>] __alloc_pages_nodemask+0x4d1/0x75e [<ffffffffa014c861>] ? vdev_disk_io_start+0x13b/0x156 [zfs] [<ffffffff810af134>] new_slab+0x65/0x1ee [<ffffffff810af5c7>] __slab_alloc.clone.1+0x15c/0x239 [<ffffffffa00a2c91>] ? spl_kmem_alloc+0xfc/0x17d [spl] [<ffffffffa00f68fa>] ? dbuf_read+0x632/0x832 [zfs] [<ffffffffa00a2c91>] ? spl_kmem_alloc+0xfc/0x17d [spl] [<ffffffff810af778>] __kmalloc+0xd4/0x127 [<ffffffffa00a2c91>] spl_kmem_alloc+0xfc/0x17d [spl] [<ffffffffa010fb38>] dnode_hold_impl+0x24f/0x53e [zfs] [<ffffffffa010fe3b>] dnode_hold+0x14/0x16 [zfs] [<ffffffffa00ff716>] dmu_bonus_hold+0x24/0x288 [zfs] [<ffffffff813109ed>] ? 
mutex_lock+0x11/0x32 [<ffffffffa012e343>] sa_buf_hold+0x9/0xb [zfs] [<ffffffffa017d296>] zfs_zget+0xc9/0x3b3 [zfs] [<ffffffffa0157481>] ? zap_unlockdir+0x8f/0xa3 [zfs] [<ffffffffa0161cc9>] zfs_dirent_lock+0x4c6/0x50b [zfs] [<ffffffffa016208b>] zfs_dirlook+0x21b/0x284 [zfs] [<ffffffffa015c15c>] ? zfs_zaccess+0x2b0/0x346 [zfs] [<ffffffffa017915d>] zfs_lookup+0x267/0x2ad [zfs] [<ffffffffa018c4a3>] zpl_lookup+0x5b/0xad [zfs] [<ffffffff810c1d38>] __lookup_hash+0xb9/0xdd [<ffffffff810c1fc5>] do_lookup+0x269/0x2a4 [<ffffffff810c343a>] path_lookupat+0xe9/0x5c5 [<ffffffff810af7f1>] ? kmem_cache_alloc+0x26/0xa0 [<ffffffff810c3934>] do_path_lookup+0x1e/0x54 [<ffffffff810c3ba2>] user_path_at_empty+0x50/0x96 [<ffffffff813109ed>] ? mutex_lock+0x11/0x32 [<ffffffffa01769f9>] ? zfs_getattr_fast+0x175/0x186 [zfs] [<ffffffff810c3bb7>] ? user_path_at_empty+0x65/0x96 [<ffffffff810bc8c0>] ? cp_new_stat+0xf2/0x108 [<ffffffff810c3bf4>] user_path_at+0xc/0xe [<ffffffff810bca50>] vfs_fstatat+0x3d/0x68 [<ffffffff810bcb54>] vfs_stat+0x16/0x18 [<ffffffff810bcb70>] sys_newstat+0x1a/0x38 [<ffffffff81312f62>] system_call_fastpath+0x16/0x1b usage.pl D 0000000000000001 0 9251 8462 0x00000000 ffff88010a7baca8 0000000000000082 ffff88010a7bab98 ffff88010a7ba010 ffff88040da75940 0000000000010900 ffff88010a7bbfd8 0000000000004000 ffff88010a7bbfd8 0000000000010900 ffff880365a33300 ffff88040da75940 Call Trace: [<ffffffffa00f58e6>] ? dbuf_rele+0x4b/0x52 [zfs] [<ffffffffa010db75>] ? dnode_rele_and_unlock+0x8c/0x95 [zfs] [<ffffffffa00f535c>] ? dbuf_rele_and_unlock+0x1d3/0x358 [zfs] [<ffffffff81311c28>] schedule+0x5f/0x61 [<ffffffff81311e62>] schedule_preempt_disabled+0x9/0xb [<ffffffff81310679>] __mutex_lock_slowpath+0xde/0x123 [<ffffffff813109fa>] mutex_lock+0x1e/0x32 [<ffffffffa017aa55>] ? zfs_znode_dmu_fini+0x18/0x27 [zfs] [<ffffffffa017b088>] zfs_zinactive+0x61/0x220 [zfs] [<ffffffffa017305d>] zfs_inactive+0x16f/0x20d [zfs] [<ffffffff81088ee7>] ? 
truncate_pagecache+0x51/0x59 [<ffffffffa018d064>] zpl_evict_inode+0x42/0x55 [zfs] [<ffffffff810cd1b4>] evict+0xa2/0x155 [<ffffffff810cd71b>] dispose_list+0x3d/0x4a [<ffffffff810cd9f1>] prune_icache_sb+0x2c9/0x2d8 [<ffffffff810bb67f>] prune_super+0xe3/0x158 [<ffffffff8108b0bf>] shrink_slab+0x122/0x19c [<ffffffff8108b9b8>] do_try_to_free_pages+0x318/0x4a4 [<ffffffff8108bbac>] try_to_free_pages+0x68/0x6a [<ffffffff81084cd2>] __alloc_pages_nodemask+0x4d1/0x75e [<ffffffff810af134>] new_slab+0x65/0x1ee [<ffffffff810af5c7>] __slab_alloc.clone.1+0x15c/0x239 [<ffffffffa00a36cf>] ? spl_kmem_cache_alloc+0xb1/0x7ef [spl] [<ffffffffa00a36cf>] ? spl_kmem_cache_alloc+0xb1/0x7ef [spl] [<ffffffff810af818>] kmem_cache_alloc+0x4d/0xa0 [<ffffffffa00a36cf>] spl_kmem_cache_alloc+0xb1/0x7ef [spl] [<ffffffff810af7f1>] ? kmem_cache_alloc+0x26/0xa0 [<ffffffffa00a36cf>] ? spl_kmem_cache_alloc+0xb1/0x7ef [spl] [<ffffffffa0185ccb>] zio_create+0x50/0x2ce [zfs] [<ffffffffa0186d50>] zio_read+0x97/0x99 [zfs] [<ffffffffa00f183a>] ? arc_fini+0x6be/0x6be [zfs] [<ffffffffa00efb7f>] arc_read+0x8f5/0x93a [zfs] [<ffffffffa00a2b14>] ? spl_kmem_zalloc+0xf1/0x172 [spl] [<ffffffffa00f92d2>] dbuf_prefetch+0x2a7/0x2c9 [zfs] [<ffffffffa010c2c4>] dmu_zfetch_dofetch+0xe0/0x122 [zfs] [<ffffffffa010ca42>] dmu_zfetch+0x73c/0xfb9 [zfs] [<ffffffffa00f640c>] dbuf_read+0x144/0x832 [zfs] [<ffffffffa00a2b14>] ? spl_kmem_zalloc+0xf1/0x172 [spl] [<ffffffff810af3b6>] ? kfree+0x1b/0xad [<ffffffffa00a274c>] ? spl_kmem_free+0x35/0x37 [spl] [<ffffffffa010face>] dnode_hold_impl+0x1e5/0x53e [zfs] [<ffffffffa010fe3b>] dnode_hold+0x14/0x16 [zfs] [<ffffffffa00ff716>] dmu_bonus_hold+0x24/0x288 [zfs] [<ffffffff813109ed>] ? mutex_lock+0x11/0x32 [<ffffffffa012e343>] sa_buf_hold+0x9/0xb [zfs] [<ffffffffa017d296>] zfs_zget+0xc9/0x3b3 [zfs] [<ffffffffa0157481>] ? zap_unlockdir+0x8f/0xa3 [zfs] [<ffffffffa0161cc9>] zfs_dirent_lock+0x4c6/0x50b [zfs] [<ffffffffa016208b>] zfs_dirlook+0x21b/0x284 [zfs] [<ffffffffa015c15c>] ? 
zfs_zaccess+0x2b0/0x346 [zfs] [<ffffffffa017915d>] zfs_lookup+0x267/0x2ad [zfs] [<ffffffffa018c4a3>] zpl_lookup+0x5b/0xad [zfs] [<ffffffff810c1d38>] __lookup_hash+0xb9/0xdd [<ffffffff810c1fc5>] do_lookup+0x269/0x2a4 [<ffffffff810c343a>] path_lookupat+0xe9/0x5c5 [<ffffffff810c23b5>] ? getname_flags+0x2b/0x1bc [<ffffffff810af7f1>] ? kmem_cache_alloc+0x26/0xa0 [<ffffffff810c3934>] do_path_lookup+0x1e/0x54 [<ffffffff810c3ba2>] user_path_at_empty+0x50/0x96 [<ffffffff813109ed>] ? mutex_lock+0x11/0x32 [<ffffffffa01769f9>] ? zfs_getattr_fast+0x175/0x186 [zfs] [<ffffffff810bc8c0>] ? cp_new_stat+0xf2/0x108 [<ffffffff810c3bf4>] user_path_at+0xc/0xe [<ffffffff810bca50>] vfs_fstatat+0x3d/0x68 [<ffffffff810bcb54>] vfs_stat+0x16/0x18 [<ffffffff810bcb70>] sys_newstat+0x1a/0x38 [<ffffffff81312f62>] system_call_fastpath+0x16/0x1b
Kernel version 3.2.86
SPL: Loaded module v0.6.3-77_g79a0056 ZFS: Loaded module v0.6.3-265_gb2e5a32, ZFS pool version 5000, ZFS filesystem version 5 pool: whoopass1 state: ONLINE status: Some supported features are not enabled on the pool. The pool can still be used, but some features are unavailable. action: Enable all features using 'zpool upgrade'. Once this is done, the pool may no longer be accessible by software that does not support the features. See zpool-features(5) for details. scan: scrub canceled on Mon Mar 23 09:37:24 2015 config: NAME STATE READ WRITE CKSUM whoopass1 ONLINE 0 0 0 sdc ONLINE 0 0 0 cache SSD-l2arc ONLINE 0 0 0 errors: No known data errors
Events in progress at the time of the hang were primarily a boatload of du-style operations which would (legitimately) send the load average to ~24.
Ugh... posted this in the wrong place a little while ago...
I think I've tracked down another deadlock. This one involves zf_rwlock acquired in dmu_zfetch() and z_hold_mtx acquired in zfs_zget().
One process has this stack:
[<ffffffffa0704c27>] dmu_zfetch+0x327/0x1880 [zfs] (tries to acquire zf_rwlock)
[<ffffffffa06dfa81>] dbuf_read+0x2e1/0x11d0 [zfs]
[<ffffffffa0708c9e>] dnode_hold_impl+0x11e/0xb40 [zfs]
[<ffffffffa07096d9>] dnode_hold+0x19/0x20 [zfs]
[<ffffffffa06ef37c>] dmu_bonus_hold+0x2c/0x450 [zfs]
[<ffffffffa073f2be>] sa_buf_hold+0x2e/0x80 [zfs]
[<ffffffffa07c0549>] zfs_zget+0xb9/0x690 [zfs] (acquires a z_hold_mtx[] entry)
[<ffffffffa079443a>] zfs_dirent_lock+0x6da/0xab0 [zfs]
[<ffffffffa0794891>] zfs_dirlook+0x81/0x430 [zfs]
[<ffffffffa07b205e>] zfs_lookup+0x3de/0x490 [zfs]
[<ffffffffa07db22b>] zpl_lookup+0x6b/0x150 [zfs]
and the other is like this (it's a long one):
[<ffffffffa07c1892>] zfs_zinactive+0x92/0x3d0 [zfs] (tries to acquire the z_hold_mtx[] entry)
[<ffffffffa07b3b42>] zfs_inactive+0x72/0x480 [zfs]
[<ffffffffa07dc033>] zpl_evict_inode+0x43/0x90 [zfs]
[<ffffffff811e4ed0>] evict+0xb0/0x1b0
[<ffffffff811e5009>] dispose_list+0x39/0x50
[<ffffffff811e5f37>] prune_icache_sb+0x47/0x60
[<ffffffff811cc119>] super_cache_scan+0xf9/0x160
[<ffffffff8116a727>] shrink_slab+0x1c7/0x370
[<ffffffff8116d6cd>] do_try_to_free_pages+0x3ed/0x540
Uh oh....
[<ffffffff8116d90c>] try_to_free_pages+0xec/0x180
[<ffffffff81162218>] __alloc_pages_nodemask+0x838/0xb90
[<ffffffff811a116e>] alloc_pages_current+0xfe/0x1c0
[<ffffffff811a9ed5>] new_slab+0x295/0x320
[<ffffffff81738721>] __slab_alloc+0x2f8/0x4a8
[<ffffffff811ac684>] kmem_cache_alloc+0x184/0x1d0
[<ffffffffa0307799>] spl_kmem_cache_alloc+0xe9/0xca0 [spl]
[<ffffffffa07ce724>] zio_create+0x84/0x780 [zfs]
[<ffffffffa07cfe7d>] zio_vdev_child_io+0xdd/0x190 [zfs]
[<ffffffffa0777e57>] vdev_mirror_io_start+0xa7/0x1d0 [zfs]
[<ffffffffa07d3197>] zio_vdev_io_start+0x2d7/0x560 [zfs]
[<ffffffffa07d07df>] zio_nowait+0x10f/0x310 [zfs]
[<ffffffffa06d0673>] arc_read+0x683/0x1270 [zfs]
[<ffffffffa06e17ba>] dbuf_prefetch+0x2da/0x5a0 [zfs]
[<ffffffffa07042c4>] dmu_zfetch_dofetch.isra.4+0x104/0x1c0 [zfs]
[<ffffffffa07050d3>] dmu_zfetch+0x7d3/0x1880 [zfs] (acquires zf_rwlock)
[<ffffffffa06e04f4>] dbuf_read+0xd54/0x11d0 [zfs]
[<ffffffffa0707c68>] dnode_next_offset_level+0x398/0x640 [zfs]
[<ffffffffa070de76>] dnode_next_offset+0x96/0x470 [zfs]
[<ffffffffa06f1d03>] dmu_object_next+0x43/0x60 [zfs]
[<ffffffffa06f1e60>] dmu_object_alloc+0x140/0x270 [zfs]
[<ffffffffa07bdb58>] zfs_mknode+0xf8/0xf10 [zfs]
[<ffffffffa07b87ec>] zfs_create+0x62c/0x9b0 [zfs]
[<ffffffffa07dbae4>] zpl_create+0x94/0x1c0 [zfs]
Deadlock...
I had been zeroing in on dmu_zfetch & friends earlier on in my work on this.
@dweeezil could you open a pull request with the fixes you've already put together for the deadlocks? I won't merge anything until we're all agreed they're the right fixes, but it would be nice to have them all in one place.
@behlendorf I'll do that later this evening. I was able to run my stress test with my latest patch for a while this afternoon with no hangs.
@kpande Thanks. I'll follow-up later today. I'm going to post my WIP branch as a pull request shortly so it can get some more attention and will add a couple more comments there.
I'm going to be moving all my future updates to #3225. I'm still able to create deadlocks, but have to try much much harder now. Follow-up in the referenced pull request.
The vast majority of these deadlocks have been resolved by the patch in #3225. We'll address any remaining issues in individual issues.
As outlined in the kmem-rework merge, and follow-up commits such as 4ec15b8, there are, and may continue to be, additional code paths within ZFS which require inhibiting filesystem re-entry during reclaim.
I've identified several conditions under which a system can be deadlocked rather easily. The easiest instance to reproduce involves the interaction between zfs_zinactive() and zfs_zget() due to the manner in which the z_hold_mtx[] mutex array is used. zfs_zinactive() has the following code to try to prevent deadlocks:

I've produced a patch which flattens the z_hold_mtx[] array from a 256-element hash to a series of 4 consecutive mutexes (also used in a hash-like manner) in order to get better diagnostics from Linux's lock debugging facilities. Here's an example from one of this class of deadlocks:

The first process is holding element 0 while trying to acquire element 2, and the second is holding element 2 while trying to acquire element 0. If the stock ZFS code were being used, the locks would simply show up as "z_hold_mtx[i]". Here's the interesting part of the stack trace from the first process:
The stack trace from the second process is almost identical so I'll not post it. This class of deadlock appears to be caused by the hash table use of the mutex array: there are only 256 mutexes for an arbitrary number of objects.
The most obvious solutions to fix this deadlock are to either use spl_fstrans_mark() in zfs_zget() or to use the new MUTEX_FSTRANS flag on the mutexes. Unfortunately, neither works to completely eliminate this class of deadlocks. Adding the fstrans flag to zfs_zget() allows the same deadlock to occur via a different path; the common path becomes sa_buf_hold(). Again, the second stack is pretty much identical.
Moving the whole array of mutexes under the FSTRANS lock mode still allows a deadlock to occur, but I've not yet identified the paths involved.
The reproducing scripts are available in https://github.com/dweeezil/stress-tests. They're trivial scripts which perform a bunch of simple operations in parallel. Given a fully populated test tree, I've generally only needed to run the create.sh script in order to produce a deadlock. With the cut-down pseudo mutex array under MUTEX_FSTRANS, I generally also need to run the remove-random.sh script concurrently in order to produce a deadlock.

One other note is that I first encountered these deadlocks on a system with either 4 or 8GiB of RAM (I can't remember which). I've been using mem=4g on my kernel boot line to reproduce this. I'll also note that my test system has no swap space. My goal was to exercise the system with highly parallel operations and with lots of reclaim happening.

These scripts are trivially simple, and with stock code on a system in which reclaim is necessary they will generally cause a deadlock in less than a minute, so I do consider this to be a Big Problem.