After checking dn=ffff882a3d59a470 in dnode_dest, it seems dn->dn_mtx was being held (mutex_enter) in dbuf_evict_thread and has now been released (mutex_exit).
crash> dnode_t.dn_mtx ffff882a3d59a470 -xo
typedef struct dnode {
[ffff882a3d59a568] kmutex_t dn_mtx;
} dnode_t;
crash> kmutex_t.m_owner ffff882a3d59a568
m_owner = 0x0
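For context, the VERIFY at dnode.c:178 comes from tearing down dn->dn_mtx: the SPL's mutex_destroy() path checks that no task still owns the mutex. A rough sketch of that check, based on my reading of the 0.7.x SPL sources (not the verbatim macros):

/* Sketch only, not verbatim SPL code. */
static inline struct task_struct *
mutex_owner(kmutex_t *mp)
{
	/* m_owner is set by mutex_enter() and cleared by mutex_exit(). */
	return (ACCESS_ONCE(mp->m_owner));
}

/* dnode_dest() -> mutex_destroy(&dn->dn_mtx) effectively does: */
VERIFY3P(mutex_owner(&dn->dn_mtx), ==, NULL);

So the m_owner = 0x0 seen in the dump only reflects the state after the holder dropped the lock; at the instant the VERIFY fired, m_owner still pointed at the holding task.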
Our servers crashed 5 times with the above call stack, plus one more recent crash without kdump that produced the following call stack. I think they are the same issue.
Sep 6 18:38:56 10.16.110.66 [5349213.947198] VERIFY3(((__attribute__((unused)) typeof((&dn->dn_mtx)->m_owner) _)&((&dn->dn_mtx)->m_owner); }))) == ((void *)0)) failed (ffff883e7cbe3800 == (null))
Sep 6 18:38:56 10.16.110.66 [5349213.960792] PANIC at dnode.c:178:dnode_dest()
Sep 6 18:38:56 10.16.110.66 [5349213.964431] Kernel panic - not syncing: VERIFY3(((__attribute__((unused)) typeof((&dn->dn_mtx)->m_owner) _)&((&dn->dn_mtx)->m_owner); }))) == ((void *)0)) failed (ffff883e7cbe3800 == (null))
[5349213.964431]
Sep 6 18:38:56 10.16.110.66 [5349213.983351] CPU: 36 PID: 2367 Comm: txg_sync Tainted: P OE 4.4.0-116-generic #140~14.04.1
Sep 6 18:38:57 10.16.110.66 [5349213.991291] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.14 01/09/2018
Sep 6 18:38:57 10.16.110.66 [5349213.999843] 0000000000000000 ffff883e42287818 ffffffff813e416c ffffffffc0464e5e
Sep 6 18:38:57 10.16.110.66 [5349214.009000] 00000000000000b2 ffff883e42287890 ffffffff81186b2c ffff883e00000010
Sep 6 18:38:57 10.16.110.66 [5349214.018506] ffff883e422878a0 ffff883e42287840 ffffffff81187333 ffff883e422878c0
Sep 6 18:38:57 10.16.110.66 [5349214.028437] Call Trace:
Sep 6 18:38:57 10.16.110.66 [5349214.033347] [<ffffffff813e416c>] dump_stack+0x63/0x87
Sep 6 18:38:57 10.16.110.66 [5349214.038428] [<ffffffff81186b2c>] panic+0xc8/0x221
Sep 6 18:38:57 10.16.110.66 [5349214.043398] [<ffffffff81187333>] ? printk+0x50/0x52
Sep 6 18:38:57 10.16.110.66 [5349214.048284] [<ffffffffc045e75d>] spl_panic+0xfd/0x100 [spl]
Sep 6 18:38:57 10.16.110.66 [5349214.053257] [<ffffffffc0458dda>] ? spl_kmem_free+0x2a/0x40 [spl]
Sep 6 18:38:57 10.16.110.66 [5349214.058379] [<ffffffffc0f6bf72>] dnode_dest+0xd2/0x140 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.063533] [<ffffffffc045a8cd>] spl_kmem_cache_free+0x2d/0x1c0 [spl]
Sep 6 18:38:57 10.16.110.66 [5349214.068707] [<ffffffffc0f6c400>] dnode_destroy+0x1d0/0x220 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.073940] [<ffffffffc0f6cfdb>] dnode_special_close+0x4b/0x70 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.079073] [<ffffffffc0f602fa>] dmu_objset_evict_done+0x1a/0x250 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.084120] [<ffffffffc0f60602>] dmu_objset_evict+0xd2/0xe0 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.088997] [<ffffffffc0f75ce1>] dsl_dataset_clone_swap_sync_impl+0x151/0x880 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.098510] [<ffffffffc0f52666>] ? dmu_buf_rele+0x36/0x40 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.103383] [<ffffffffc0f7e991>] ? dsl_dir_rele+0x31/0x40 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.108086] [<ffffffffc0f74248>] ? dsl_dataset_hold+0x118/0x240 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.112679] [<ffffffffc045a9d8>] ? spl_kmem_cache_free+0x138/0x1c0 [spl]
Sep 6 18:38:57 10.16.110.66 [5349214.117230] [<ffffffffc0f63dcf>] dmu_recv_end_sync+0xbf/0x490 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.121801] [<ffffffffc0f964f0>] ? rrw_enter_write+0x90/0xa0 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.126332] [<ffffffffc0f8b562>] dsl_sync_task_sync+0x112/0x120 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.130830] [<ffffffffc0f82f6d>] dsl_pool_sync+0x2cd/0x400 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.135204] [<ffffffffc0fa0978>] spa_sync+0x3e8/0xd40 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.139442] [<ffffffff810ab0d2>] ? default_wake_function+0x12/0x20
Sep 6 18:38:57 10.16.110.66 [5349214.143656] [<ffffffffc0fb321a>] txg_sync_thread+0x2ca/0x490 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.147770] [<ffffffffc0fb2f50>] ? txg_delay+0x150/0x150 [zfs]
Sep 6 18:38:57 10.16.110.66 [5349214.151744] [<ffffffffc045b380>] ? __thread_exit+0x20/0x20 [spl]
Sep 6 18:38:57 10.16.110.66 [5349214.155619] [<ffffffffc045b3f3>] thread_generic_wrapper+0x73/0x80 [spl]
Sep 6 18:38:57 10.16.110.66 [5349214.159427] [<ffffffff8109f138>] kthread+0xd8/0xf0
Sep 6 18:38:57 10.16.110.66 [5349214.163113] [<ffffffff8109f060>] ? kthread_park+0x60/0x60
Sep 6 18:38:57 10.16.110.66 [5349214.166840] [<ffffffff81819f35>] ret_from_fork+0x55/0x80
Sep 6 18:38:57 10.16.110.66 [5349214.170529] [<ffffffff8109f060>] ? kthread_park+0x60/0x60
Sep 6 18:38:57 10.16.110.66 [5349214.174247] Kernel Offset: disabled
Sep 6 18:38:57 10.16.110.66 [5349214.251733] ---[ end Kernel panic - not syncing: VERIFY3(((__attribute__((unused)) typeof((&dn->dn_mtx)->m_owner) _)&((&dn->dn_mtx)->m_owner); }))) == ((void *)0)) failed (ffff883e7cbe3800 == (null))
[5349214.251733]
We are seeing this at Datto using 0.7.8. Anybody seeing it in 0.8.x?
Hello @loyou, we also face this issue with 0.7.8.x. How did you get around the problem, did you upgrade to zfs 0.8.x?
No upgrade until now; I made this workaround and it works fine:
static void
dnode_dest(void *arg, void *unused)
{
	int i;
	dnode_t *dn = arg;
+	struct task_struct *t;

	rw_destroy(&dn->dn_struct_rwlock);
+	/* Wait until no thread holds dn_mtx before it gets destroyed below. */
+	while ((t = mutex_owner(&dn->dn_mtx)) != NULL) {
+		printk("%s mutex held by %s:%d, waiting for release\n",
+		    __func__, t->comm, t->pid);
+		schedule_timeout_interruptible(msecs_to_jiffies(1000));
+	}
...
Thank you for this advice @loyou. Did the printk output confirm that it was dbuf_evict_thread that was holding the dn_mtx when dnode_special_close --> dnode_destroy --> dnode_dest occurred on the same dnode?
Yes, I think so. The workaround patch makes sure nobody is holding dn->dn_mtx before it is destroyed.
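For reference, the waiting logic can also be written as a small stand-alone helper; this is only an illustration of the same idea, and the name wait_for_mutex_release is mine, not from the patch:

/* Illustrative sketch for the SPL/ZFS kernel-module context, not part of the patch. */
static void
wait_for_mutex_release(kmutex_t *mp, const char *tag)
{
	struct task_struct *t;

	/* Poll once a second until mutex_exit() clears the owner. */
	while ((t = mutex_owner(mp)) != NULL) {
		printk("%s: mutex held by %s:%d, waiting for release\n",
		    tag, t->comm, t->pid);
		schedule_timeout_interruptible(msecs_to_jiffies(1000));
	}
}

Note that this only closes the window for this particular assertion; the underlying race with dbuf_evict_thread is still there, so it remains a workaround rather than a fix.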
System information
Describe the problem you're observing
The server got a kernel panic and crashed.
Describe how to reproduce the problem
Not sure.
Include any warning/errors/backtraces from the system logs
dn->dn_mtx's owner not being NULL is what caused this panic.
To check its holder:
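Roughly, the holder can be identified from the dump with the crash utility; the placeholders below stand in for real addresses, and the struct commands mirror the output quoted earlier in the thread (exact usage may vary by crash version):

crash> dnode_t.dn_mtx <dnode address> -xo      (locate the embedded kmutex_t)
crash> kmutex_t.m_owner <kmutex_t address>     (read the owning task_struct pointer)
crash> task <m_owner value>                    (inspect the holder, if the pointer is non-NULL)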