jspiros opened 7 years ago
This seems similar to #2025, at least insofar as the second log above references `dmu_objset_userquota_get_ids` right before the `zfs_space_delta_cb` that panics. A comment on #2025 also seems to suggest that this may be related to #2010, given that a very recent (relative to when this occurred) timestamp seems to be appearing unexpectedly in `sa.sa_magic`.
I deleted the snapshots of this filesystem that were created after the initial panic, just in case the data causing the problem was created at the first panic. But that didn't help; I'm still running into this problem with older snapshots.
Testing against poolA, which is imported on a separate machine that I can afford to have panic, I think I have this panic somewhat reproducible by attempting to remove a particular set of files. When I attempt to `rm` them, and it happens, the timestamp that shouldn't be in `sa.sa_magic` (and thus causes the panic) is generally within a second of the moment the panic occurs.
I haven't narrowed it down to a precise file, as I'm not entirely sure of the best way to go about that. Given @avg-I's theory about why this may occur, I'm not even sure that it would be a particular file causing trouble, especially since the timestamp would correspond to the removal and not to some stale corrupt data from the past. So maybe it's just a race condition happening at some point when many files are being removed very quickly on this filesystem?
My old pool, which these filesystems were sent from, is destroyed. I did save the `zpool history` of that pool, and did not see any `zfs upgrade` commands in it (though the history is not complete, as the pool was originally created under zfs-fuse and earlier versions of ZOL, which apparently didn't keep history). I don't know what version the filesystem had on the original pool, but it is v5 on poolA and poolB, which still exists.
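For reference, the checks I'm describing amount to something like this (the saved-history filename and the dataset names are placeholders, not the real ones):
grep 'zfs upgrade' oldpool-history.txt          # no explicit upgrades appear in the saved history
zfs get version poolA/somefs poolB/somefs       # both report filesystem version 5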
I'm trying to figure out a way to get poolB up and running and fully functional without panics on my production machine. I'm walking through the code to see if there's a workaround that I could compile into a custom build until the underlying problem is solved. If @avg-I's theory is correct, I'm thinking it might be safe to disable the `zfs_sa_upgrade` callback by commenting out the two `sa_register_update_callback(os, zfs_sa_upgrade);` lines in `zfs_vfsops.c` (lines 976 and 2028 in the version of `zfs_vfsops.c` in master) to stop the race condition from occurring. It seems like everything else should still be compatible with the older structure for the time being. But I still need to read the code some more to ensure that nothing else assumes the upgrade callback is in place and running on otherwise compatible (version 5) filesystems before I'll try it. This related comment in the code encourages me, though.
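If anyone else wants to experiment with the same thing, the two call sites can be located in a source checkout with something like this (a sketch; I'm assuming the file lives at module/zfs/zfs_vfsops.c in-tree, and the line numbers vary by release):
grep -n 'sa_register_update_callback(os, zfs_sa_upgrade)' module/zfs/zfs_vfsops.c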
Also, if @avg-I's theory is correct, it's probably safe to assume that my original pool's filesystems in question were not version 5, as I was running the same version of the module with that pool and didn't run into any of these panics until switching to the new pool with the old filesystems sent over. Though that would mean that receive doesn't retain the original filesystem version on the destination pool as part of a replication, the way it retains other filesystem properties.
If anyone has any suggestions for how to fix the original problem or how I might work around it or whether my idea for how to work around it might work, please let me know!
I just hit this as well. I'm testing ZFS and Lustre to see if I can upgrade, with this pointing to 'no'.
ZFS: 0.7.0-1
Lustre: 2.10.0-1
OS: CentOS 7.3.1611
Kernel: 3.10.0-514.26.1.el7.x86_64
Test: Mounted Lustre on a single remote client. Ran both 'fio' and 'mdtest' concurrently in 2 loops:
fio --size=200G --rw=rw --bs=128k --numjobs=16 --iodepth=16 --ioengine=libaio --direct=0 --invalidate=1 --fsync_on_close=1 --norandommap --name test1 --filename=blah --group_reporting
mpirun --allow-run-as-root -n 8 ./mdtest -z10 -n1250 -i5 -R13 -N1
[ 6004.896506] VERIFY3(sa.sa_magic == 0x2F505A) failed (24 == 3100762)
[ 6004.903567] PANIC at zfs_vfsops.c:583:zfs_space_delta_cb()
[ 6004.909720] Showing stack for process 130238
[ 6004.909724] CPU: 5 PID: 130238 Comm: mdt01_006 Tainted: P OE ------------ 3.10.0-514.26.1.el7.x86_64 #1
[ 6004.909725] Hardware name: Dell Inc. PowerEdge C6320/082F9M, BIOS 2.4.2 01/09/2017
[ 6004.909726] ffffffffa18ad43b 0000000028956b86 ffff8826264fb420 ffffffff81687233
[ 6004.909729] ffff8826264fb430 ffffffffa027c234 ffff8826264fb5b8 ffffffffa027c309
[ 6004.909730] 000000030000000e ffff882c00000030 ffff8826264fb5c8 ffff8826264fb568
[ 6004.909732] Call Trace:
[ 6004.909749] [<ffffffff81687233>] dump_stack+0x19/0x1b
[ 6004.909766] [<ffffffffa027c234>] spl_dumpstack+0x44/0x50 [spl]
[ 6004.909771] [<ffffffffa027c309>] spl_panic+0xc9/0x110 [spl]
[ 6004.909776] [<ffffffff811de800>] ? kmem_cache_open+0x4c0/0x4d0
[ 6004.909780] [<ffffffffa0278319>] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[ 6004.909786] [<ffffffff8168aa82>] ? mutex_lock+0x12/0x2f
[ 6004.909814] [<ffffffffa174d191>] ? dbuf_find+0x141/0x150 [zfs]
[ 6004.909819] [<ffffffffa0278319>] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[ 6004.909821] [<ffffffff8168aa82>] ? mutex_lock+0x12/0x2f
[ 6004.909828] [<ffffffff810eacce>] ? getrawmonotonic64+0x2e/0xc0
[ 6004.909849] [<ffffffffa176c40d>] ? dmu_zfetch+0x40d/0x4a0 [zfs]
[ 6004.909854] [<ffffffffa027851d>] ? spl_kmem_cache_free+0x14d/0x1d0 [spl]
[ 6004.909882] [<ffffffffa17f1afc>] zfs_space_delta_cb+0x9c/0x200 [zfs]
[ 6004.909897] [<ffffffffa175f6ae>] dmu_objset_userquota_get_ids+0x13e/0x3e0 [zfs]
[ 6004.909913] [<ffffffffa176df2e>] dnode_setdirty+0x3e/0x120 [zfs]
[ 6004.909929] [<ffffffffa176e2f6>] dnode_allocate+0x186/0x220 [zfs]
[ 6004.909943] [<ffffffffa175bb1e>] dmu_object_alloc_dnsize+0x29e/0x360 [zfs]
[ 6004.909964] [<ffffffffa17d2b92>] zap_create_flags_dnsize+0x42/0xc0 [zfs]
[ 6004.909973] [<ffffffffa087f9d8>] __osd_zap_create+0x88/0x100 [osd_zfs]
[ 6004.909978] [<ffffffffa087fc27>] osd_mkdir+0x97/0x140 [osd_zfs]
[ 6004.909981] [<ffffffffa087ed12>] osd_create+0x2b2/0x9d0 [osd_zfs]
[ 6004.909986] [<ffffffffa02771a0>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
[ 6004.910005] [<ffffffffa112d3b5>] lod_sub_create+0x1f5/0x480 [lod]
[ 6004.910012] [<ffffffffa1122419>] lod_create+0x69/0x2c0 [lod]
[ 6004.910025] [<ffffffffa118e9d5>] mdd_create_object_internal+0xb5/0x280 [mdd]
[ 6004.910030] [<ffffffffa117a355>] mdd_create_object+0x75/0xb80 [mdd]
[ 6004.910034] [<ffffffffa11805a8>] ? mdd_declare_create+0x578/0xe20 [mdd]
[ 6004.910038] [<ffffffffa11843ce>] mdd_create+0xd2e/0x1330 [mdd]
[ 6004.910056] [<ffffffffa1073d56>] mdt_create+0x846/0xbb0 [mdt]
[ 6004.910087] [<ffffffffa0c880e4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
[ 6004.910098] [<ffffffffa0c87d5d>] ? lprocfs_stats_unlock+0x3d/0x50 [obdclass]
[ 6004.910106] [<ffffffffa107422b>] mdt_reint_create+0x16b/0x350 [mdt]
[ 6004.910114] [<ffffffffa1075730>] mdt_reint_rec+0x80/0x210 [mdt]
[ 6004.910120] [<ffffffffa10572fb>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[ 6004.910127] [<ffffffffa1062e37>] mdt_reint+0x67/0x140 [mdt]
[ 6004.910185] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6004.910208] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6004.910230] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6004.910235] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6004.910239] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6004.910260] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6004.910280] [<ffffffffa0e906f0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 6004.910284] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6004.910286] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6004.910289] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6004.910292] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6205.238451] LNet: Service thread pid 68548 was inactive for 200.28s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 6205.238457] Pid: 68548, comm: mdt01_000
[ 6205.238459]
Call Trace:
[ 6205.238475] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6205.238480] [<ffffffff8168e2a5>] rwsem_down_read_failed+0xf5/0x170
[ 6205.238485] [<ffffffff81327718>] call_rwsem_down_read_failed+0x18/0x30
[ 6205.238490] [<ffffffff8168ba80>] down_read+0x20/0x30
[ 6205.238558] [<ffffffffa175ac90>] dmu_object_info_from_dnode+0x20/0x90 [zfs]
[ 6205.238591] [<ffffffffa175ad45>] dmu_object_info+0x45/0x80 [zfs]
[ 6205.238622] [<ffffffffa175b7ef>] dmu_object_next+0x7f/0x110 [zfs]
[ 6205.238652] [<ffffffffa175b9f4>] dmu_object_alloc_dnsize+0x174/0x360 [zfs]
[ 6205.238659] [<ffffffff810eacce>] ? getrawmonotonic64+0x2e/0xc0
[ 6205.238670] [<ffffffff811dbdd9>] ? ___slab_alloc+0x209/0x4f0
[ 6205.238676] [<ffffffff81333b8d>] ? list_del+0xd/0x30
[ 6205.238732] [<ffffffffa17d2b92>] zap_create_flags_dnsize+0x42/0xc0 [zfs]
[ 6205.238746] [<ffffffffa087f9d8>] __osd_zap_create+0x88/0x100 [osd_zfs]
[ 6205.238756] [<ffffffffa087fc27>] osd_mkdir+0x97/0x140 [osd_zfs]
[ 6205.238764] [<ffffffffa087ed12>] osd_create+0x2b2/0x9d0 [osd_zfs]
[ 6205.238790] [<ffffffffa112d3b5>] lod_sub_create+0x1f5/0x480 [lod]
[ 6205.238804] [<ffffffffa1122419>] lod_create+0x69/0x2c0 [lod]
[ 6205.238823] [<ffffffffa118e9d5>] mdd_create_object_internal+0xb5/0x280 [mdd]
[ 6205.238833] [<ffffffffa117a355>] mdd_create_object+0x75/0xb80 [mdd]
[ 6205.238843] [<ffffffffa11805a8>] ? mdd_declare_create+0x578/0xe20 [mdd]
[ 6205.238853] [<ffffffffa11843ce>] mdd_create+0xd2e/0x1330 [mdd]
[ 6205.238882] [<ffffffffa1073d56>] mdt_create+0x846/0xbb0 [mdt]
[ 6205.238930] [<ffffffffa0c880e4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
[ 6205.238955] [<ffffffffa0c87d5d>] ? lprocfs_stats_unlock+0x3d/0x50 [obdclass]
[ 6205.238972] [<ffffffffa107422b>] mdt_reint_create+0x16b/0x350 [mdt]
[ 6205.238988] [<ffffffffa1075730>] mdt_reint_rec+0x80/0x210 [mdt]
[ 6205.239001] [<ffffffffa10572fb>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[ 6205.239016] [<ffffffffa1062e37>] mdt_reint+0x67/0x140 [mdt]
[ 6205.239115] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6205.239167] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6205.239215] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6205.239222] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6205.239228] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6205.239275] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6205.239320] [<ffffffffa0e906f0>] ? ptlrpc_main+0x0/0x1dd0 [ptlrpc]
[ 6205.239326] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6205.239330] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 6205.239334] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6205.239338] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 6205.239344] LustreError: dumping log to /tmp/lustre-log.1501847128.68548
[ 6205.455845] Pid: 130238, comm: mdt01_006
[ 6205.455848]
Call Trace:
[ 6205.455868] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6205.455888] [<ffffffffa027c335>] spl_panic+0xf5/0x110 [spl]
[ 6205.455892] [<ffffffff811de800>] ? kmem_cache_open+0x4c0/0x4d0
[ 6205.455896] [<ffffffffa0278319>] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[ 6205.455899] [<ffffffff8168aa82>] ? mutex_lock+0x12/0x2f
[ 6205.455940] [<ffffffffa174d191>] ? dbuf_find+0x141/0x150 [zfs]
[ 6205.455944] [<ffffffffa0278319>] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[ 6205.455946] [<ffffffff8168aa82>] ? mutex_lock+0x12/0x2f
[ 6205.455954] [<ffffffff810eacce>] ? getrawmonotonic64+0x2e/0xc0
[ 6205.455971] [<ffffffffa176c40d>] ? dmu_zfetch+0x40d/0x4a0 [zfs]
[ 6205.455976] [<ffffffffa027851d>] ? spl_kmem_cache_free+0x14d/0x1d0 [spl]
[ 6205.456006] [<ffffffffa17f1afc>] zfs_space_delta_cb+0x9c/0x200 [zfs]
[ 6205.456022] [<ffffffffa175f6ae>] dmu_objset_userquota_get_ids+0x13e/0x3e0 [zfs]
[ 6205.456040] [<ffffffffa176df2e>] dnode_setdirty+0x3e/0x120 [zfs]
[ 6205.456057] [<ffffffffa176e2f6>] dnode_allocate+0x186/0x220 [zfs]
[ 6205.456073] [<ffffffffa175bb1e>] dmu_object_alloc_dnsize+0x29e/0x360 [zfs]
[ 6205.456112] [<ffffffffa17d2b92>] zap_create_flags_dnsize+0x42/0xc0 [zfs]
[ 6205.456121] [<ffffffffa087f9d8>] __osd_zap_create+0x88/0x100 [osd_zfs]
[ 6205.456124] [<ffffffffa087fc27>] osd_mkdir+0x97/0x140 [osd_zfs]
[ 6205.456128] [<ffffffffa087ed12>] osd_create+0x2b2/0x9d0 [osd_zfs]
[ 6205.456132] [<ffffffffa02771a0>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
[ 6205.456157] [<ffffffffa112d3b5>] lod_sub_create+0x1f5/0x480 [lod]
[ 6205.456162] [<ffffffffa1122419>] lod_create+0x69/0x2c0 [lod]
[ 6205.456179] [<ffffffffa118e9d5>] mdd_create_object_internal+0xb5/0x280 [mdd]
[ 6205.456183] [<ffffffffa117a355>] mdd_create_object+0x75/0xb80 [mdd]
[ 6205.456186] [<ffffffffa11805a8>] ? mdd_declare_create+0x578/0xe20 [mdd]
[ 6205.456190] [<ffffffffa11843ce>] mdd_create+0xd2e/0x1330 [mdd]
[ 6205.456211] [<ffffffffa1073d56>] mdt_create+0x846/0xbb0 [mdt]
[ 6205.456248] [<ffffffffa0c880e4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
[ 6205.456257] [<ffffffffa0c87d5d>] ? lprocfs_stats_unlock+0x3d/0x50 [obdclass]
[ 6205.456264] [<ffffffffa107422b>] mdt_reint_create+0x16b/0x350 [mdt]
[ 6205.456270] [<ffffffffa1075730>] mdt_reint_rec+0x80/0x210 [mdt]
[ 6205.456275] [<ffffffffa10572fb>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[ 6205.456280] [<ffffffffa1062e37>] mdt_reint+0x67/0x140 [mdt]
[ 6205.456354] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6205.456378] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6205.456402] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6205.456406] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6205.456420] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6205.456459] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6205.456481] [<ffffffffa0e906f0>] ? ptlrpc_main+0x0/0x1dd0 [ptlrpc]
[ 6205.456485] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6205.456486] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 6205.456489] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6205.456491] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 6221.111947] LNet: Service thread pid 156609 was inactive for 200.04s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 6221.111953] LNet: Skipped 1 previous similar message
[ 6221.111956] Pid: 156609, comm: ll_ost_io01_012
[ 6221.111958]
Call Trace:
[ 6221.111967] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6221.111985] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6221.111992] [<ffffffff810b1b20>] ? autoremove_wake_function+0x0/0x40
[ 6221.112004] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6221.112044] [<ffffffffa176b453>] dmu_tx_wait+0xa3/0x360 [zfs]
[ 6221.112082] [<ffffffffa176b795>] dmu_tx_assign+0x85/0x410 [zfs]
[ 6221.112091] [<ffffffffa0873eba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[ 6221.112103] [<ffffffffa123206b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 6221.112114] [<ffffffffa123842b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 6221.112125] [<ffffffffa123c2ef>] ofd_commitrw+0x4af/0xa80 [ofd]
[ 6221.112192] [<ffffffffa0f151e2>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 6221.112257] [<ffffffffa0ee8a70>] tgt_brw_write+0x1010/0x1740 [ptlrpc]
[ 6221.112261] [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
[ 6221.112265] [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
[ 6221.112269] [<ffffffff810d065c>] ? enqueue_entity+0x26c/0xb60
[ 6221.112312] [<ffffffffa0e3aa70>] ? target_bulk_timeout+0x0/0xb0 [ptlrpc]
[ 6221.112374] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6221.112428] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6221.112480] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6221.112485] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6221.112489] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6221.112540] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6221.112589] [<ffffffffa0e906f0>] ? ptlrpc_main+0x0/0x1dd0 [ptlrpc]
[ 6221.112594] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6221.112598] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 6221.112601] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6221.112605] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 6221.112611] LustreError: dumping log to /tmp/lustre-log.1501847144.156609
[ 6221.112820] Pid: 99052, comm: ll_ost_io01_008
[ 6221.112822]
Call Trace:
[ 6221.112828] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6221.112842] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6221.112847] [<ffffffff810b1b20>] ? autoremove_wake_function+0x0/0x40
[ 6221.112859] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6221.112912] [<ffffffffa176b453>] dmu_tx_wait+0xa3/0x360 [zfs]
[ 6221.112952] [<ffffffffa176b795>] dmu_tx_assign+0x85/0x410 [zfs]
[ 6221.112966] [<ffffffffa0873eba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[ 6221.112982] [<ffffffffa123206b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 6221.113000] [<ffffffffa123842b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 6221.113010] [<ffffffffa123c2ef>] ofd_commitrw+0x4af/0xa80 [ofd]
[ 6221.113075] [<ffffffffa0f151e2>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 6221.113137] [<ffffffffa0ee8a70>] tgt_brw_write+0x1010/0x1740 [ptlrpc]
[ 6221.113142] [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
[ 6221.113146] [<ffffffff810caebe>] ? account_entity_dequeue+0xae/0xd0
[ 6221.113150] [<ffffffff810ce9ac>] ? dequeue_entity+0x11c/0x5d0
[ 6221.113192] [<ffffffffa0e3aa70>] ? target_bulk_timeout+0x0/0xb0 [ptlrpc]
[ 6221.113252] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6221.113305] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6221.113355] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6221.113360] [<ffffffff810c54e0>] ? default_wake_function+0x0/0x20
[ 6221.113411] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6221.113460] [<ffffffffa0e906f0>] ? ptlrpc_main+0x0/0x1dd0 [ptlrpc]
[ 6221.113465] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6221.113469] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 6221.113473] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6221.113477] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 6221.623994] Pid: 3775, comm: ll_ost_io01_004
[ 6221.623998]
Call Trace:
[ 6221.624007] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6221.624023] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6221.624030] [<ffffffff810b1b20>] ? autoremove_wake_function+0x0/0x40
[ 6221.624041] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6221.624078] [<ffffffffa176b453>] dmu_tx_wait+0xa3/0x360 [zfs]
[ 6221.624112] [<ffffffffa176b795>] dmu_tx_assign+0x85/0x410 [zfs]
[ 6221.624121] [<ffffffffa0873eba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[ 6221.624131] [<ffffffffa123206b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 6221.624141] [<ffffffffa123842b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 6221.624151] [<ffffffffa123c2ef>] ofd_commitrw+0x4af/0xa80 [ofd]
[ 6221.624212] [<ffffffffa0f151e2>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 6221.624272] [<ffffffffa0ee8a70>] tgt_brw_write+0x1010/0x1740 [ptlrpc]
[ 6221.624276] [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
[ 6221.624279] [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
[ 6221.624283] [<ffffffff810d065c>] ? enqueue_entity+0x26c/0xb60
[ 6221.624288] [<ffffffff811dbdd9>] ? ___slab_alloc+0x209/0x4f0
[ 6221.624327] [<ffffffffa0e3aa70>] ? target_bulk_timeout+0x0/0xb0 [ptlrpc]
[ 6221.624384] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6221.624434] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6221.624483] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6221.624487] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6221.624492] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6221.624538] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6221.624584] [<ffffffffa0e906f0>] ? ptlrpc_main+0x0/0x1dd0 [ptlrpc]
[ 6221.624588] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6221.624592] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 6221.624595] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6221.624599] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 6221.624604] LustreError: dumping log to /tmp/lustre-log.1501847145.3775
[ 6221.624759] LNet: Service thread pid 156436 was inactive for 200.35s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[ 6222.136050] LNet: Service thread pid 23404 was inactive for 200.36s. Watchdog stack traces are limited to 3 per 300 seconds, skipping this one.
[ 6242.256995] INFO: task txg_quiesce:32802 blocked for more than 120 seconds.
[ 6242.257000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6242.257003] txg_quiesce D 0000000000000000 0 32802 2 0x00000000
[ 6242.257009] ffff8840354d7d60 0000000000000046 ffff88200726bec0 ffff8840354d7fd8
[ 6242.257014] ffff8840354d7fd8 ffff8840354d7fd8 ffff88200726bec0 ffff880fd9620b28
[ 6242.257017] ffff880fd9620af8 ffff880fd9620b30 ffff880fd9620b20 0000000000000000
[ 6242.257021] Call Trace:
[ 6242.257037] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6242.257070] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6242.257096] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[ 6242.257107] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6242.257174] [<ffffffffa17b4973>] txg_quiesce_thread+0x2a3/0x3a0 [zfs]
[ 6242.257222] [<ffffffffa17b46d0>] ? txg_fini+0x290/0x290 [zfs]
[ 6242.257234] [<ffffffffa0278f91>] thread_generic_wrapper+0x71/0x80 [spl]
[ 6242.257243] [<ffffffffa0278f20>] ? __thread_exit+0x20/0x20 [spl]
[ 6242.257248] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6242.257253] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.257258] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6242.257262] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.257280] INFO: task mdt01_000:68548 blocked for more than 120 seconds.
[ 6242.257282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6242.257283] mdt01_000 D 0000000000000001 0 68548 2 0x00000000
[ 6242.257287] ffff88102e5cf528 0000000000000046 ffff880fda298fb0 ffff88102e5cffd8
[ 6242.257290] ffff88102e5cffd8 ffff88102e5cffd8 ffff880fda298fb0 ffff880fda298fb0
[ 6242.257294] ffff88291ff430c0 fffffffeffffffff ffff88291ff430c8 0000000000000001
[ 6242.257297] Call Trace:
[ 6242.257302] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6242.257306] [<ffffffff8168e2a5>] rwsem_down_read_failed+0xf5/0x170
[ 6242.257312] [<ffffffff81327718>] call_rwsem_down_read_failed+0x18/0x30
[ 6242.257317] [<ffffffff8168ba80>] down_read+0x20/0x30
[ 6242.257352] [<ffffffffa175ac90>] dmu_object_info_from_dnode+0x20/0x90 [zfs]
[ 6242.257383] [<ffffffffa175ad45>] dmu_object_info+0x45/0x80 [zfs]
[ 6242.257413] [<ffffffffa175b7ef>] dmu_object_next+0x7f/0x110 [zfs]
[ 6242.257443] [<ffffffffa175b9f4>] dmu_object_alloc_dnsize+0x174/0x360 [zfs]
[ 6242.257449] [<ffffffff810eacce>] ? getrawmonotonic64+0x2e/0xc0
[ 6242.257456] [<ffffffff811dbdd9>] ? ___slab_alloc+0x209/0x4f0
[ 6242.257461] [<ffffffff81333b8d>] ? list_del+0xd/0x30
[ 6242.257510] [<ffffffffa17d2b92>] zap_create_flags_dnsize+0x42/0xc0 [zfs]
[ 6242.257524] [<ffffffffa087f9d8>] __osd_zap_create+0x88/0x100 [osd_zfs]
[ 6242.257534] [<ffffffffa087fc27>] osd_mkdir+0x97/0x140 [osd_zfs]
[ 6242.257543] [<ffffffffa087ed12>] osd_create+0x2b2/0x9d0 [osd_zfs]
[ 6242.257565] [<ffffffffa112d3b5>] lod_sub_create+0x1f5/0x480 [lod]
[ 6242.257580] [<ffffffffa1122419>] lod_create+0x69/0x2c0 [lod]
[ 6242.257598] [<ffffffffa118e9d5>] mdd_create_object_internal+0xb5/0x280 [mdd]
[ 6242.257608] [<ffffffffa117a355>] mdd_create_object+0x75/0xb80 [mdd]
[ 6242.257618] [<ffffffffa11805a8>] ? mdd_declare_create+0x578/0xe20 [mdd]
[ 6242.257629] [<ffffffffa11843ce>] mdd_create+0xd2e/0x1330 [mdd]
[ 6242.257654] [<ffffffffa1073d56>] mdt_create+0x846/0xbb0 [mdt]
[ 6242.257698] [<ffffffffa0c880e4>] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
[ 6242.257724] [<ffffffffa0c87d5d>] ? lprocfs_stats_unlock+0x3d/0x50 [obdclass]
[ 6242.257742] [<ffffffffa107422b>] mdt_reint_create+0x16b/0x350 [mdt]
[ 6242.257758] [<ffffffffa1075730>] mdt_reint_rec+0x80/0x210 [mdt]
[ 6242.257772] [<ffffffffa10572fb>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[ 6242.257787] [<ffffffffa1062e37>] mdt_reint+0x67/0x140 [mdt]
[ 6242.257877] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6242.257933] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6242.258009] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6242.258028] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6242.258047] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6242.258106] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6242.258167] [<ffffffffa0e906f0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 6242.258185] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6242.258201] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.258216] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6242.258226] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.258239] INFO: task ll_ost_io01_003:73306 blocked for more than 120 seconds.
[ 6242.258241] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6242.258242] ll_ost_io01_003 D 0000000000000000 0 73306 2 0x00000000
[ 6242.258247] ffff881fad7af878 0000000000000046 ffff880fcfb5ce70 ffff881fad7affd8
[ 6242.258252] ffff881fad7affd8 ffff881fad7affd8 ffff880fcfb5ce70 ffff88102fb71978
[ 6242.258255] ffff88102fb71948 ffff88102fb71980 ffff88102fb71970 0000000000000000
[ 6242.258258] Call Trace:
[ 6242.258264] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6242.258279] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6242.258284] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[ 6242.258298] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6242.258335] [<ffffffffa176b453>] dmu_tx_wait+0xa3/0x360 [zfs]
[ 6242.258383] [<ffffffffa176b795>] dmu_tx_assign+0x85/0x410 [zfs]
[ 6242.258400] [<ffffffffa0873eba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[ 6242.258413] [<ffffffffa123206b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 6242.258425] [<ffffffffa123842b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 6242.258440] [<ffffffffa123c2ef>] ofd_commitrw+0x4af/0xa80 [ofd]
[ 6242.258503] [<ffffffffa0f151e2>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 6242.258563] [<ffffffffa0ee8a70>] tgt_brw_write+0x1010/0x1740 [ptlrpc]
[ 6242.258569] [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
[ 6242.258573] [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
[ 6242.258577] [<ffffffff810d065c>] ? enqueue_entity+0x26c/0xb60
[ 6242.258617] [<ffffffffa0e3aa70>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[ 6242.258674] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6242.258726] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6242.258775] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6242.258781] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6242.258786] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6242.258832] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6242.258880] [<ffffffffa0e906f0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 6242.258886] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6242.258897] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.258901] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6242.258905] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.258909] INFO: task ll_ost_io01_004:3775 blocked for more than 120 seconds.
[ 6242.258910] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6242.258912] ll_ost_io01_004 D 0000000000000000 0 3775 2 0x00000000
[ 6242.258915] ffff880f42347878 0000000000000046 ffff881fb308ce70 ffff880f42347fd8
[ 6242.258919] ffff880f42347fd8 ffff880f42347fd8 ffff881fb308ce70 ffff88102fb71978
[ 6242.258922] ffff88102fb71948 ffff88102fb71980 ffff88102fb71970 0000000000000000
[ 6242.258926] Call Trace:
[ 6242.258932] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6242.258952] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6242.258968] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[ 6242.258991] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6242.259039] [<ffffffffa176b453>] dmu_tx_wait+0xa3/0x360 [zfs]
[ 6242.259075] [<ffffffffa176b795>] dmu_tx_assign+0x85/0x410 [zfs]
[ 6242.259090] [<ffffffffa0873eba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[ 6242.259101] [<ffffffffa123206b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 6242.259112] [<ffffffffa123842b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 6242.259122] [<ffffffffa123c2ef>] ofd_commitrw+0x4af/0xa80 [ofd]
[ 6242.259183] [<ffffffffa0f151e2>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 6242.259243] [<ffffffffa0ee8a70>] tgt_brw_write+0x1010/0x1740 [ptlrpc]
[ 6242.259249] [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
[ 6242.259253] [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
[ 6242.259256] [<ffffffff810d065c>] ? enqueue_entity+0x26c/0xb60
[ 6242.259261] [<ffffffff811dbdd9>] ? ___slab_alloc+0x209/0x4f0
[ 6242.259301] [<ffffffffa0e3aa70>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[ 6242.259358] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6242.259408] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6242.259458] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6242.259465] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6242.259470] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6242.259516] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6242.259564] [<ffffffffa0e906f0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 6242.259571] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6242.259576] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.259579] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6242.259583] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.259588] INFO: task ll_ost_io01_006:23404 blocked for more than 120 seconds.
[ 6242.259589] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6242.259591] ll_ost_io01_006 D 0000000000000000 0 23404 2 0x00000000
[ 6242.259594] ffff880f408af878 0000000000000046 ffff8830371c0000 ffff880f408affd8
[ 6242.259597] ffff880f408affd8 ffff880f408affd8 ffff8830371c0000 ffff88102fb71978
[ 6242.259603] ffff88102fb71948 ffff88102fb71980 ffff88102fb71970 0000000000000000
[ 6242.259606] Call Trace:
[ 6242.259611] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6242.259624] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6242.259630] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[ 6242.259641] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6242.259678] [<ffffffffa176b453>] dmu_tx_wait+0xa3/0x360 [zfs]
[ 6242.259716] [<ffffffffa176b795>] dmu_tx_assign+0x85/0x410 [zfs]
[ 6242.259727] [<ffffffffa0873eba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[ 6242.259737] [<ffffffffa123206b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 6242.259749] [<ffffffffa123842b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 6242.259759] [<ffffffffa123c2ef>] ofd_commitrw+0x4af/0xa80 [ofd]
[ 6242.259821] [<ffffffffa0f151e2>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 6242.259881] [<ffffffffa0ee8a70>] tgt_brw_write+0x1010/0x1740 [ptlrpc]
[ 6242.259887] [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
[ 6242.259897] [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
[ 6242.259900] [<ffffffff810d065c>] ? enqueue_entity+0x26c/0xb60
[ 6242.259905] [<ffffffff811dbdd9>] ? ___slab_alloc+0x209/0x4f0
[ 6242.259945] [<ffffffffa0e3aa70>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[ 6242.260016] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6242.260076] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6242.260136] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6242.260153] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6242.260170] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6242.260226] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6242.260281] [<ffffffffa0e906f0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 6242.260288] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6242.260293] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.260296] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6242.260301] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.260304] INFO: task ll_ost_io01_008:99052 blocked for more than 120 seconds.
[ 6242.260306] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6242.260308] ll_ost_io01_008 D 0000000000000000 0 99052 2 0x00000000
[ 6242.260311] ffff88303631b878 0000000000000046 ffff882f67890000 ffff88303631bfd8
[ 6242.260314] ffff88303631bfd8 ffff88303631bfd8 ffff882f67890000 ffff88102fb71978
[ 6242.260326] ffff88102fb71948 ffff88102fb71980 ffff88102fb71970 0000000000000000
[ 6242.260341] Call Trace:
[ 6242.260347] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6242.260363] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6242.260369] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[ 6242.260380] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6242.260417] [<ffffffffa176b453>] dmu_tx_wait+0xa3/0x360 [zfs]
[ 6242.260452] [<ffffffffa176b795>] dmu_tx_assign+0x85/0x410 [zfs]
[ 6242.260463] [<ffffffffa0873eba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[ 6242.260473] [<ffffffffa123206b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 6242.260484] [<ffffffffa123842b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 6242.260495] [<ffffffffa123c2ef>] ofd_commitrw+0x4af/0xa80 [ofd]
[ 6242.260556] [<ffffffffa0f151e2>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 6242.260614] [<ffffffffa0ee8a70>] tgt_brw_write+0x1010/0x1740 [ptlrpc]
[ 6242.260620] [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
[ 6242.260624] [<ffffffff810caebe>] ? account_entity_dequeue+0xae/0xd0
[ 6242.260628] [<ffffffff810ce9ac>] ? dequeue_entity+0x11c/0x5d0
[ 6242.260667] [<ffffffffa0e3aa70>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[ 6242.260724] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6242.260775] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6242.260823] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6242.260829] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
[ 6242.260875] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6242.260924] [<ffffffffa0e906f0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 6242.260937] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6242.260942] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.260960] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6242.260976] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.260989] INFO: task ll_ost_io01_009:156435 blocked for more than 120 seconds.
[ 6242.260996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6242.261006] ll_ost_io01_009 D 0000000000000000 0 156435 2 0x00000000
[ 6242.261010] ffff882ff913f878 0000000000000046 ffff884035e79f60 ffff882ff913ffd8
[ 6242.261013] ffff882ff913ffd8 ffff882ff913ffd8 ffff884035e79f60 ffff88102fb71978
[ 6242.261017] ffff88102fb71948 ffff88102fb71980 ffff88102fb71970 0000000000000000
[ 6242.261020] Call Trace:
[ 6242.261029] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6242.261042] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6242.261049] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[ 6242.261059] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6242.261108] [<ffffffffa176b453>] dmu_tx_wait+0xa3/0x360 [zfs]
[ 6242.261149] [<ffffffffa176b795>] dmu_tx_assign+0x85/0x410 [zfs]
[ 6242.261158] [<ffffffffa0873eba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[ 6242.261168] [<ffffffffa123206b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 6242.261180] [<ffffffffa123842b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 6242.261190] [<ffffffffa123c2ef>] ofd_commitrw+0x4af/0xa80 [ofd]
[ 6242.261251] [<ffffffffa0f151e2>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 6242.261311] [<ffffffffa0ee8a70>] tgt_brw_write+0x1010/0x1740 [ptlrpc]
[ 6242.261318] [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
[ 6242.261321] [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
[ 6242.261325] [<ffffffff810d065c>] ? enqueue_entity+0x26c/0xb60
[ 6242.261364] [<ffffffffa0e3aa70>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[ 6242.261420] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6242.261470] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6242.261518] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6242.261525] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6242.261529] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6242.261575] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6242.261621] [<ffffffffa0e906f0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 6242.261628] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6242.261632] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.261635] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6242.261640] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.261644] INFO: task ll_ost_io01_010:156436 blocked for more than 120 seconds.
[ 6242.261645] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6242.261647] ll_ost_io01_010 D 0000000000000000 0 156436 2 0x00000000
[ 6242.261650] ffff883023f0f878 0000000000000046 ffff884035e7de20 ffff883023f0ffd8
[ 6242.261653] ffff883023f0ffd8 ffff883023f0ffd8 ffff884035e7de20 ffff88102fb71978
[ 6242.261657] ffff88102fb71948 ffff88102fb71980 ffff88102fb71970 0000000000000000
[ 6242.261662] Call Trace:
[ 6242.261667] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6242.261679] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6242.261687] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[ 6242.261697] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6242.261735] [<ffffffffa176b453>] dmu_tx_wait+0xa3/0x360 [zfs]
[ 6242.261772] [<ffffffffa176b795>] dmu_tx_assign+0x85/0x410 [zfs]
[ 6242.261782] [<ffffffffa0873eba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[ 6242.261792] [<ffffffffa123206b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 6242.261804] [<ffffffffa123842b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 6242.261814] [<ffffffffa123c2ef>] ofd_commitrw+0x4af/0xa80 [ofd]
[ 6242.261875] [<ffffffffa0f151e2>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 6242.261939] [<ffffffffa0ee8a70>] tgt_brw_write+0x1010/0x1740 [ptlrpc]
[ 6242.261949] [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
[ 6242.261970] [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
[ 6242.261983] [<ffffffff810d065c>] ? enqueue_entity+0x26c/0xb60
[ 6242.262035] [<ffffffffa0e3aa70>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[ 6242.262104] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6242.262166] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6242.262224] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6242.262231] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6242.262236] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6242.262282] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6242.262337] [<ffffffffa0e906f0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 6242.262343] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6242.262348] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.262351] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6242.262355] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.262360] INFO: task ll_ost_io01_012:156609 blocked for more than 120 seconds.
[ 6242.262362] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6242.262363] ll_ost_io01_012 D 0000000000000000 0 156609 2 0x00000000
[ 6242.262366] ffff881e40453878 0000000000000046 ffff881f7e435e20 ffff881e40453fd8
[ 6242.262370] ffff881e40453fd8 ffff881e40453fd8 ffff881f7e435e20 ffff88102fb71978
[ 6242.262373] ffff88102fb71948 ffff88102fb71980 ffff88102fb71970 0000000000000000
[ 6242.262378] Call Trace:
[ 6242.262383] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6242.262396] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6242.262403] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[ 6242.262413] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6242.262450] [<ffffffffa176b453>] dmu_tx_wait+0xa3/0x360 [zfs]
[ 6242.262486] [<ffffffffa176b795>] dmu_tx_assign+0x85/0x410 [zfs]
[ 6242.262497] [<ffffffffa0873eba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[ 6242.262507] [<ffffffffa123206b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 6242.262519] [<ffffffffa123842b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 6242.262529] [<ffffffffa123c2ef>] ofd_commitrw+0x4af/0xa80 [ofd]
[ 6242.262589] [<ffffffffa0f151e2>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 6242.262648] [<ffffffffa0ee8a70>] tgt_brw_write+0x1010/0x1740 [ptlrpc]
[ 6242.262654] [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
[ 6242.262657] [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
[ 6242.262661] [<ffffffff810d065c>] ? enqueue_entity+0x26c/0xb60
[ 6242.262700] [<ffffffffa0e3aa70>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[ 6242.262756] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6242.262807] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6242.262855] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6242.262861] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6242.262866] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6242.262914] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6242.262966] [<ffffffffa0e906f0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 6242.262983] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6242.263001] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.263018] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6242.263036] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.263047] INFO: task ll_ost_io01_017:171041 blocked for more than 120 seconds.
[ 6242.263056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6242.263057] ll_ost_io01_017 D 0000000000000000 0 171041 2 0x00000000
[ 6242.263061] ffff88077e5ff878 0000000000000046 ffff880fa0086dd0 ffff88077e5fffd8
[ 6242.263064] ffff88077e5fffd8 ffff88077e5fffd8 ffff880fa0086dd0 ffff88102fb71978
[ 6242.263069] ffff88102fb71948 ffff88102fb71980 ffff88102fb71970 0000000000000000
[ 6242.263072] Call Trace:
[ 6242.263078] [<ffffffff8168c8f9>] schedule+0x29/0x70
[ 6242.263091] [<ffffffffa027e36d>] cv_wait_common+0x10d/0x130 [spl]
[ 6242.263096] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
[ 6242.263107] [<ffffffffa027e3a5>] __cv_wait+0x15/0x20 [spl]
[ 6242.263144] [<ffffffffa176b453>] dmu_tx_wait+0xa3/0x360 [zfs]
[ 6242.263192] [<ffffffffa176b795>] dmu_tx_assign+0x85/0x410 [zfs]
[ 6242.263207] [<ffffffffa0873eba>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
[ 6242.263217] [<ffffffffa123206b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 6242.263234] [<ffffffffa123842b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 6242.263244] [<ffffffffa123c2ef>] ofd_commitrw+0x4af/0xa80 [ofd]
[ 6242.263306] [<ffffffffa0f151e2>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 6242.263365] [<ffffffffa0ee8a70>] tgt_brw_write+0x1010/0x1740 [ptlrpc]
[ 6242.263370] [<ffffffff810ce4c4>] ? update_curr+0x104/0x190
[ 6242.263374] [<ffffffff810c9bf8>] ? __enqueue_entity+0x78/0x80
[ 6242.263377] [<ffffffff810d065c>] ? enqueue_entity+0x26c/0xb60
[ 6242.263417] [<ffffffffa0e3aa70>] ? target_send_reply_msg+0x170/0x170 [ptlrpc]
[ 6242.263473] [<ffffffffa0ee3915>] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 6242.263523] [<ffffffffa0e8d1b3>] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 6242.263573] [<ffffffffa0e8a9a8>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 6242.263580] [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20
[ 6242.263584] [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90
[ 6242.263631] [<ffffffffa0e91190>] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 6242.263677] [<ffffffffa0e906f0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 6242.263684] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 6242.263688] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 6242.263691] [<ffffffff81697858>] ret_from_fork+0x58/0x90
[ 6242.263695] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Pool information:
# zpool get all
NAME PROPERTY VALUE SOURCE
mdtpool size 11.4T -
mdtpool capacity 2% -
mdtpool altroot - default
mdtpool health ONLINE -
mdtpool guid 5491873849748818681 -
mdtpool version - default
mdtpool bootfs - default
mdtpool delegation on default
mdtpool autoreplace on local
mdtpool cachefile - default
mdtpool failmode wait default
mdtpool listsnapshots off default
mdtpool autoexpand off default
mdtpool dedupditto 0 default
mdtpool dedupratio 1.00x -
mdtpool free 11.1T -
mdtpool allocated 302G -
mdtpool readonly off -
mdtpool ashift 12 local
mdtpool comment - default
mdtpool expandsize - -
mdtpool freeing 288K -
mdtpool fragmentation 4% -
mdtpool leaked 0 -
mdtpool multihost off default
mdtpool feature@async_destroy enabled local
mdtpool feature@empty_bpobj active local
mdtpool feature@lz4_compress active local
mdtpool feature@multi_vdev_crash_dump enabled local
mdtpool feature@spacemap_histogram active local
mdtpool feature@enabled_txg active local
mdtpool feature@hole_birth active local
mdtpool feature@extensible_dataset active local
mdtpool feature@embedded_data active local
mdtpool feature@bookmarks enabled local
mdtpool feature@filesystem_limits enabled local
mdtpool feature@large_blocks enabled local
mdtpool feature@large_dnode active local
mdtpool feature@sha512 enabled local
mdtpool feature@skein enabled local
mdtpool feature@edonr enabled local
mdtpool feature@userobj_accounting active local
Looks like it cut my pool information:
ZFS - `zfs get all`:
NAME PROPERTY VALUE SOURCE mdtpool type filesystem - mdtpool creation Thu Jul 27 11:26 2017 - mdtpool used 201G - mdtpool available 7.18T - mdtpool referenced 192K - mdtpool compressratio 1.00x - mdtpool mounted yes - mdtpool quota none default mdtpool reservation none default mdtpool recordsize 128K default mdtpool mountpoint /mdtpool default mdtpool sharenfs off default mdtpool checksum on default mdtpool compression off local mdtpool atime off local mdtpool devices on default mdtpool exec on default mdtpool setuid on default mdtpool readonly off default mdtpool zoned off default mdtpool snapdir hidden default mdtpool aclinherit restricted default mdtpool createtxg 1 - mdtpool canmount on default mdtpool xattr sa local mdtpool copies 1 default mdtpool version 5 - mdtpool utf8only off - mdtpool normalization none - mdtpool casesensitivity sensitive - mdtpool vscan off default mdtpool nbmand off default mdtpool sharesmb off default mdtpool refquota none default mdtpool refreservation none default mdtpool guid 10216788818702012875 - mdtpool primarycache all default mdtpool secondarycache all default mdtpool usedbysnapshots 0B - mdtpool usedbydataset 192K - mdtpool usedbychildren 201G - mdtpool usedbyrefreservation 0B - mdtpool logbias latency default mdtpool dedup off default mdtpool mlslabel none default mdtpool sync standard default mdtpool dnodesize auto local mdtpool refcompressratio 1.00x - mdtpool written 192K - mdtpool logicalused 200G - mdtpool logicalreferenced 40K - mdtpool volmode default default mdtpool filesystem_limit none default mdtpool snapshot_limit none default mdtpool filesystem_count none default mdtpool snapshot_count none default mdtpool snapdev hidden default mdtpool acltype off default mdtpool context none default mdtpool fscontext none default mdtpool defcontext none default mdtpool rootcontext none default mdtpool relatime off local mdtpool redundant_metadata all default mdtpool overlay off default mdtpool/mdt type filesystem - mdtpool/mdt creation Fri Aug 4 5:17 2017 - mdtpool/mdt used 689M - mdtpool/mdt available 7.18T - mdtpool/mdt referenced 689M - mdtpool/mdt compressratio 1.00x - mdtpool/mdt mounted no - mdtpool/mdt quota none default mdtpool/mdt reservation none default mdtpool/mdt recordsize 128K default mdtpool/mdt mountpoint /mdtpool/mdt default mdtpool/mdt sharenfs off default mdtpool/mdt checksum on default mdtpool/mdt compression off inherited from mdtpool mdtpool/mdt atime off inherited from mdtpool mdtpool/mdt devices on default mdtpool/mdt exec on default mdtpool/mdt setuid on default mdtpool/mdt readonly off default mdtpool/mdt zoned off default mdtpool/mdt snapdir hidden default mdtpool/mdt aclinherit restricted default mdtpool/mdt createtxg 264468 - mdtpool/mdt canmount off local mdtpool/mdt xattr sa local mdtpool/mdt copies 1 default mdtpool/mdt version 5 - mdtpool/mdt utf8only off - mdtpool/mdt normalization none - mdtpool/mdt casesensitivity sensitive - mdtpool/mdt vscan off default mdtpool/mdt nbmand off default mdtpool/mdt sharesmb off default mdtpool/mdt refquota none default mdtpool/mdt refreservation none default mdtpool/mdt guid 13254115274950875660 - mdtpool/mdt primarycache all default mdtpool/mdt secondarycache all default mdtpool/mdt usedbysnapshots 0B - mdtpool/mdt usedbydataset 689M - mdtpool/mdt usedbychildren 0B - mdtpool/mdt usedbyrefreservation 0B - mdtpool/mdt logbias latency default mdtpool/mdt dedup off default mdtpool/mdt mlslabel none default mdtpool/mdt sync standard default mdtpool/mdt dnodesize auto inherited from 
mdtpool mdtpool/mdt refcompressratio 1.00x - mdtpool/mdt written 689M - mdtpool/mdt logicalused 116M - mdtpool/mdt logicalreferenced 116M - mdtpool/mdt volmode default default mdtpool/mdt filesystem_limit none default mdtpool/mdt snapshot_limit none default mdtpool/mdt filesystem_count none default mdtpool/mdt snapshot_count none default mdtpool/mdt snapdev hidden default mdtpool/mdt acltype off default mdtpool/mdt context none default mdtpool/mdt fscontext none default mdtpool/mdt defcontext none default mdtpool/mdt rootcontext none default mdtpool/mdt relatime off inherited from mdtpool mdtpool/mdt redundant_metadata all default mdtpool/mdt overlay off default mdtpool/mdt lustre:svname test-MDT0000 local mdtpool/mdt lustre:flags 37 local mdtpool/mdt lustre:index 0 local mdtpool/mdt lustre:version 1 local mdtpool/mdt lustre:fsname test local mdtpool/ost type filesystem - mdtpool/ost creation Fri Aug 4 5:18 2017 - mdtpool/ost used 200G - mdtpool/ost available 7.18T - mdtpool/ost referenced 200G - mdtpool/ost compressratio 1.00x - mdtpool/ost mounted no - mdtpool/ost quota none default mdtpool/ost reservation none default mdtpool/ost recordsize 128K default mdtpool/ost mountpoint /mdtpool/ost default mdtpool/ost sharenfs off default mdtpool/ost checksum on default mdtpool/ost compression off inherited from mdtpool mdtpool/ost atime off inherited from mdtpool mdtpool/ost devices on default mdtpool/ost exec on default mdtpool/ost setuid on default mdtpool/ost readonly off default mdtpool/ost zoned off default mdtpool/ost snapdir hidden default mdtpool/ost aclinherit restricted default mdtpool/ost createtxg 264483 - mdtpool/ost canmount off local mdtpool/ost xattr sa local mdtpool/ost copies 1 default mdtpool/ost version 5 - mdtpool/ost utf8only off - mdtpool/ost normalization none - mdtpool/ost casesensitivity sensitive - mdtpool/ost vscan off default mdtpool/ost nbmand off default mdtpool/ost sharesmb off default mdtpool/ost refquota none default mdtpool/ost refreservation none default mdtpool/ost guid 13729889103785060247 - mdtpool/ost primarycache all default mdtpool/ost secondarycache all default mdtpool/ost usedbysnapshots 0B - mdtpool/ost usedbydataset 200G - mdtpool/ost usedbychildren 0B - mdtpool/ost usedbyrefreservation 0B - mdtpool/ost logbias latency default mdtpool/ost dedup off default mdtpool/ost mlslabel none default mdtpool/ost sync standard default mdtpool/ost dnodesize auto inherited from mdtpool mdtpool/ost refcompressratio 1.00x - mdtpool/ost written 200G - mdtpool/ost logicalused 200G - mdtpool/ost logicalreferenced 200G - mdtpool/ost volmode default default mdtpool/ost filesystem_limit none default mdtpool/ost snapshot_limit none default mdtpool/ost filesystem_count none default mdtpool/ost snapshot_count none default mdtpool/ost snapdev hidden default mdtpool/ost acltype off default mdtpool/ost context none default mdtpool/ost fscontext none default mdtpool/ost defcontext none default mdtpool/ost rootcontext none default mdtpool/ost relatime off inherited from mdtpool mdtpool/ost redundant_metadata all default mdtpool/ost overlay off default mdtpool/ost lustre:svname test-OST0000 local mdtpool/ost lustre:flags 34 local mdtpool/ost lustre:mgsnode 172.30.4.22@o2ib local mdtpool/ost lustre:index 0 local mdtpool/ost lustre:version 1 local mdtpool/ost lustre:fsname test local
Rebooted and got the same error with just the mdtest loop; I do not get the error with the 'ior' loops.
[ 2455.318431] VERIFY3(sa.sa_magic == 0x2F505A) failed (24 == 3100762)
[ 2455.325456] PANIC at zfs_vfsops.c:583:zfs_space_delta_cb()
[ 2455.331595] Showing stack for process 47626
[ 2455.331598] CPU: 3 PID: 47626 Comm: mdt01_002 Tainted: P OE ------------ 3.10.0-514.26.1.el7.x86_64 #1
[ 2455.331599] Hardware name: Dell Inc. PowerEdge C6320/082F9M, BIOS 2.4.2 01/09/2017
[ 2455.331600] ffffffffa182f43b 000000003b5e37a5 ffff880f772c7420 ffffffff81687233
[ 2455.331602] ffff880f772c7430 ffffffffa02e6234 ffff880f772c75b8 ffffffffa02e6309
[ 2455.331604] 000000030000000e ffff882f00000030 ffff880f772c75c8 ffff880f772c7568
[ 2455.331606] Call Trace:
[ 2455.331622] [ ] dump_stack+0x19/0x1b
[ 2455.331638] [ ] spl_dumpstack+0x44/0x50 [spl]
[ 2455.331642] [ ] spl_panic+0xc9/0x110 [spl]
[ 2455.331646] [ ] ? kmem_cache_open+0x4c0/0x4d0
[ 2455.331650] [ ] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[ 2455.331655] [ ] ? mutex_lock+0x12/0x2f
[ 2455.331679] [ ] ? dbuf_find+0x141/0x150 [zfs]
[ 2455.331686] [ ] ? spl_kmem_cache_alloc+0x99/0x150 [spl]
[ 2455.331687] [ ] ? mutex_lock+0x12/0x2f
[ 2455.331696] [ ] ? getrawmonotonic64+0x2e/0xc0
[ 2455.331715] [ ] ? dmu_zfetch+0x40d/0x4a0 [zfs]
[ 2455.331719] [ ] ? spl_kmem_cache_free+0x14d/0x1d0 [spl]
[ 2455.331758] [ ] zfs_space_delta_cb+0x9c/0x200 [zfs]
[ 2455.331771] [ ] dmu_objset_userquota_get_ids+0x13e/0x3e0 [zfs]
[ 2455.331785] [ ] dnode_setdirty+0x3e/0x120 [zfs]
[ 2455.331800] [ ] dnode_allocate+0x186/0x220 [zfs]
[ 2455.331812] [ ] dmu_object_alloc_dnsize+0x29e/0x360 [zfs]
[ 2455.331832] [ ] zap_create_flags_dnsize+0x42/0xc0 [zfs]
[ 2455.331840] [ ] __osd_zap_create+0x88/0x100 [osd_zfs]
[ 2455.331844] [ ] osd_mkdir+0x97/0x140 [osd_zfs]
[ 2455.331849] [ ] osd_create+0x2b2/0x9d0 [osd_zfs]
[ 2455.331853] [ ] ? spl_kmem_zalloc+0xc0/0x170 [spl]
[ 2455.331867] [ ] lod_sub_create+0x1f5/0x480 [lod]
[ 2455.331874] [ ] lod_create+0x69/0x2c0 [lod]
[ 2455.331884] [ ] mdd_create_object_internal+0xb5/0x280 [mdd]
[ 2455.331887] [ ] mdd_create_object+0x75/0xb80 [mdd]
[ 2455.331891] [ ] ? mdd_declare_create+0x578/0xe20 [mdd]
[ 2455.331895] [ ] mdd_create+0xd2e/0x1330 [mdd]
[ 2455.331909] [ ] mdt_create+0x846/0xbb0 [mdt]
[ 2455.331936] [ ] ? lprocfs_stats_lock+0x24/0xd0 [obdclass]
[ 2455.331947] [ ] ? lprocfs_stats_unlock+0x3d/0x50 [obdclass]
[ 2455.331955] [ ] mdt_reint_create+0x16b/0x350 [mdt]
[ 2455.331961] [ ] mdt_reint_rec+0x80/0x210 [mdt]
[ 2455.331967] [ ] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[ 2455.331973] [ ] mdt_reint+0x67/0x140 [mdt]
[ 2455.332040] [ ] tgt_request_handle+0x915/0x1360 [ptlrpc]
[ 2455.332071] [ ] ptlrpc_server_handle_request+0x233/0xa90 [ptlrpc]
[ 2455.332097] [ ] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 2455.332106] [ ] ? default_wake_function+0x12/0x20
[ 2455.332113] [ ] ? __wake_up_common+0x58/0x90
[ 2455.332135] [ ] ptlrpc_main+0xaa0/0x1dd0 [ptlrpc]
[ 2455.332157] [ ] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 2455.332166] [ ] kthread+0xcf/0xe0
[ 2455.332169] [ ] ? kthread_create_on_node+0x140/0x140
[ 2455.332172] [ ] ret_from_fork+0x58/0x90
[ 2455.332173] [ ] ? kthread_create_on_node+0x140/0x140
@YFLOPS You may be running into a different issue than me; although the panic is occurring in the same place, the contents of sa.sa_magic in my case are clearly UNIX timestamps, which points to a possible SA upgrade issue, but I don't think that's necessarily true for your case. You may have more luck getting help if you create a new issue for it, especially if you note that you're using 0.7.0, as all eyes are on the new version.
@behlendorf Not sure if you've seen this report yet (ignore the three posts by @YFLOPS, as that was covered in its own issue #6474). Tagging you in case you haven't, as you're the person who closed #1303 and #2025 with a note to create a new issue if it reappeared in recent versions. Given @avg-I's theory in #2025 (referenced above in my comments) about this being a race condition related to SA upgrades, if it is correct, then this bug has been around for years and is still present in the latest versions, as none of the relevant code seems to have changed since the version I'm running (0.6.5.9).
~Closing as duplicate of #7117.~
@behlendorf I'm not sure that this is a duplicate of #7117 as the value in sa.sa_magic that's causing the panic is clearly a timestamp, despite the panic otherwise looking similar. Have you or anyone checked whether @avg-I's theory in #2025 (a race condition relating to SA upgrades to fs version 5+) might be what's causing this? I've looked at the code myself and it seems to be a compelling explanation, but I don't have the experience with the codebase that you guys do.
It's my intuition that I could just disable the SA upgrading (by commenting out the zfs_sa_upgrade callback) and fix this problem in practice, but I would worry about that ruining my data now, or about future changes to ZFS expecting the upgrade to have been completed.
(I realize this ticket only became more confused as another user piggybacked on it with yet another non-timestamp-in-sa_magic panic that I believe to be unrelated.)
@jspiros I revisited the issue and it looks like I did jump the gun on this one. Despite the similarities this does appear to be a different issue; I'll reopen it. The theory in #2025 does look like a plausible cause.
We've started hitting this issue too, in just the past few days.
The pool is on a relatively new server, but the filesystems were zfs sent from the old server to the new, then upgraded to zfs version 5 (from 4) as part of getting everything up to date. Trying to rsync from the live filesystem seems to tickle the bug most quickly.
One thing we found that reduced the frequency of crashes was running zfs set atime=off on all filesystems. It hasn't crashed on rsync since I did that, but that was only this morning, and we won't know for sure until more time has passed.
If anyone is hitting this often, try zfs set atime=off. I'll be watching this bug closely for updates as this is affecting a production server.
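For anyone who wants to try that workaround, a minimal sketch (the pool name "tank" is assumed; atime is an inherited property, so setting it once at the top of the pool covers every child dataset that hasn't overridden it locally):
# Assumed pool name "tank"; child datasets inherit atime unless it is set locally.
zfs set atime=off tank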
EDIT: General info
CentOS Linux release 7.4.1708 (Core)
Linux buffalo 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 20:13:58 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
ZFS v0.7.6-1, ZFS pool version 5000, ZFS filesystem version 5
For what it's worth, I've been experiencing this all along with rsync as my primary test, though rm -r also causes it, and that's with atime=off.
@jspiros Well, that's disconcerting. We haven't crashed yet, though, and have been rsyncing nearly constantly since this morning. (I'm moving data to a ZFS v4 filesystem to try to get away from this bug, since it only seems to affect ZFS v5.)
Haven't tried any rm -r yet; there isn't a large directory tree that we can delete on the affected filesystems right now. Still, I'll report back if we crash again with atime=off.
@flynnjFIU Given the theory on why this is happening, datasets of hundreds of thousands of tiny files would exhibit the problem sooner than datasets of fewer, larger files, and my case is more the former. Indeed, I'd even considered writing a script that replicates rm -r but sleeps for a second between each individual rm, as that should make the likelihood of hitting the race condition almost nil (at least with atime=off; with atime=on it would still be much more likely). It sounds like you might be lucky so far. :)
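A rough sketch of that throttled-delete idea, assuming GNU find and a hypothetical target path; it removes one file per second so the updates from the unlinks are spread out, then cleans up the empty directories:
# Hypothetical path; delete files one per second, depth-first.
find /poolB/backups/old-tree -depth -type f -exec sh -c 'rm -- "$1" && sleep 1' _ {} \;
# Remove the now-empty directories afterwards (GNU find).
find /poolB/backups/old-tree -depth -type d -empty -delete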
@jspiros In the interest of being thorough, do you also have relatime=off? relatime=on could also cause atime updates on access, though they wouldn't be as frequent as atime=on.
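A quick way to check both properties across a pool (pool name assumed):
# Shows atime/relatime for every filesystem and where each value is inherited from.
zfs get -r -t filesystem atime,relatime tank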
Given the timestamp erroneously appearing in that field, it seems very likely that this bug is a race condition in which a timestamp update ends up in the wrong memory location.
Welp, the problem just recurred. So although atime=off seems to reduce the likelihood of it, it still happens.
I shall continue migrating this filesystem to ZFS v4, along with the others on the host.
@flynnjFIU In the interest of completeness, yes, I have relatime=off as well, and the problem still occurs.
I have a similar problem on Ubuntu 18.04 bionic with zfsonlinux from standard package sources.
Importing the pool works fine, but mounting the filesystems causes:
VERIFY3(sa.sa_magic == 0x2f505a) failed (3100698 == 3100762)
PANIC at zfs_vfsops:583:zfs_space_delta_cb()
The machine still does "something", but is completely unresponsive; it has to be reset and doesn't shut down cleanly. I disabled the auto-import of the pool, rebooted, then ran zpool import -aN and mounted every filesystem manually with zfs mount. This way I figured out which FS was causing the error. I destroyed that FS with zfs destroy and restored it from backup; that got rid of the issue for now.
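Roughly what I did, sketched out (the pool name here is made up):
# Import all pools without mounting any filesystems.
zpool import -aN
# Mount each filesystem one at a time; the mount that triggers the
# VERIFY3 panic identifies the dataset to destroy and restore from backup.
for fs in $(zfs list -H -o name -t filesystem -r tank); do
    echo "mounting $fs"
    zfs mount "$fs"
done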
This is a bummer; how can some bad data effectively kill the entire machine?
Similar here, with 0.8.2 on NixOS, apparently triggered when mounting a FS while other operations are running.
VERIFY3(sa.sa_magic == SA_MAGIC) failed (1379509391 == 3100762)
PANIC at zfs_vfsops.c:585:zfs_space_delta_cb()
Showing stack for process 3458
CPU: 13 PID: 3458 Comm: dp_sync_taskq Tainted: P O 4.19.84 #1-NixOS
Hardware name: IBM System x3650 M3 -[7945H2G]-/69Y4438, BIOS -[D6E157AUS-1.15]- 06/13/2012
Call Trace:
dump_stack+0x66/0x90
spl_panic+0xd3/0xfb [spl]
? dbuf_rele_and_unlock+0x300/0x620 [zfs]
? dbuf_sync_leaf+0x3b8/0x620 [zfs]
? cpumask_next_and+0x19/0x20
? load_balance+0x1a9/0x8d0
zfs_space_delta_cb+0x15c/0x240 [zfs]
dmu_objset_userquota_get_ids+0x19f/0x440 [zfs]
dnode_sync+0x125/0x910 [zfs]
sync_dnodes_task+0x47/0x120 [zfs]
taskq_thread+0x2ca/0x490 [spl]
? wake_up_q+0x70/0x70
? task_done+0x90/0x90 [spl]
kthread+0x112/0x130
? kthread_bind+0x30/0x30
ret_from_fork+0x35/0x40
But after reboot everything seems fine again.
I've just started noticing this problem with Ubuntu 20.04 on a Raspberry Pi 4, after a Linux dist-upgrade that I initially suspected of triggering the issue. From what I'm reading here, though, it may have started earlier this year (not sure exactly when), when I upgraded my ZFS filesystems to version 5 (I think). I don't have any log files from before then (long story), so I can't tell whether I was ever getting panics before that.
I guess there's no way to switch back from 5 to 4, is there?
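As far as I know there's no supported way to downgrade an existing dataset, but it's at least easy to see which version each dataset is at (pool name assumed):
# Lists the ZPL filesystem version (e.g. 4 or 5) for every dataset in the pool.
zfs get -r -t filesystem version tank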
Symptom:
When running rsync or tree (I haven't tried much else) over much of the filesystem, it will pretty routinely throw a panic report into syslog/kern.log. The rsync/tree operation later hangs solidly, such that even ^C doesn't break back to the shell, and any further file operations will likely hang and lock up their shells as well. It often requires a power cycle, rather than just shutdown -r, to reboot.
syslog/kern.log:
Feb 2 16:52:43 donny kernel: [ 517.355790] VERIFY3(sa.sa_magic == SA_MAGIC) failed (value == 3100762) <- value varies e.g. 1612270437, 1612270437, 1612283815
Feb 2 16:52:43 donny kernel: [ 517.362904] PANIC at zfs_vfsops.c:585:zfs_space_delta_cb()
Feb 2 16:52:43 donny kernel: [ 517.368512] Showing stack for process pid <- pid varies
Feb 2 16:52:43 donny kernel: [ 517.368522] CPU: 1 PID: pid Comm: dp_sync_taskq Tainted: P C OE 5.4.0-1028-raspi #31-Ubuntu <- pid as above
Feb 2 16:52:43 donny kernel: [ 517.368525] Hardware name: Raspberry Pi 4 Model B Rev 1.4 (DT)
Feb 2 16:52:43 donny kernel: [ 517.368527] Call trace:
Feb 2 16:52:43 donny kernel: [ 517.368540] dump_backtrace+0x0/0x198
Feb 2 16:52:43 donny kernel: [ 517.368544] show_stack+0x28/0x38
Feb 2 16:52:43 donny kernel: [ 517.368549] dump_stack+0xd8/0x134
Feb 2 16:52:43 donny kernel: [ 517.368583] spl_dumpstack+0x2c/0x34 [spl]
Feb 2 16:52:43 donny kernel: [ 517.368606] spl_panic+0xe0/0xf8 [spl]
Feb 2 16:52:43 donny kernel: [ 517.368837] zfs_space_delta_cb+0x180/0x278 [zfs]
Feb 2 16:52:43 donny kernel: [ 517.368968] dmu_objset_userquota_get_ids+0x15c/0x3b8 [zfs]
Feb 2 16:52:43 donny kernel: [ 517.369094] dnode_sync+0x11c/0x568 [zfs]
Feb 2 16:52:43 donny kernel: [ 517.369216] dmu_objset_sync_dnodes+0x48/0xf0 [zfs]
Feb 2 16:52:43 donny kernel: [ 517.369333] sync_dnodes_task+0x34/0x58 [zfs]
Feb 2 16:52:43 donny kernel: [ 517.369353] taskq_thread+0x20c/0x3b0 [spl]
Feb 2 16:52:43 donny kernel: [ 517.369363] kthread+0x150/0x170
Feb 2 16:52:43 donny kernel: [ 517.369368] ret_from_fork+0x10/0x18
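For what it's worth, those varying values decode to Unix timestamps on the same day as the panic, which is consistent with the race-condition theory above (GNU date assumed):
# Interpret the value reported by VERIFY3 as seconds since the Unix epoch.
date -u -d @1612270437
# Tue Feb  2 12:53:57 UTC 2021 -- the same day as the panic logged above.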
I've filed a bug report with Ubuntu Launchpad.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
This problem still exists and it's being reported from time to time in various forums.
NOTE: I have some thoughts about what may be causing this problem, and I'm not sure that my logs in this first post are immediately useful. You may want to skip down to my updates after this first post for more context and thoughts on what might be happening.
System information
Describe the problem you're observing
I have experienced two panics in a short period of time, seemingly related to rsync and/or send/recv. I was running with two pools, poolA and poolB, and I was in the process of moving everything from poolA to poolB using send/recv so that I could run with poolB only. One of the filesystem hierarchies being moved is the destination of some rsync cronjobs used for backup purposes. While I was send/recv'ing that hierarchy (using zfs send -vR poolA/backups@snap | zfs receive -uv poolB/backups), before it completed, one of the cronjobs started to run (with poolA as the effective destination). With both the send/recv and the rsync running at the same time, I got a panic which brought everything related to ZFS or any ZFS filesystems to a halt, necessitating a reboot. This is the panic from my kern.log:
Before I rebooted, I confirmed that both the rsync and the zfs commands were still running. I was logging the output of the zfs commands, and it did log the final "receiving incremental stream" line relating to the last snapshot of the last filesystem that was to be sent, per the initial estimate when the commands first started. But, the processes did not exit.
After reboot, it looked like the final snapshot of the final filesystem was indeed on poolB, though zpool history for poolA or poolB did not show the send/recv commands.
Eventually I did the final move by creating new snapshots with everything unmounted, then did an incremental update with zfs send -vRI @snap poolA/backups@final-snap | zfs receive -uFv poolB/backups. After this completed, I again confirmed (using zfs list) that everything was showing up on the poolB side, before exporting poolA, reconfiguring mounts, and rebooting the system to run only with poolB.
After the system was running with poolB only for a while, one of the rsync backup cronjobs ran, with a destination in the same hierarchy of filesystems that I ran into the first panic with (but this time with poolB as the actual pool involved), and I got another panic. This time, there was no send/recv taking place at the same time.
There are other processes appearing in this log, like dropbox, youtube-dl, etc., and those are all processes that were interacting with other ZFS filesystems at the time. But, those same processes have been running with no issue otherwise, so I suspect it's related to rsync and the filesystems I had issues with during the send/recv process.
This is a production system. Since the backups aren't as essential as the other processes that depend on other filesystems on this system, and since I hoped that the issue was with something in the filesystems that were being send/recv'd during the first panic, I decided to try to avoid further panics by simply unmounting the entire suspect filesystem hierarchy (/backups) and disabling the cronjobs that expect it to be mounted. The system has now been running for another two days (after a reboot following the last panic) without any further panics.
I still have poolA, attached to a different system, and all of the data that was sent to poolB is still on it, in case that makes it possible to run tests without taking my production system down repeatedly.
This seems possibly related to #3968, #1303, and more. Any ideas?