openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.57k stars 1.74k forks source link

stuck on starting zfs service after box locked up (task sync:2202 blocked for more than 120 seconds.) #2126

Closed dnozay closed 8 years ago

dnozay commented 10 years ago

earlier today:

Feb 12 14:37:00 jenkins kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Feb 13 07:56:59 jenkins kernel: INFO: task txg_sync:974 blocked for more than 120 seconds.
Feb 13 07:56:59 jenkins kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 13 07:56:59 jenkins kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 07:56:59 jenkins kernel: txg_sync      D 0000000000000001     0   974      2 0x00000000
Feb 13 07:56:59 jenkins kernel: ffff88013b3e3b60 0000000000000046 0000000000000000 ffff88013b3e3b24
Feb 13 07:56:59 jenkins kernel: 0000000000000000 ffff88013fc23280 ffff880028216840 000000000000739d
Feb 13 07:56:59 jenkins kernel: ffff88013b3d5098 ffff88013b3e3fd8 000000000000fbc8 ffff88013b3d5098
Feb 13 07:56:59 jenkins kernel: Call Trace:
Feb 13 07:56:59 jenkins kernel: [<ffffffff815280b3>] io_schedule+0x73/0xc0
Feb 13 07:56:59 jenkins kernel: [<ffffffffa01a741c>] cv_wait_common+0xac/0x1c0 [spl]
Feb 13 07:56:59 jenkins kernel: [<ffffffffa02ffa00>] ? zio_execute+0x0/0x140 [zfs]
Feb 13 07:56:59 jenkins kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 13 07:56:59 jenkins kernel: [<ffffffffa01a7548>] __cv_wait_io+0x18/0x20 [spl]
Feb 13 07:56:59 jenkins kernel: [<ffffffffa02ffc3b>] zio_wait+0xfb/0x1b0 [zfs]
Feb 13 07:56:59 jenkins kernel: [<ffffffffa0293034>] dsl_pool_sync+0xf4/0x540 [zfs]
Feb 13 07:56:59 jenkins kernel: [<ffffffffa02a77ae>] spa_sync+0x40e/0xa80 [zfs]
Feb 13 07:56:59 jenkins kernel: [<ffffffff810149b9>] ? read_tsc+0x9/0x20
Feb 13 07:56:59 jenkins kernel: [<ffffffffa02ba707>] txg_sync_thread+0x307/0x590 [zfs]
Feb 13 07:56:59 jenkins kernel: [<ffffffff81059329>] ? set_user_nice+0xc9/0x130
Feb 13 07:56:59 jenkins kernel: [<ffffffffa02ba400>] ? txg_sync_thread+0x0/0x590 [zfs]
Feb 13 07:56:59 jenkins kernel: [<ffffffffa019f478>] thread_generic_wrapper+0x68/0x80 [spl]
Feb 13 07:56:59 jenkins kernel: [<ffffffffa019f410>] ? thread_generic_wrapper+0x0/0x80 [spl]
Feb 13 07:56:59 jenkins kernel: [<ffffffff8109af06>] kthread+0x96/0xa0
Feb 13 07:56:59 jenkins kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 13 07:56:59 jenkins kernel: [<ffffffff8109ae70>] ? kthread+0x0/0xa0
Feb 13 07:56:59 jenkins kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 13 07:56:59 jenkins kernel: INFO: task java:16888 blocked for more than 120 seconds.
Feb 13 07:56:59 jenkins kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 13 07:56:59 jenkins kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 07:56:59 jenkins kernel: java          D 0000000000000002     0 16888      1 0x00000080
Feb 13 07:56:59 jenkins kernel: ffff8800139e1b28 0000000000000082 0000000000000000 ffff880100000003
Feb 13 07:56:59 jenkins kernel: 0000000300000044 0000000000000001 ffff8800139e1bc8 0000000000000086
Feb 13 07:56:59 jenkins kernel: ffff88012f4e7058 ffff8800139e1fd8 000000000000fbc8 ffff88012f4e7058
Feb 13 07:56:59 jenkins kernel: Call Trace:
Feb 13 07:56:59 jenkins kernel: [<ffffffffa01a7475>] cv_wait_common+0x105/0x1c0 [spl]
Feb 13 07:56:59 jenkins kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 13 07:56:59 jenkins kernel: [<ffffffffa01a7585>] __cv_wait+0x15/0x20 [spl]
Feb 13 07:56:59 jenkins kernel: [<ffffffffa02b9dbb>] txg_wait_open+0x8b/0x110 [zfs]
Feb 13 07:56:59 jenkins kernel: [<ffffffffa027a7cd>] dmu_tx_wait+0xed/0xf0 [zfs]
Feb 13 07:56:59 jenkins kernel: [<ffffffffa02f384e>] zfs_write+0x3be/0xca0 [zfs]
Feb 13 07:56:59 jenkins kernel: [<ffffffff81068ff5>] ? enqueue_entity+0x125/0x450
Feb 13 07:56:59 jenkins kernel: [<ffffffffa03072c2>] zpl_write_common+0x52/0x70 [zfs]
Feb 13 07:56:59 jenkins kernel: [<ffffffffa0307348>] zpl_write+0x68/0xa0 [zfs]
Feb 13 07:56:59 jenkins kernel: [<ffffffff812263d6>] ? security_file_permission+0x16/0x20
Feb 13 07:56:59 jenkins kernel: [<ffffffff81188f88>] vfs_write+0xb8/0x1a0
Feb 13 07:56:59 jenkins kernel: [<ffffffff81189881>] sys_write+0x51/0x90
Feb 13 07:56:59 jenkins kernel: [<ffffffff810e1e6e>] ? __audit_syscall_exit+0x25e/0x290
Feb 13 07:56:59 jenkins kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Feb 13 07:58:59 jenkins kernel: INFO: task txg_sync:974 blocked for more than 120 seconds.
Feb 13 07:58:59 jenkins kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 13 07:58:59 jenkins kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 07:58:59 jenkins kernel: txg_sync      D 0000000000000001     0   974      2 0x00000000
Feb 13 07:58:59 jenkins kernel: ffff88013b3e3b60 0000000000000046 0000000000000000 ffff88013b3e3b24
Feb 13 07:58:59 jenkins kernel: 0000000000000000 ffff88013fc23280 ffff880028216840 000000000000739d
Feb 13 07:58:59 jenkins kernel: ffff88013b3d5098 ffff88013b3e3fd8 000000000000fbc8 ffff88013b3d5098
Feb 13 07:58:59 jenkins kernel: Call Trace:
Feb 13 07:58:59 jenkins kernel: [<ffffffff815280b3>] io_schedule+0x73/0xc0
Feb 13 07:58:59 jenkins kernel: [<ffffffffa01a741c>] cv_wait_common+0xac/0x1c0 [spl]
Feb 13 07:58:59 jenkins kernel: [<ffffffffa02ffa00>] ? zio_execute+0x0/0x140 [zfs]
Feb 13 07:58:59 jenkins kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 13 07:58:59 jenkins kernel: [<ffffffffa01a7548>] __cv_wait_io+0x18/0x20 [spl]
Feb 13 07:58:59 jenkins kernel: [<ffffffffa02ffc3b>] zio_wait+0xfb/0x1b0 [zfs]
Feb 13 07:58:59 jenkins kernel: [<ffffffffa0293034>] dsl_pool_sync+0xf4/0x540 [zfs]
Feb 13 07:58:59 jenkins kernel: [<ffffffffa02a77ae>] spa_sync+0x40e/0xa80 [zfs]
Feb 13 07:58:59 jenkins kernel: [<ffffffff810149b9>] ? read_tsc+0x9/0x20
Feb 13 07:58:59 jenkins kernel: [<ffffffffa02ba707>] txg_sync_thread+0x307/0x590 [zfs]
Feb 13 07:58:59 jenkins kernel: [<ffffffff81059329>] ? set_user_nice+0xc9/0x130
Feb 13 07:58:59 jenkins kernel: [<ffffffffa02ba400>] ? txg_sync_thread+0x0/0x590 [zfs]
Feb 13 07:58:59 jenkins kernel: [<ffffffffa019f478>] thread_generic_wrapper+0x68/0x80 [spl]
Feb 13 07:58:59 jenkins kernel: [<ffffffffa019f410>] ? thread_generic_wrapper+0x0/0x80 [spl]
Feb 13 07:58:59 jenkins kernel: [<ffffffff8109af06>] kthread+0x96/0xa0
Feb 13 07:58:59 jenkins kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 13 07:58:59 jenkins kernel: [<ffffffff8109ae70>] ? kthread+0x0/0xa0
Feb 13 07:58:59 jenkins kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 13 07:58:59 jenkins kernel: INFO: task java:16888 blocked for more than 120 seconds.
Feb 13 07:58:59 jenkins kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 13 07:58:59 jenkins kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 07:58:59 jenkins kernel: java          D 0000000000000002     0 16888      1 0x00000080
Feb 13 07:58:59 jenkins kernel: ffff8800139e1b28 0000000000000082 0000000000000000 ffff880100000003
Feb 13 07:58:59 jenkins kernel: 0000000300000044 0000000000000001 ffff8800139e1bc8 0000000000000086
Feb 13 07:58:59 jenkins kernel: ffff88012f4e7058 ffff8800139e1fd8 000000000000fbc8 ffff88012f4e7058
Feb 13 07:58:59 jenkins kernel: Call Trace:
Feb 13 07:58:59 jenkins kernel: [<ffffffffa01a7475>] cv_wait_common+0x105/0x1c0 [spl]
Feb 13 07:58:59 jenkins kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 13 07:58:59 jenkins kernel: [<ffffffffa01a7585>] __cv_wait+0x15/0x20 [spl]
Feb 13 07:58:59 jenkins kernel: [<ffffffffa02b9dbb>] txg_wait_open+0x8b/0x110 [zfs]
Feb 13 07:58:59 jenkins kernel: [<ffffffffa027a7cd>] dmu_tx_wait+0xed/0xf0 [zfs]
Feb 13 07:58:59 jenkins kernel: [<ffffffffa02f384e>] zfs_write+0x3be/0xca0 [zfs]
Feb 13 07:58:59 jenkins kernel: [<ffffffff81068ff5>] ? enqueue_entity+0x125/0x450
Feb 13 07:58:59 jenkins kernel: [<ffffffffa03072c2>] zpl_write_common+0x52/0x70 [zfs]
Feb 13 07:58:59 jenkins kernel: [<ffffffffa0307348>] zpl_write+0x68/0xa0 [zfs]
Feb 13 07:58:59 jenkins kernel: [<ffffffff812263d6>] ? security_file_permission+0x16/0x20
Feb 13 07:58:59 jenkins kernel: [<ffffffff81188f88>] vfs_write+0xb8/0x1a0
Feb 13 07:58:59 jenkins kernel: [<ffffffff81189881>] sys_write+0x51/0x90
Feb 13 07:58:59 jenkins kernel: [<ffffffff810e1e6e>] ? __audit_syscall_exit+0x25e/0x290
Feb 13 07:58:59 jenkins kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Feb 13 08:00:59 jenkins kernel: INFO: task txg_sync:974 blocked for more than 120 seconds.
Feb 13 08:00:59 jenkins kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 13 08:00:59 jenkins kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 08:00:59 jenkins kernel: txg_sync      D 0000000000000001     0   974      2 0x00000000
Feb 13 08:00:59 jenkins kernel: ffff88013b3e3b60 0000000000000046 0000000000000000 ffff88013b3e3b24
Feb 13 08:00:59 jenkins kernel: 0000000000000000 ffff88013fc23280 ffff880028216840 000000000000739d
Feb 13 08:00:59 jenkins kernel: ffff88013b3d5098 ffff88013b3e3fd8 000000000000fbc8 ffff88013b3d5098
Feb 13 08:00:59 jenkins kernel: Call Trace:
Feb 13 08:00:59 jenkins kernel: [<ffffffff815280b3>] io_schedule+0x73/0xc0
Feb 13 08:00:59 jenkins kernel: [<ffffffffa01a741c>] cv_wait_common+0xac/0x1c0 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02ffa00>] ? zio_execute+0x0/0x140 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 13 08:00:59 jenkins kernel: [<ffffffffa01a7548>] __cv_wait_io+0x18/0x20 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02ffc3b>] zio_wait+0xfb/0x1b0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa0293034>] dsl_pool_sync+0xf4/0x540 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02a77ae>] spa_sync+0x40e/0xa80 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffff810149b9>] ? read_tsc+0x9/0x20
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02ba707>] txg_sync_thread+0x307/0x590 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffff81059329>] ? set_user_nice+0xc9/0x130
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02ba400>] ? txg_sync_thread+0x0/0x590 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa019f478>] thread_generic_wrapper+0x68/0x80 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa019f410>] ? thread_generic_wrapper+0x0/0x80 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffff8109af06>] kthread+0x96/0xa0
Feb 13 08:00:59 jenkins kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 13 08:00:59 jenkins kernel: [<ffffffff8109ae70>] ? kthread+0x0/0xa0
Feb 13 08:00:59 jenkins kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 13 08:00:59 jenkins kernel: INFO: task java:16888 blocked for more than 120 seconds.
Feb 13 08:00:59 jenkins kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 13 08:00:59 jenkins kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 08:00:59 jenkins kernel: java          D 0000000000000002     0 16888      1 0x00000080
Feb 13 08:00:59 jenkins kernel: ffff8800139e1b28 0000000000000082 0000000000000000 ffff880100000003
Feb 13 08:00:59 jenkins kernel: 0000000300000044 0000000000000001 ffff8800139e1bc8 0000000000000086
Feb 13 08:00:59 jenkins kernel: ffff88012f4e7058 ffff8800139e1fd8 000000000000fbc8 ffff88012f4e7058
Feb 13 08:00:59 jenkins kernel: Call Trace:
Feb 13 08:00:59 jenkins kernel: [<ffffffffa01a7475>] cv_wait_common+0x105/0x1c0 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 13 08:00:59 jenkins kernel: [<ffffffffa01a7585>] __cv_wait+0x15/0x20 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02b9dbb>] txg_wait_open+0x8b/0x110 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa027a7cd>] dmu_tx_wait+0xed/0xf0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02f384e>] zfs_write+0x3be/0xca0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffff81068ff5>] ? enqueue_entity+0x125/0x450
Feb 13 08:00:59 jenkins kernel: [<ffffffffa03072c2>] zpl_write_common+0x52/0x70 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa0307348>] zpl_write+0x68/0xa0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffff812263d6>] ? security_file_permission+0x16/0x20
Feb 13 08:00:59 jenkins kernel: [<ffffffff81188f88>] vfs_write+0xb8/0x1a0
Feb 13 08:00:59 jenkins kernel: [<ffffffff81189881>] sys_write+0x51/0x90
Feb 13 08:00:59 jenkins kernel: [<ffffffff810e1e6e>] ? __audit_syscall_exit+0x25e/0x290
Feb 13 08:00:59 jenkins kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Feb 13 08:00:59 jenkins kernel: INFO: task java:17258 blocked for more than 120 seconds.
Feb 13 08:00:59 jenkins kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 13 08:00:59 jenkins kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 08:00:59 jenkins kernel: java          D 0000000000000003     0 17258      1 0x00000080
Feb 13 08:00:59 jenkins kernel: ffff88007ba09618 0000000000000082 0000000000000000 ffff880000000002
Feb 13 08:00:59 jenkins kernel: 0000000200000044 0000000000000001 ffff88007ba096b8 0000000000000082
Feb 13 08:00:59 jenkins kernel: ffff880138980638 ffff88007ba09fd8 000000000000fbc8 ffff880138980638
Feb 13 08:00:59 jenkins kernel: Call Trace:
Feb 13 08:00:59 jenkins kernel: [<ffffffffa01a7475>] cv_wait_common+0x105/0x1c0 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 13 08:00:59 jenkins kernel: [<ffffffffa01a7585>] __cv_wait+0x15/0x20 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02b9dbb>] txg_wait_open+0x8b/0x110 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa027a7cd>] dmu_tx_wait+0xed/0xf0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa027a8ae>] dmu_tx_assign+0x8e/0x4e0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa026d786>] dmu_free_long_range_impl+0x1a6/0x260 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa026d88b>] dmu_free_long_range+0x4b/0x70 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02f617b>] zfs_trunc+0x6b/0x1f0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02a5123>] ? sa_lookup_impl+0x13/0x20 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02a51ed>] ? sa_lookup+0x4d/0x60 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02f7c2b>] zfs_freesp+0x33b/0x3b0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffff81149462>] ? unmap_mapping_range+0x72/0x140
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02ef7a6>] zfs_setattr+0xd06/0x17a0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffff8122f874>] ? inode_has_perm+0x54/0xa0
Feb 13 08:00:59 jenkins kernel: [<ffffffff8116fc7c>] ? __kmalloc+0x20c/0x220
Feb 13 08:00:59 jenkins kernel: [<ffffffffa019df8f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffff8122fe4b>] ? dentry_has_perm+0x5b/0x80
Feb 13 08:00:59 jenkins kernel: [<ffffffffa030791c>] zpl_setattr+0xdc/0x110 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffff811a7348>] notify_change+0x168/0x340
Feb 13 08:00:59 jenkins kernel: [<ffffffff81187074>] do_truncate+0x64/0xa0
Feb 13 08:00:59 jenkins kernel: [<ffffffff8119bcc1>] do_filp_open+0x861/0xd20
Feb 13 08:00:59 jenkins kernel: [<ffffffff811a8222>] ? alloc_fd+0x92/0x160
Feb 13 08:00:59 jenkins kernel: [<ffffffff81185d39>] do_sys_open+0x69/0x140
Feb 13 08:00:59 jenkins kernel: [<ffffffff81185e50>] sys_open+0x20/0x30
Feb 13 08:00:59 jenkins kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Feb 13 08:00:59 jenkins kernel: INFO: task java:17271 blocked for more than 120 seconds.
Feb 13 08:00:59 jenkins kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 13 08:00:59 jenkins kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 08:00:59 jenkins kernel: java          D 0000000000000003     0 17271      1 0x00000080
Feb 13 08:00:59 jenkins kernel: ffff880100c37618 0000000000000082 0000000000000000 ffff880000000002
Feb 13 08:00:59 jenkins kernel: 0000000200000044 0000000000000001 ffff880100c376b8 0000000000000082
Feb 13 08:00:59 jenkins kernel: ffff88011bb26638 ffff880100c37fd8 000000000000fbc8 ffff88011bb26638
Feb 13 08:00:59 jenkins kernel: Call Trace:
Feb 13 08:00:59 jenkins kernel: [<ffffffffa01a7475>] cv_wait_common+0x105/0x1c0 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 13 08:00:59 jenkins kernel: [<ffffffffa01a7585>] __cv_wait+0x15/0x20 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02b9dbb>] txg_wait_open+0x8b/0x110 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa027a7cd>] dmu_tx_wait+0xed/0xf0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa027a8ae>] dmu_tx_assign+0x8e/0x4e0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa026d786>] dmu_free_long_range_impl+0x1a6/0x260 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa026d88b>] dmu_free_long_range+0x4b/0x70 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02f617b>] zfs_trunc+0x6b/0x1f0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02a5123>] ? sa_lookup_impl+0x13/0x20 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02a51ed>] ? sa_lookup+0x4d/0x60 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02f7c2b>] zfs_freesp+0x33b/0x3b0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffff81149462>] ? unmap_mapping_range+0x72/0x140
Feb 13 08:00:59 jenkins kernel: [<ffffffffa02ef7a6>] zfs_setattr+0xd06/0x17a0 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffff8122f874>] ? inode_has_perm+0x54/0xa0
Feb 13 08:00:59 jenkins kernel: [<ffffffff8116fc7c>] ? __kmalloc+0x20c/0x220
Feb 13 08:00:59 jenkins kernel: [<ffffffffa019df8f>] ? kmem_alloc_debug+0x8f/0x4c0 [spl]
Feb 13 08:00:59 jenkins kernel: [<ffffffff8122fe4b>] ? dentry_has_perm+0x5b/0x80
Feb 13 08:00:59 jenkins kernel: [<ffffffffa030791c>] zpl_setattr+0xdc/0x110 [zfs]
Feb 13 08:00:59 jenkins kernel: [<ffffffff811a7348>] notify_change+0x168/0x340
Feb 13 08:00:59 jenkins kernel: [<ffffffff81187074>] do_truncate+0x64/0xa0
Feb 13 08:00:59 jenkins kernel: [<ffffffff8119bcc1>] do_filp_open+0x861/0xd20
Feb 13 08:00:59 jenkins kernel: [<ffffffff811a8222>] ? alloc_fd+0x92/0x160
Feb 13 08:00:59 jenkins kernel: [<ffffffff81185d39>] do_sys_open+0x69/0x140
Feb 13 08:00:59 jenkins kernel: [<ffffffff81185e50>] sys_open+0x20/0x30
Feb 13 08:00:59 jenkins kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Feb 13 08:02:59 jenkins kernel: INFO: task txg_sync:974 blocked for more than 120 seconds.
Feb 13 08:02:59 jenkins kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 13 08:02:59 jenkins kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 08:02:59 jenkins kernel: txg_sync      D 0000000000000001     0   974      2 0x00000000
Feb 13 08:02:59 jenkins kernel: ffff88013b3e3b60 0000000000000046 0000000000000000 ffff88013b3e3b24
Feb 13 08:02:59 jenkins kernel: 0000000000000000 ffff88013fc23280 ffff880028216840 000000000000739d
Feb 13 08:02:59 jenkins kernel: ffff88013b3d5098 ffff88013b3e3fd8 000000000000fbc8 ffff88013b3d5098
Feb 13 08:02:59 jenkins kernel: Call Trace:
Feb 13 08:02:59 jenkins kernel: [<ffffffff815280b3>] io_schedule+0x73/0xc0
Feb 13 08:02:59 jenkins kernel: [<ffffffffa01a741c>] cv_wait_common+0xac/0x1c0 [spl]
Feb 13 08:02:59 jenkins kernel: [<ffffffffa02ffa00>] ? zio_execute+0x0/0x140 [zfs]
Feb 13 08:02:59 jenkins kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 13 08:02:59 jenkins kernel: [<ffffffffa01a7548>] __cv_wait_io+0x18/0x20 [spl]
Feb 13 08:02:59 jenkins kernel: [<ffffffffa02ffc3b>] zio_wait+0xfb/0x1b0 [zfs]
Feb 13 08:02:59 jenkins kernel: [<ffffffffa0293034>] dsl_pool_sync+0xf4/0x540 [zfs]
Feb 13 08:02:59 jenkins kernel: [<ffffffffa02a77ae>] spa_sync+0x40e/0xa80 [zfs]
Feb 13 08:02:59 jenkins kernel: [<ffffffff810149b9>] ? read_tsc+0x9/0x20
Feb 13 08:02:59 jenkins kernel: [<ffffffffa02ba707>] txg_sync_thread+0x307/0x590 [zfs]
Feb 13 08:02:59 jenkins kernel: [<ffffffff81059329>] ? set_user_nice+0xc9/0x130
Feb 13 08:02:59 jenkins kernel: [<ffffffffa02ba400>] ? txg_sync_thread+0x0/0x590 [zfs]
Feb 13 08:02:59 jenkins kernel: [<ffffffffa019f478>] thread_generic_wrapper+0x68/0x80 [spl]
Feb 13 08:02:59 jenkins kernel: [<ffffffffa019f410>] ? thread_generic_wrapper+0x0/0x80 [spl]
Feb 13 08:02:59 jenkins kernel: [<ffffffff8109af06>] kthread+0x96/0xa0
Feb 13 08:02:59 jenkins kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 13 08:02:59 jenkins kernel: [<ffffffff8109ae70>] ? kthread+0x0/0xa0
Feb 13 08:02:59 jenkins kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Feb 13 08:02:59 jenkins kernel: INFO: task java:16888 blocked for more than 120 seconds.
Feb 13 08:02:59 jenkins kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 13 08:02:59 jenkins kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 08:02:59 jenkins kernel: java          D 0000000000000002     0 16888      1 0x00000080
Feb 13 08:02:59 jenkins kernel: ffff8800139e1b28 0000000000000082 0000000000000000 ffff880100000003
Feb 13 08:02:59 jenkins kernel: 0000000300000044 0000000000000001 ffff8800139e1bc8 0000000000000086
Feb 13 08:02:59 jenkins kernel: ffff88012f4e7058 ffff8800139e1fd8 000000000000fbc8 ffff88012f4e7058
Feb 13 08:02:59 jenkins kernel: Call Trace:
Feb 13 08:02:59 jenkins kernel: [<ffffffffa01a7475>] cv_wait_common+0x105/0x1c0 [spl]
Feb 13 08:02:59 jenkins kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 13 08:02:59 jenkins kernel: [<ffffffffa01a7585>] __cv_wait+0x15/0x20 [spl]
Feb 13 08:02:59 jenkins kernel: [<ffffffffa02b9dbb>] txg_wait_open+0x8b/0x110 [zfs]
Feb 13 08:02:59 jenkins kernel: [<ffffffffa027a7cd>] dmu_tx_wait+0xed/0xf0 [zfs]
Feb 13 08:02:59 jenkins kernel: [<ffffffffa02f384e>] zfs_write+0x3be/0xca0 [zfs]
Feb 13 08:02:59 jenkins kernel: [<ffffffff81068ff5>] ? enqueue_entity+0x125/0x450
Feb 13 08:02:59 jenkins kernel: [<ffffffffa03072c2>] zpl_write_common+0x52/0x70 [zfs]
Feb 13 08:02:59 jenkins kernel: [<ffffffffa0307348>] zpl_write+0x68/0xa0 [zfs]
Feb 13 08:02:59 jenkins kernel: [<ffffffff812263d6>] ? security_file_permission+0x16/0x20
Feb 13 08:02:59 jenkins kernel: [<ffffffff81188f88>] vfs_write+0xb8/0x1a0
Feb 13 08:02:59 jenkins kernel: [<ffffffff81189881>] sys_write+0x51/0x90
Feb 13 08:02:59 jenkins kernel: [<ffffffff810e1e6e>] ? __audit_syscall_exit+0x25e/0x290
Feb 13 08:02:59 jenkins kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Feb 13 16:18:45 jenkins kernel: imklog 5.8.10, log source = /proc/kmsg started.

yes, there's nothing in the messages between 08:02:59 and 16:18:45.

I am using a CentOS 6.5 with the EPEL packages. The machine is a VM with a second 150G disk (/dev/sdb). The machine has 4 vCPUs and 4G of memory. The machine is used as a jenkins server (java task that gets hung)

[root@jenkins ~]# uname -a
Linux jenkins.prod 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Here is the list of packages

Installed Packages
spl.x86_64                                                                                                                                 0.6.2-1.el6                                                                                                                                    @zfs                        
spl-dkms.noarch                                                                                                                            0.6.2-1.el6                                                                                                                                    @zfs                        
zfs.x86_64                                                                                                                                 0.6.2-1.el6                                                                                                                                    @zfs                        
zfs-dkms.noarch                                                                                                                            0.6.2-1.el6                                                                                                                                    @zfs                        
zfs-release.noarch                                                                                                                         1-2.el6                                                                                                                                        @/zfs-release-1-2.el6.noarch

The second disk as I mentioned is used for zfs and the OS should use the other filesystem for logs and other stuff.

[root@jenkins ~]# zfs list
NAME               USED  AVAIL  REFER  MOUNTPOINT
zfs-data          12.1G   139G    31K  /zfs-data
zfs-data/jenkins  12.1G   139G  12.1G  /zfs-data/jenkins
[root@jenkins ~]# zpool list
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zfs-data   149G  8.07G   141G     5%  1.50x  ONLINE  -

/zfs-data/jenkins is used as the home directory for the jenkins server, this will store the workspaces and builds for multiple concurrent jobs. Everything stopped working around 15:40; webserver, ssh sessions, everything was crawling.

After reboot, I also had some issues mounting the zfs volumes.

Feb 13 16:19:19 jenkins abrtd: Init complete, entering main loop
Feb 13 16:25:36 jenkins kernel: INFO: task sync:2202 blocked for more than 120 seconds.
Feb 13 16:25:36 jenkins kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 13 16:25:36 jenkins kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 16:25:36 jenkins kernel: sync          D 0000000000000000     0  2202   2195 0x00000080
Feb 13 16:25:36 jenkins kernel: ffff88011e1bfe30 0000000000000086 0000000000000000 ffff880028316840
Feb 13 16:25:36 jenkins kernel: ffff880028316840 ffff88011e1bfda8 ffffffff81030b59 ffff88011e1bfdb8
Feb 13 16:25:36 jenkins kernel: ffff88013c549af8 ffff88011e1bffd8 000000000000fbc8 ffff88013c549af8
Feb 13 16:25:36 jenkins kernel: Call Trace:
Feb 13 16:25:36 jenkins kernel: [<ffffffff81030b59>] ? native_smp_send_reschedule+0x49/0x60
Feb 13 16:25:36 jenkins kernel: [<ffffffff8152846c>] ? wait_for_common+0x14c/0x180
Feb 13 16:25:36 jenkins kernel: [<ffffffff81529f95>] rwsem_down_failed_common+0x95/0x1d0
Feb 13 16:25:36 jenkins kernel: [<ffffffff8152a126>] rwsem_down_read_failed+0x26/0x30
Feb 13 16:25:36 jenkins kernel: [<ffffffff8128e864>] call_rwsem_down_read_failed+0x14/0x30
Feb 13 16:25:36 jenkins kernel: [<ffffffff81529624>] ? down_read+0x24/0x30
Feb 13 16:25:36 jenkins kernel: [<ffffffff811bae06>] sync_filesystems+0xd6/0x130
Feb 13 16:25:36 jenkins kernel: [<ffffffff811baeb7>] sys_sync+0x17/0x40
Feb 13 16:25:36 jenkins kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Feb 13 16:33:06 jenkins kernel: RPC: Registered named UNIX socket transport module.
Feb 13 16:33:06 jenkins kernel: RPC: Registered udp transport module.
Feb 13 16:33:06 jenkins kernel: RPC: Registered tcp transport module.

here is the output of zdb and zpool status (after reboot)

[root@jenkins ~]# zdb
zfs-data:
    version: 5000
    name: 'zfs-data'
    state: 0
    txg: 4
    pool_guid: 17215845446950085039
    hostid: 168453376
    hostname: 'jenkins.[REDACTED]'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 17215845446950085039
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 10934433629414673041
            path: '/dev/sdb1'
            whole_disk: 1
            metaslab_array: 33
            metaslab_shift: 30
            ashift: 9
            asize: 161046069248
            is_log: 0
            create_txg: 4
    features_for_read:
[root@jenkins ~]# zpool status
  pool: zfs-data
 state: ONLINE
  scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    zfs-data    ONLINE       0     0     0
      sdb       ONLINE       0     0     0

errors: No known data errors

no /proc/spl/kstat/zfs/arcstats or /proc/spl/kmem/slab from during the hang.

dnozay commented 10 years ago

note, I couldn't do anything in between because of issue #1621 I believe.

dnozay commented 10 years ago

other symptoms:

behlendorf commented 10 years ago

@dnozay This is very similar to #1928. Thus far we haven't been able to identify the root cause because it occurs very very very rarely but we're aware of the issue and have been investigating.

dnozay commented 10 years ago

in case it is relevant, I had dedup=on, and compression=on; as you may have noted there is also ashift: 9 but blockdev --getbsz /dev/sdb returns 4096.

@behlendorf, since I need to get this working, I am more than willing to help debug. Yesterday, the jenkins slaves also died in a deadlock-like fashion. Could you please see https://gist.github.com/dnozay/9011860 and let me know if it is related? thank you.

dnozay commented 10 years ago

Happened again last night, but couldn't trigger SysRq. Upon reboot, found out it was not even enabled in /etc/sysctl.conf; set kernel.sysrq = 1 then proceeded with reboot. If it happens again in the next few days I will be ready.

dnozay commented 10 years ago

/var/log/messages portion that survived reboot https://gist.github.com/dnozay/9084691#file-messages-txt

Feb 18 14:45:36 jenkins-slave2 kernel: INFO: task spl_kmem_cache/:706 blocked for more than 120 seconds.
Feb 18 14:45:36 jenkins-slave2 kernel:      Tainted: P           ---------------    2.6.32-431.3.1.el6.x86_64 #1
Feb 18 14:45:36 jenkins-slave2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 18 14:45:36 jenkins-slave2 kernel: spl_kmem_cach D 0000000000000000     0   706      2 0x00000000
Feb 18 14:45:36 jenkins-slave2 kernel: ffff88013c9992b0 0000000000000046 0000000000000000 ffffffffa02faa40
Feb 18 14:45:36 jenkins-slave2 kernel: ffff8800bad5d6d0 00000002627e8358 0000000000000004 00000000a03344e3
Feb 18 14:45:36 jenkins-slave2 kernel: ffff8801385bdab8 ffff88013c999fd8 000000000000fbc8 ffff8801385bdab8
Feb 18 14:45:36 jenkins-slave2 kernel: Call Trace:
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa02faa40>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff815280b3>] io_schedule+0x73/0xc0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01cd41c>] cv_wait_common+0xac/0x1c0 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8109b2b0>] ? autoremove_wake_function+0x0/0x40
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01cd548>] __cv_wait_io+0x18/0x20 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa0335c3b>] zio_wait+0xfb/0x1b0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa029c58d>] dbuf_read+0x3fd/0x750 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa02a3fc8>] dmu_buf_hold+0x108/0x1d0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa030060f>] zap_get_leaf_byblk+0x4f/0x2b0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa03008da>] zap_deref_leaf+0x6a/0x80 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa03010c7>] fzap_remove+0x37/0xb0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa030437d>] ? zap_name_alloc+0x8d/0xf0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa0306113>] zap_remove_norm+0x153/0x1d0 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa03061a3>] zap_remove+0x13/0x20 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa02ff551>] zap_remove_int+0x61/0x90 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa030efa7>] zfs_rmnode+0x1c7/0x340 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa032c765>] zfs_zinactive+0xa5/0x110 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa032b75f>] zfs_inactive+0x7f/0x220 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa033e11e>] zpl_clear_inode+0xe/0x10 [zfs]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff811a5bec>] clear_inode+0xac/0x140
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff811a5cc0>] dispose_list+0x40/0x120
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff811a6014>] shrink_icache_memory+0x274/0x2e0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81138aea>] shrink_slab+0x12a/0x1a0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8113aec7>] do_try_to_free_pages+0x3f7/0x610
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8112d2fc>] ? get_page_from_freelist+0x15c/0x870
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8113b2b2>] try_to_free_pages+0x92/0x120
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff811411c0>] ? next_zone+0x30/0x40
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8112f718>] __alloc_pages_nodemask+0x478/0x8d0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81167aaa>] alloc_pages_current+0xaa/0x110
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8112cf4e>] __get_free_pages+0xe/0x50
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8104ec85>] pte_alloc_one_kernel+0x15/0x20
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8114634b>] __pte_alloc_kernel+0x1b/0xc0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81157579>] vmap_page_range_noflush+0x309/0x370
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81157612>] map_vm_area+0x32/0x50
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81159080>] __vmalloc_area_node+0x100/0x190
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c2ee9>] ? kv_alloc+0x59/0x60 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81158f0d>] __vmalloc_node+0xad/0x120
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c2ee9>] ? kv_alloc+0x59/0x60 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff811592f2>] __vmalloc+0x22/0x30
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c2ee9>] kv_alloc+0x59/0x60 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c2f29>] spl_cache_grow_work+0x39/0x410 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81058d53>] ? __wake_up+0x53/0x70
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c6628>] taskq_thread+0x218/0x4b0 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81527920>] ? thread_return+0x4e/0x76e
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffffa01c6410>] ? taskq_thread+0x0/0x4b0 [spl]
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8109af06>] kthread+0x96/0xa0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8109ae70>] ? kthread+0x0/0xa0
Feb 18 14:45:36 jenkins-slave2 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20

/proc/spl/kstat/zfs/arcstats 5min before the crash; couldn't do it afterwards as everything was hung: https://gist.github.com/dnozay/9084691#file-arcstats-txt

ps output 5min before the crash: https://gist.github.com/dnozay/9084691#file-ps-txt

/proc/spl/kmem/slab 5min before the crash: https://gist.github.com/dnozay/9084691#file-slab-txt

behlendorf commented 8 years ago

Closing as stale. The issue was originally reported again 0.6.2. If you're able to reproduce this against 0.6.5.x or newly please let us know.