open-iscsi / tcmu-runner

A daemon that handles the userspace side of the LIO TCM-User backstore.
Apache License 2.0
188 stars 148 forks source link

Stop TCMU userspace daemon when writting data to the device result in kernel hang task #6

Closed yasker closed 8 years ago

yasker commented 9 years ago

I am trying to use TCMU for a high-availability task, which involve writing a userspace daemon to provide data of an simulated local block device. But I found I can easily hang the block device generated by TCMU.

Environment: Linux v4.2-rc2, use file_handler example and tcm_loop.

Reproduce step:

  1. Create a block device using file_handler and tcm_loop.
  2. Mount it locally
  3. Execute ```dd if=/dev/zero of=test"
  4. After a few seconds, stop the tcmu-runner()

It cause immediate hang for dd, and remove the block device(e.g. unlink /sys/kernel/config/target/loopback/naa.60014059436855c1/tpgt_1/lun/lun_0/c68d79910b) would be hang as well.

I enabled kernel debug and got this after a few minutes:

[ 6052.703055] 0 completions handled from timeout
[ 6052.713565] ABORT_TASK: Found referenced loopback task_tag: 1
[ 6052.715992] INFO: trying to register non-static key.
[ 6052.716977] the code is fine but needs lockdep annotation.
[ 6052.716977] turning off the locking correctness validator.
[ 6052.716977] CPU: 3 PID: 6 Comm: kworker/u8:0 Not tainted 4.3.0-rc2+ #14
[ 6052.716977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
[ 6052.716977] Workqueue: tmr-user target_tmr_work
[ 6052.716977]  0000000000000000 ffff88013ab47ad0 ffffffff813427f2 ffff88013ab16680
[ 6052.716977]  ffff88013ab47ae0 ffffffff8112930b ffff88013ab47b78 ffffffff81098b1f
[ 6052.716977]  0000000000000019 0000000000000001 ffffffff81075502 ffffffff81075490
[ 6052.716977] Call Trace:
[ 6052.716977]  [<ffffffff813427f2>] dump_stack+0x44/0x62
[ 6052.716977]  [<ffffffff8112930b>] register_lock_class.part.22+0x38/0x3c
[ 6052.716977]  [<ffffffff81098b1f>] __lock_acquire.isra.30+0xf1f/0xf40
[ 6052.716977]  [<ffffffff81075502>] ? __atomic_notifier_call_chain+0x72/0x90
[ 6052.716977]  [<ffffffff81075490>] ? srcu_init_notifier_head+0x50/0x50
[ 6052.716977]  [<ffffffff81075502>] ? __atomic_notifier_call_chain+0x72/0x90
[ 6052.716977]  [<ffffffff810994b3>] lock_acquire+0xa3/0x130
[ 6052.716977]  [<ffffffff8106d240>] ? __queue_delayed_work+0x190/0x190
[ 6052.716977]  [<ffffffff8106d278>] flush_work+0x38/0x250
[ 6052.716977]  [<ffffffff8106d240>] ? __queue_delayed_work+0x190/0x190
[ 6052.716977]  [<ffffffff810a80dd>] ? wake_up_klogd+0x2d/0x30
[ 6052.716977]  [<ffffffff810a82d9>] ? console_unlock+0x1f9/0x4d0
[ 6052.716977]  [<ffffffff810a8874>] ? vprintk_emit+0x2c4/0x550
[ 6052.716977]  [<ffffffff8106e233>] __cancel_work_timer+0x93/0x1a0
[ 6052.716977]  [<ffffffff81969cbc>] ? _raw_spin_unlock_irqrestore+0x2c/0x40
[ 6052.716977]  [<ffffffff8106e34b>] cancel_work_sync+0xb/0x10
[ 6052.716977]  [<ffffffff815850ba>] core_tmr_abort_task+0x12a/0x200
[ 6052.716977]  [<ffffffff81587802>] target_tmr_work+0xc2/0xd0
[ 6052.716977]  [<ffffffff8106e523>] process_one_work+0x1c3/0x4c0
[ 6052.716977]  [<ffffffff8106e4b1>] ? process_one_work+0x151/0x4c0
[ 6052.716977]  [<ffffffff8106eb46>] worker_thread+0x46/0x440
[ 6052.716977]  [<ffffffff8106eb00>] ? rescuer_thread+0x2e0/0x2e0
[ 6052.716977]  [<ffffffff81073f0f>] kthread+0xdf/0x100
[ 6052.716977]  [<ffffffff81073e30>] ? kthread_create_on_node+0x200/0x200
[ 6052.716977]  [<ffffffff8196a81f>] ret_from_fork+0x3f/0x70
[ 6052.716977]  [<ffffffff81073e30>] ? kthread_create_on_node+0x200/0x200
[ 6240.128088] INFO: task kworker/u8:0:6 blocked for more than 120 seconds.
[ 6240.130915]       Not tainted 4.3.0-rc2+ #14
[ 6240.132887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6240.136527] kworker/u8:0    D ffff88013fd95f58 12192     6      2 0x00000000
[ 6240.139429] Workqueue: tmr-user target_tmr_work
[ 6240.141525]  ffff88013ab47bd8 0000000000000082 ffffffff8106d447 ffff880100000000
[ 6240.145131]  ffff88013ab68000 ffff88013ab16680 ffff88013ab48000 ffff8800a05efda0
[ 6240.148644]  ffff8800a05efd98 ffff88013ab16680 ffff8800a05efd58 ffff88013ab47bf0
[ 6240.152179] Call Trace:
[ 6240.153548]  [<ffffffff8106d447>] ? flush_work+0x207/0x250
[ 6240.155646]  [<ffffffff81964fbe>] schedule+0x2e/0x70
[ 6240.157584]  [<ffffffff8196908c>] schedule_timeout+0x1bc/0x2b0
[ 6240.159730]  [<ffffffff81966143>] ? wait_for_completion+0x23/0x100
[ 6240.161965]  [<ffffffff819661ae>] ? wait_for_completion+0x8e/0x100
[ 6240.164189]  [<ffffffff819661b6>] wait_for_completion+0x96/0x100
[ 6240.166411]  [<ffffffff8107e1f0>] ? wake_up_q+0x60/0x60
[ 6240.168292]  [<ffffffff815876f9>] transport_wait_for_tasks+0x69/0xb0
[ 6240.170464]  [<ffffffff815850c2>] core_tmr_abort_task+0x132/0x200
[ 6240.172543]  [<ffffffff81587802>] target_tmr_work+0xc2/0xd0
[ 6240.174521]  [<ffffffff8106e523>] process_one_work+0x1c3/0x4c0
[ 6240.176530]  [<ffffffff8106e4b1>] ? process_one_work+0x151/0x4c0
[ 6240.178623]  [<ffffffff8106eb46>] worker_thread+0x46/0x440
[ 6240.180460]  [<ffffffff8106eb00>] ? rescuer_thread+0x2e0/0x2e0
[ 6240.182401]  [<ffffffff81073f0f>] kthread+0xdf/0x100
[ 6240.184120]  [<ffffffff81073e30>] ? kthread_create_on_node+0x200/0x200
[ 6240.186215]  [<ffffffff8196a81f>] ret_from_fork+0x3f/0x70
[ 6240.188041]  [<ffffffff81073e30>] ? kthread_create_on_node+0x200/0x200
[ 6240.190123] INFO: lockdep is turned off.
[ 6240.191569] INFO: task kworker/u8:2:6027 blocked for more than 120 seconds.
[ 6240.193658]       Not tainted 4.3.0-rc2+ #14
[ 6240.195153] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6240.197948] kworker/u8:2    D ffff88013fd95f58 11728  6027      2 0x00000000
[ 6240.200119] Workqueue: scsi_tmf_2 scmd_eh_abort_handler
[ 6240.201856]  ffff880139abfcd0 0000000000000086 ffff8800a0749248 0000000000000000
[ 6240.204507]  ffff88013ab16680 ffff8800b8033d80 ffff880139ac0000 ffff8800a02991b8
[ 6240.207275]  ffff8800a02991a8 ffff8800a0749248 0000000000000003 ffff880139abfce8
[ 6240.209949] Call Trace:
[ 6240.211070]  [<ffffffff81964fbe>] schedule+0x2e/0x70
[ 6240.212587]  [<ffffffff8159a74b>] tcm_loop_issue_tmr+0x12b/0x1e0
[ 6240.214452]  [<ffffffff81093da0>] ? wait_woken+0x80/0x80
[ 6240.216121]  [<ffffffff8159a895>] tcm_loop_abort_task+0x45/0x60
[ 6240.218072]  [<ffffffff815381bd>] scmd_eh_abort_handler+0x5d/0x100
[ 6240.219935]  [<ffffffff8106e523>] process_one_work+0x1c3/0x4c0
[ 6240.221730]  [<ffffffff8106e4b1>] ? process_one_work+0x151/0x4c0
[ 6240.223560]  [<ffffffff8106eb46>] worker_thread+0x46/0x440
[ 6240.225287]  [<ffffffff8106eb00>] ? rescuer_thread+0x2e0/0x2e0
[ 6240.227162]  [<ffffffff8106eb00>] ? rescuer_thread+0x2e0/0x2e0
[ 6240.228974]  [<ffffffff81073f0f>] kthread+0xdf/0x100
[ 6240.230538]  [<ffffffff81073e30>] ? kthread_create_on_node+0x200/0x200
[ 6240.232538]  [<ffffffff8196a81f>] ret_from_fork+0x3f/0x70
[ 6240.234221]  [<ffffffff81073e30>] ? kthread_create_on_node+0x200/0x200
[ 6240.236142] INFO: lockdep is turned off.
[ 6240.237464] INFO: task kworker/u8:1:6132 blocked for more than 120 seconds.
[ 6240.239440]       Not tainted 4.3.0-rc2+ #14
[ 6240.240852] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6240.243481] kworker/u8:1    D ffff88013fc95f58 12016  6132      2 0x00000000
[ 6240.245550] Workqueue: writeback wb_workfn (flush-8:0)
[ 6240.247215]  ffff8800bad0b4c8 0000000000000086 0000000000000086 ffffffff00000000
[ 6240.249876]  ffff8801395a0000 ffff880139bc5200 ffff8800bad0c000 ffff88013fc95f40
[ 6240.263528]  7fffffffffffffff 0000000000000002 ffffffff819657a0 ffff8800bad0b4e0
[ 6240.265925] Call Trace:
[ 6240.266863]  [<ffffffff819657a0>] ? bit_wait_timeout+0x60/0x60
[ 6240.268441]  [<ffffffff81964fbe>] schedule+0x2e/0x70
[ 6240.269862]  [<ffffffff8196908c>] schedule_timeout+0x1bc/0x2b0
[ 6240.271418]  [<ffffffff810f7b8a>] ? __delayacct_blkio_start+0x1a/0x30
[ 6240.273172]  [<ffffffff810f7b8a>] ? __delayacct_blkio_start+0x1a/0x30
[ 6240.274868]  [<ffffffff810446de>] ? kvm_clock_get_cycles+0x1e/0x20
[ 6240.276440]  [<ffffffff810bf9c0>] ? ktime_get+0x90/0x110
[ 6240.277861]  [<ffffffff810f7b8a>] ? __delayacct_blkio_start+0x1a/0x30
[ 6240.279475]  [<ffffffff819657a0>] ? bit_wait_timeout+0x60/0x60
[ 6240.281048]  [<ffffffff819643ef>] io_schedule_timeout+0x9f/0x110
[ 6240.282578]  [<ffffffff819657d0>] bit_wait_io+0x30/0x50
[ 6240.284017]  [<ffffffff81965536>] __wait_on_bit_lock+0x46/0xa0
[ 6240.285527]  [<ffffffff819657a0>] ? bit_wait_timeout+0x60/0x60
[ 6240.287027]  [<ffffffff819655fd>] out_of_line_wait_on_bit_lock+0x6d/0x80
[ 6240.288698]  [<ffffffff81093dd0>] ? autoremove_wake_function+0x30/0x30
[ 6240.290360]  [<ffffffff811c2d30>] __lock_buffer+0x30/0x40
[ 6240.291801]  [<ffffffff8125cf5b>] do_get_write_access+0x1fb/0x4a0
[ 6240.293342]  [<ffffffff8125d22e>] jbd2_journal_get_write_access+0x2e/0x60
[ 6240.295051]  [<ffffffff81243d56>] __ext4_journal_get_write_access+0x36/0x80
[ 6240.296754]  [<ffffffff8124aae8>] ext4_mb_mark_diskspace_used+0x68/0x470
[ 6240.298508]  [<ffffffff8124c1de>] ext4_mb_new_blocks+0x2ae/0x530
[ 6240.299998]  [<ffffffff8123c752>] ? ext4_find_extent+0x1b2/0x320
[ 6240.301557]  [<ffffffff81241246>] ext4_ext_map_blocks+0x5d6/0xd90
[ 6240.303155]  [<ffffffff812145aa>] ext4_map_blocks+0x12a/0x500
[ 6240.304665]  [<ffffffff8121769d>] ext4_writepages+0x50d/0xcb0
[ 6240.306134]  [<ffffffff81134590>] ? wb_position_ratio+0x1f0/0x1f0
[ 6240.307762]  [<ffffffff81137bdc>] do_writepages+0x1c/0x30
[ 6240.309166]  [<ffffffff811bac30>] __writeback_single_inode+0x50/0x300
[ 6240.310848]  [<ffffffff81969c82>] ? _raw_spin_unlock+0x22/0x30
[ 6240.312329]  [<ffffffff811bb518>] writeback_sb_inodes+0x298/0x520
[ 6240.313943]  [<ffffffff811bb824>] __writeback_inodes_wb+0x84/0xb0
[ 6240.315500]  [<ffffffff811bbb40>] wb_writeback+0x250/0x2c0
[ 6240.316939]  [<ffffffff811bc253>] wb_workfn+0x213/0x390
[ 6240.318332]  [<ffffffff8106e523>] process_one_work+0x1c3/0x4c0
[ 6240.319875]  [<ffffffff8106e4b1>] ? process_one_work+0x151/0x4c0
[ 6240.321402]  [<ffffffff8106eb46>] worker_thread+0x46/0x440
[ 6240.322887]  [<ffffffff8106eb00>] ? rescuer_thread+0x2e0/0x2e0
[ 6240.324370]  [<ffffffff8106eb00>] ? rescuer_thread+0x2e0/0x2e0
[ 6240.325874]  [<ffffffff81073f0f>] kthread+0xdf/0x100
[ 6240.327261]  [<ffffffff81073e30>] ? kthread_create_on_node+0x200/0x200
[ 6240.328900]  [<ffffffff8196a81f>] ret_from_fork+0x3f/0x70
[ 6240.330335]  [<ffffffff81073e30>] ? kthread_create_on_node+0x200/0x200
[ 6240.332014] INFO: lockdep is turned off.
[ 6240.333183] INFO: task jbd2/sda-8:6281 blocked for more than 120 seconds.
[ 6240.334869]       Not tainted 4.3.0-rc2+ #14
[ 6240.336069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6240.338367] jbd2/sda-8      D ffff88013fd95f58 13952  6281      2 0x00000000
[ 6240.340195]  ffff8800b6b67c58 0000000000000086 ffff8800b6b67d40 0000000000000000
[ 6240.342554]  ffff88013ab68000 ffff88009fdb6680 ffff8800b6b68000 ffff8800b6b67d58
[ 6240.344934]  ffff88009fdb6680 ffff8800b6b67d40 ffff8800b6cae8c0 ffff8800b6b67c70
[ 6240.347313] Call Trace:
[ 6240.348230]  [<ffffffff81964fbe>] schedule+0x2e/0x70
[ 6240.349597]  [<ffffffff8125e725>] jbd2_journal_commit_transaction+0x245/0x1760
[ 6240.351834]  [<ffffffff8108d4cd>] ? pick_next_task_fair+0x3fd/0x580
[ 6240.353451]  [<ffffffff81093da0>] ? wait_woken+0x80/0x80
[ 6240.354862]  [<ffffffff810b785a>] ? try_to_del_timer_sync+0x4a/0x60
[ 6240.356431]  [<ffffffff81263170>] kjournald2+0xc0/0x260
[ 6240.357871]  [<ffffffff81093da0>] ? wait_woken+0x80/0x80
[ 6240.359259]  [<ffffffff812630b0>] ? commit_timeout+0x10/0x10
[ 6240.360766]  [<ffffffff81073f0f>] kthread+0xdf/0x100
[ 6240.362141]  [<ffffffff81073e30>] ? kthread_create_on_node+0x200/0x200
[ 6240.363815]  [<ffffffff8196a81f>] ret_from_fork+0x3f/0x70
[ 6240.365276]  [<ffffffff81073e30>] ? kthread_create_on_node+0x200/0x200
[ 6240.366902] INFO: lockdep is turned off.
[ 6240.368075] INFO: task dd:6467 blocked for more than 120 seconds.
[ 6240.369677]       Not tainted 4.3.0-rc2+ #14
[ 6240.370930] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6240.373262] dd              D ffff88013fc95f58 13752  6467   6466 0x00000004
[ 6240.375094]  ffff8800a0743a08 0000000000000086 0000000000000082 0000000000000000
[ 6240.377492]  ffff88013ab4d200 ffff8800b80ea900 ffff8800a0744000 ffff8800aaba4038
[ 6240.379855]  ffff8800aaba4050 ffff8800aaba40b8 000000000001761f ffff8800a0743a20
[ 6240.382244] Call Trace:
[ 6240.383133]  [<ffffffff81964fbe>] schedule+0x2e/0x70
[ 6240.384533]  [<ffffffff81968ace>] rwsem_down_read_failed+0xde/0x120
[ 6240.386093]  [<ffffffff81350104>] call_rwsem_down_read_failed+0x14/0x30
[ 6240.387750]  [<ffffffff81968442>] ? down_read+0x42/0x50
[ 6240.389191]  [<ffffffff81212ecc>] ? ext4_da_get_block_prep+0x18c/0x410
[ 6240.390890]  [<ffffffff81212ecc>] ext4_da_get_block_prep+0x18c/0x410
[ 6240.392469]  [<ffffffff81969c82>] ? _raw_spin_unlock+0x22/0x30
[ 6240.394036]  [<ffffffff811c46a5>] __block_write_begin+0x155/0x460
[ 6240.395560]  [<ffffffff81212d40>] ? ext4_journalled_invalidatepage+0x30/0x30
[ 6240.397302]  [<ffffffff81243b2c>] ? __ext4_journal_start_sb+0x6c/0xd0
[ 6240.398968]  [<ffffffff81218ba1>] ext4_da_write_begin+0x151/0x2d0
[ 6240.400523]  [<ffffffff8112a372>] generic_perform_write+0xa2/0x1e0
[ 6240.402117]  [<ffffffff811a7a76>] ? file_update_time+0x36/0xe0
[ 6240.403620]  [<ffffffff8112c54b>] __generic_file_write_iter+0x18b/0x1e0
[ 6240.405297]  [<ffffffff8120e750>] ext4_file_write_iter+0x210/0x420
[ 6240.406826]  [<ffffffff8118f67c>] ? __sb_start_write+0xec/0x100
[ 6240.408414]  [<ffffffff8118c5a5>] __vfs_write+0xa5/0xe0
[ 6240.409836]  [<ffffffff8118cbf4>] vfs_write+0xa4/0x190
[ 6240.411246]  [<ffffffff8118cb2f>] ? vfs_read+0x10f/0x130
[ 6240.412671]  [<ffffffff8118d8b4>] SyS_write+0x44/0xa0
[ 6240.414030]  [<ffffffff8196a4b2>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 6240.415642] INFO: lockdep is turned off.
[ 6240.416827] INFO: task unlink:6476 blocked for more than 120 seconds.
[ 6240.418481]       Not tainted 4.3.0-rc2+ #14
[ 6240.419745] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6240.422039] unlink          D ffff88013fd15f58 14160  6476   5638 0x00000000
[ 6240.423770]  ffff8800b6af7b48 0000000000000082 0000000000000002 ffffffff81f4f9d8
[ 6240.426070]  ffff8800bba6bd80 ffff8800bba69480 ffff8800b6af8000 ffff8800bbad62f0
[ 6240.428258]  ffff8800b6af7bc0 ffff8800b6af7bbc 0000000000000000 ffff8800b6af7b60
[ 6240.430665] Call Trace:
[ 6240.431620]  [<ffffffff81964fbe>] schedule+0x2e/0x70
[ 6240.432955]  [<ffffffff811b5c74>] wb_wait_for_completion+0x44/0x70
[ 6240.434552]  [<ffffffff81093da0>] ? wait_woken+0x80/0x80
[ 6240.435972]  [<ffffffff811b7482>] __writeback_inodes_sb_nr+0x82/0xb0
[ 6240.437585]  [<ffffffff811b74e2>] writeback_inodes_sb+0x22/0x30
[ 6240.439097]  [<ffffffff811bff37>] sync_filesystem+0x27/0xa0
[ 6240.440578]  [<ffffffff811c788f>] fsync_bdev+0x1f/0x50
[ 6240.441989]  [<ffffffff8132f4ef>] invalidate_partition+0x1f/0x40
[ 6240.443580]  [<ffffffff813307da>] del_gendisk+0xda/0x1f0
[ 6240.445030]  [<ffffffff8154888c>] sd_remove+0x5c/0xc0
[ 6240.446425]  [<ffffffff8150dde1>] __device_release_driver+0x91/0x130
[ 6240.448041]  [<ffffffff8150dea0>] device_release_driver+0x20/0x30
[ 6240.449624]  [<ffffffff8150cfaa>] bus_remove_device+0x10a/0x150
[ 6240.451162]  [<ffffffff815099e4>] device_del+0x134/0x250
[ 6240.452602]  [<ffffffff8154099d>] __scsi_remove_device+0xbd/0xd0
[ 6240.454102]  [<ffffffff815409d6>] scsi_remove_device+0x26/0x40
[ 6240.456139]  [<ffffffff8159a07b>] tcm_loop_port_unlink+0x3b/0x70
[ 6240.458176]  [<ffffffff8157afd7>] target_fabric_port_unlink+0x37/0x50
[ 6240.460403]  [<ffffffff8120a9d8>] configfs_unlink+0xe8/0x1a0
[ 6240.462367]  [<ffffffff8119921e>] vfs_unlink+0xee/0x180
[ 6240.464258]  [<ffffffff8119cca7>] do_unlinkat+0x227/0x290
[ 6240.466174]  [<ffffffff8119d641>] SyS_unlink+0x11/0x20
[ 6240.468056]  [<ffffffff8196a4b2>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 6240.470152] INFO: lockdep is turned off.
agrover commented 9 years ago

I think our timeout handling needs work. We should be completing requests from the timeout (with an error) but we're not, currently. EDIT: doh. looks like we actually do. Will keep digging.

yasker commented 8 years ago

I've spend some debug into the issue and found it's because of command deadline calculated wrong. Though when I want to submit a patch for it, found your commit is already in place in November:

commit 611e2267b68fc061aea86345b3a8b87151395187 Author: Andy Grover agrover@redhat.com Date: Fri Nov 13 10:42:19 2015 -0800

target/user: Fix time calc in expired cmd processing

Reversed arguments meant that we were doing nothing for cmds whose deadline
had passed.

Signed-off-by: Andy Grover <agrover@redhat.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>

I've confirmed this commit fixed the issue, so close it.

agrover commented 8 years ago

Cool. I thought that might be the same issue but wasn't sure.