naota / linux

Linux kernel source tree
Other
3 stars 1 forks source link

hung_task issue by wait_event() in check_system_chunk() #51

Open morbidrsa opened 3 years ago

morbidrsa commented 3 years ago

We recently observed a hung_task issue on our 5.13-rc* test runs (without any lockdep warnings!). The code is based on Linux 5.13-rc6 + a patch to fix unbalanced unlock in qgroup_account_snapshot() [1] + some debug printings. I found the cause of the issue, but I'm not sure how we can fix it.

[1] https://lore.kernel.org/linux-btrfs/20210621012114.1884779-1-naohiro.aota@wdc.com/T/#u

In btrfs_chunk_alloc(), we set space_info->chunk_alloc = 1 before a chunk allocation and turn it to 0 after the allocation. We block other chunk allocations while space_info->chunk_alloc == 1. So, the code should be forward-progress while space_info->chunk_alloc == 1.

However, commit eafa4fd0ad06 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations") broke the assumption. It introduced wait_event() to wait for an ongoing transaction which allocated a new chunk to finish. That waiting leads to an invisible circular dependency. If the thread that allocated a chunk tries to allocate another chunk, it cannot proceed because the wait_event()'ing thread set space_info->chunk_alloc = 1.

We might need to set space_info->chunk_alloc = 0 while wait_event()? But then we need to recheck everything again from the start of btrfs_chunk_alloc()?

These messages are from check_system_chunk() after "trans->chunk_bytes_reserved += thresh;"

[ 905.785876][ T722] 722 add 393216 to transaction 233 [ 905.800494][ T5063] 5063 add 393216 to transaction 233

Thread 722 and 5063 reserved space for its chunk allocation metadata on the running transaction.

[ 905.806845][ T5070] BTRFS info (device nullb1): left=65536, need=393216, flags=4 [ 905.811547][ T5070] BTRFS info (device nullb1): space_info 2 has 65536 free, is not full ...

This message is from check_system_chunk() before the wait_event()

[ 905.831000][ T5070] waiting for reserved 786432 trans 0 min_needed 393216

Now thread 5070 does not have enough SYSTEM space (left=65536 < need=393216) for its chunk allocation. And, since other transaction_handle in the same transaction (transaction_handle of thread 722 and 5063) already reserved space for them, it now waits for them to free their reservation with transaction commit.

[ 1108.499029][ T22] INFO: task kworker/u4:5:721 blocked for more than 122 seconds.

But, nothing progressed!

So, what is happened with threads 722 and 5063?

Backtrace of thread 722

crash> bt 722 
PID: 722    TASK: ffff888100f80000  CPU: 0   COMMAND: "kworker/u4:6"
 #0 [ffffc90002467160] __schedule at ffffffff8260da25
 #1 [ffffc90002467268] preempt_schedule_common at ffffffff8260f635
 #2 [ffffc90002467288] __cond_resched at ffffffff8260f68d
 #3 [ffffc90002467298] btrfs_chunk_alloc at ffffffffa03cd50a [btrfs]
 #4 [ffffc90002467330] find_free_extent at ffffffffa01adfa2 [btrfs]
 #5 [ffffc900024674f0] btrfs_reserve_extent at ffffffffa01bf7d0 [btrfs]
 #6 [ffffc90002467580] btrfs_alloc_tree_block at ffffffffa01c074f [btrfs]
 #7 [ffffc90002467720] alloc_tree_block_no_bg_flush at ffffffffa018d9e5 [btrfs]
 #8 [ffffc90002467780] __btrfs_cow_block at ffffffffa0194fb1 [btrfs]
 #9 [ffffc900024678d0] btrfs_cow_block at ffffffffa01960b5 [btrfs]
#10 [ffffc90002467950] btrfs_search_slot at ffffffffa01a15c7 [btrfs]
#11 [ffffc90002467ac0] btrfs_remove_chunk at ffffffffa027f43c [btrfs]
#12 [ffffc90002467c50] btrfs_relocate_chunk at ffffffffa0280503 [btrfs]
#13 [ffffc90002467c88] btrfs_reclaim_bgs_work.cold at ffffffffa040e7b7 [btrfs]
#14 [ffffc90002467c90] __kasan_check_read at ffffffff8167ebf1
#15 [ffffc90002467d08] process_one_work at ffffffff811ae718
#16 [ffffc90002467e40] worker_thread at ffffffff811af8de
#17 [ffffc90002467f08] kthread at ffffffff811c0857
#18 [ffffc90002467f50] ret_from_fork at ffffffff810034b2

So, here it is. Since check_system_chunk() takes neither space_info->lock nor fs_info->chunk_mutex while the waiting, thread 722 is in busy-loop to wait for space_info->chunk_alloc == 0.

3 [ffffc90002467298] btrfs_chunk_alloc at ffffffffa03cd50a [btrfs]

/home/naota/src/linux-works/btrfs-zoned/fs/btrfs/block-group.c: 3282

And, how about thread 5063?

crash> bt 5063
PID: 5063   TASK: ffff8881ca254000  CPU: 0   COMMAND: "fsstress"
 #0 [ffffc90007d46fd8] __schedule at ffffffff8260da25
 #1 [ffffc90007d470e0] schedule at ffffffff8260f247
 #2 [ffffc90007d47110] rwsem_down_read_slowpath at ffffffff82619b9d
 #3 [ffffc90007d47268] __down_read_common at ffffffff8126492c
 #4 [ffffc90007d47318] down_read_nested at ffffffff81264da4
 #5 [ffffc90007d47368] btrfs_read_lock_root_node at ffffffffa02b5007 [btrfs]
 #6 [ffffc90007d47390] btrfs_search_slot at ffffffffa01a181b [btrfs]
 #7 [ffffc90007d47500] _raw_spin_unlock_irqrestore at ffffffff8262349c
 #8 [ffffc90007d47550] insert_with_overflow at ffffffffa01c74b9 [btrfs]
 #9 [ffffc90007d47600] btrfs_insert_dir_item at ffffffffa01c7bff [btrfs]
#10 [ffffc90007d47738] create_subvol at ffffffffa02a610b [btrfs]
#11 [ffffc90007d47970] btrfs_mksubvol at ffffffffa02a7829 [btrfs]
#12 [ffffc90007d47a08] _copy_from_user at ffffffff81b315ab
#13 [ffffc90007d47a60] btrfs_ioctl_snap_create_v2 at ffffffffa02a832f [btrfs]
#14 [ffffc90007d47ab8] btrfs_ioctl at ffffffffa02ae668 [btrfs]
#15 [ffffc90007d47d40] __x64_sys_ioctl at ffffffff8171c09f
#16 [ffffc90007d47f38] do_syscall_64 at ffffffff82602a70
#17 [ffffc90007d47f50] entry_SYSCALL_64_after_hwframe at ffffffff8280007c

So, it's contended here:

  #4 [ffffc90007d47318] down_read_nested at ffffffff81264da4
    ffffc90007d47320: [ffff88811a30bc20:btrfs_extent_buffer] 0000000000000000 
    ffffc90007d47330: 0000000000000000 ffffc90007d47360 
    ffffc90007d47340: __btrfs_tree_read_lock+40 [ffff88811a30bc20:btrfs_extent_buffer] 
    ffffc90007d47350: ffffed102e902800 dffffc0000000000 
    ffffc90007d47360: ffffc90007d47388 btrfs_read_lock_root_node+71 

Thread 5063 is trying to read lock this extent buffer.

crash> struct extent_buffer ffff88811a30bc20
struct extent_buffer {
  start = 1803321344, 
...
  lock_owner = 5065, 
...

The lock is owned by thread 5065.

crash> bt 5065
PID: 5065   TASK: ffff88812a7f0000  CPU: 1   COMMAND: "fsstress"
#0 [ffffc90007d66ab0] __schedule at ffffffff8260da25
#1 [ffffc90007d66bb8] preempt_schedule_common at ffffffff8260f635
#2 [ffffc90007d66bd8] __cond_resched at ffffffff8260f68d
#3 [ffffc90007d66be8] btrfs_chunk_alloc at ffffffffa03cd50a [btrfs]
#4 [ffffc90007d66c80] find_free_extent at ffffffffa01adfa2 [btrfs]
#5 [ffffc90007d66e40] btrfs_reserve_extent at ffffffffa01bf7d0 [btrfs]
#6 [ffffc90007d66ed0] btrfs_alloc_tree_block at ffffffffa01c074f [btrfs]
#7 [ffffc90007d67070] alloc_tree_block_no_bg_flush at ffffffffa018d9e5 [btrfs]
#8 [ffffc90007d670d0] __btrfs_cow_block at ffffffffa0194fb1 [btrfs]
#9 [ffffc90007d67220] btrfs_cow_block at ffffffffa01960b5 [btrfs]
#10 [ffffc90007d672a0] btrfs_search_slot at ffffffffa01a15c7 [btrfs]
#11 [ffffc90007d67410] btrfs_lookup_file_extent at ffffffffa01ca05e [btrfs]
#12 [ffffc90007d674a8] btrfs_drop_extents at ffffffffa0233449 [btrfs]
#13 [ffffc90007d67738] btrfs_replace_file_extents at ffffffffa023762d [btrfs]
#14 [ffffc90007d678d8] btrfs_clone at ffffffffa03d674f [btrfs]
#15 [ffffc90007d67b20] btrfs_extent_same_range at ffffffffa03d6d38 [btrfs]
#16 [ffffc90007d67b68] btrfs_remap_file_range at ffffffffa03d7b68 [btrfs]
#17 [ffffc90007d67c60] vfs_dedupe_file_range_one at ffffffff8179f58b
#18 [ffffc90007d67cd0] vfs_dedupe_file_range at ffffffff8179fadd
#19 [ffffc90007d67d40] __x64_sys_ioctl at ffffffff8171c344
#20 [ffffc90007d67f38] do_syscall_64 at ffffffff82602a70
#21 [ffffc90007d67f50] entry_SYSCALL_64_after_hwframe at ffffffff8280007c

Again! Thread 5065 is busy-looping in the same loop in btrfs_chunk_alloc().

I'm not sure how we can fix it without breaking the original intention of commit eafa4fd0ad06. We might need to set space_info->chunk_alloc = 0 while in the wait_event(). But then, we also need to recheck everything from the start of btrfs_chunk_alloc().

Or, we can commit the transaction before another chunk allocation if its transaction_handle is waited?

naota commented 3 years ago

WIP thread https://lore.kernel.org/linux-btrfs/CAL3q7H4=+7KhC=mE9uHOeN2WDrhVj+Uz_8t_Gvyt_N9rL_KhKA@mail.gmail.com/T/