naota / linux

Linux kernel source tree

hung_task by running RocksDB load on Btrfs backed by ZNS SSD #59

Open MaisenbacherD opened 2 years ago

MaisenbacherD commented 2 years ago

When running a RocksDB workload on a Btrfs-formatted ZNS SSD, a hung_task issue occurs:

[  726.212488] INFO: task rocksdb:high0:23730 blocked for more than 120 seconds.
[  726.214205]       Not tainted 5.15.0-rc7+ #3
[  726.215205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  726.217033] task:rocksdb:high0   state:D stack:    0 pid:23730 ppid: 23726 flags:0x00000000
[  726.217040] Call Trace:
[  726.217047]  __schedule+0x2e4/0x900
[  726.217057]  ? memcg_check_events+0xbb/0x1b0
[  726.217064]  schedule+0x4e/0xb0
[  726.217068]  io_schedule+0x16/0x40
[  726.217073]  wait_on_page_bit_common+0x16e/0x3b0
[  726.217079]  ? xas_find+0x183/0x1b0
[  726.217085]  ? filemap_invalidate_unlock_two+0x40/0x40
[  726.217090]  __lock_page+0x4c/0x50
[  726.217094]  truncate_inode_pages_range+0x619/0x770
[  726.217105]  truncate_inode_pages_final+0x4d/0x60
[  726.217110]  btrfs_evict_inode+0x67/0x480
[  726.217117]  evict+0xd2/0x180
[  726.217122]  iput+0x189/0x200
[  726.217125]  do_unlinkat+0x1c0/0x2b0
[  726.217130]  __x64_sys_unlink+0x23/0x30
[  726.217133]  do_syscall_64+0x3b/0xc0
[  726.217138]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  726.217143] RIP: 0033:0x7f0d5deab71b
[  726.217148] RSP: 002b:00007f0d5d594c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[  726.217153] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0d5deab71b
[  726.217156] RDX: 00007f0d5d595040 RSI: 00005650330a4c20 RDI: 00007f0d58022670
[  726.217158] RBP: 00007f0d5d594d00 R08: 0000000000000000 R09: 0000000000000000
[  726.217161] R10: 00007f0d5d5944d0 R11: 0000000000000246 R12: 00007f0d5d595000
[  726.217163] R13: 00007f0d5d595040 R14: 0000000000000000 R15: 00007f0d5d597260
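
As an aside, when chasing hangs like this it can help to dump every blocked task at once instead of waiting for the watchdog to fire. A sketch using the standard sysrq and hung_task knobs (these are the mainline procfs/sysctl paths; adjust to taste):

```shell
# Dump backtraces of all D-state (uninterruptible) tasks right now.
echo w | sudo tee /proc/sysrq-trigger

# Report hung tasks more often while debugging (default is 120s),
# and optionally panic on detection to capture a crash dump.
echo 30 | sudo tee /proc/sys/kernel/hung_task_timeout_secs
echo 1  | sudo tee /proc/sys/kernel/hung_task_panic
```

The sysrq output lands in the kernel log (dmesg), alongside the watchdog reports.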

The steps to reproduce this behavior:

Kernel in use: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next at commit 23b461ade5edb7861021c6cddd5bca9b6c9285c2

Installing RocksDB's db_bench:

cd ~
mkdir -p rocksdb-env
cd rocksdb-env
git clone https://github.com/facebook/rocksdb.git
cd rocksdb
sudo DEBUG_LEVEL=0 make -j8 db_bench install
git clone https://github.com/westerndigitalcorporation/zenfs plugin/zenfs

Format the ZNS SSD (here /dev/nvme0n1) with Btrfs and prepare it for the RocksDB benchmark:

#!/bin/bash

set -e

# ZNS drive
dev="nvme0n1"
sudo sh -c "echo 'mq-deadline' > /sys/block/${dev}/queue/scheduler"

sudo mkfs.btrfs /dev/$dev -f
sudo mkdir -p /mnt/btrfs
sudo mount -t btrfs /dev/$dev /mnt/btrfs
sudo mkdir /mnt/btrfs/testrun

Now run the quick_performance test, which causes the hung_task:

cd ~/rocksdb-env/rocksdb/plugin/zenfs/tests
sudo FS_PARAMS="-db=/mnt/btrfs/testrun" ./run.sh btrfs_testrun quick_performance
morbidrsa commented 2 years ago

Hi Dennis,

Can you do me a favour? Can you dump the content of /sys/kernel/debug/block/nvme0n1/*?

MaisenbacherD commented 2 years ago

Hi Johannes,

Sure, here is the dump (output of rg "" /sys/kernel/debug/block/nvme0n1/):

poll_stat:read  (512 Bytes): samples=0
poll_stat:write (512 Bytes): samples=0
poll_stat:read  (1024 Bytes): samples=0
poll_stat:write (1024 Bytes): samples=0
poll_stat:read  (2048 Bytes): samples=0
poll_stat:write (2048 Bytes): samples=0
poll_stat:read  (4096 Bytes): samples=0
poll_stat:write (4096 Bytes): samples=0
poll_stat:read  (8192 Bytes): samples=0
poll_stat:write (8192 Bytes): samples=0
poll_stat:read  (16384 Bytes): samples=0
poll_stat:write (16384 Bytes): samples=0
poll_stat:read  (32768 Bytes): samples=0
poll_stat:write (32768 Bytes): samples=0
poll_stat:read  (65536 Bytes): samples=0
poll_stat:write (65536 Bytes): samples=0
state:SAME_COMP|NONROT|IO_STAT|DISCARD|INIT_DONE|STATS|REGISTERED|PCI_P2PDMA|ZONE_RESETALL|NOWAIT
write_hints:hint0: 0
write_hints:hint1: 0
write_hints:hint2: 0
write_hints:hint3: 0
write_hints:hint4: 0
pm_only:0
hctx1/sched_tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
hctx1/sched_tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
hctx0/state:
hctx0/flags:alloc_policy=FIFO SHOULD_MERGE
hctx0/cpu0/merged:0
hctx0/cpu0/dispatched:676 24
hctx0/cpu0/completed:676 24
hctx0/ctx_map:00000000: 00
hctx0/tags:nr_tags=1023
hctx0/tags:nr_reserved_tags=0
hctx0/tags:active_queues=0
hctx0/tags:
hctx0/tags:bitmap_tags:
hctx0/tags:depth=1023
hctx0/tags:busy=0
hctx0/tags:cleared=20
hctx0/tags:bits_per_word=64
hctx0/tags:map_nr=16
hctx0/tags:alloc_hint={595, 645, 437, 333}
hctx0/tags:wake_batch=8
hctx0/tags:wake_index=0
hctx0/tags:ws_active=0
hctx0/tags:ws={
hctx0/tags: {.wait_cnt=8, .wait=inactive},
hctx0/tags: {.wait_cnt=8, .wait=inactive},
hctx0/tags: {.wait_cnt=8, .wait=inactive},
hctx0/tags: {.wait_cnt=8, .wait=inactive},
hctx0/tags: {.wait_cnt=8, .wait=inactive},
hctx0/tags: {.wait_cnt=8, .wait=inactive},
hctx0/tags: {.wait_cnt=8, .wait=inactive},
hctx0/tags: {.wait_cnt=8, .wait=inactive},
hctx0/tags:}
hctx0/tags:round_robin=0
hctx0/tags:min_shallow_depth=4294967295
hctx0/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
hctx0/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
hctx0/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
hctx0/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
hctx0/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
hctx0/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
hctx0/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
hctx0/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
hctx0/sched_tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
hctx0/sched_tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
hctx0/io_poll:considered=0
hctx0/io_poll:invoked=0
hctx0/io_poll:success=0
hctx0/sched_tags:nr_tags=256
hctx0/sched_tags:nr_reserved_tags=0
hctx0/sched_tags:active_queues=0
hctx0/sched_tags:
hctx0/sched_tags:bitmap_tags:
hctx0/sched_tags:depth=256
hctx0/sched_tags:busy=0
hctx0/sched_tags:cleared=29
hctx0/sched_tags:bits_per_word=64
hctx0/sched_tags:map_nr=4
hctx0/sched_tags:alloc_hint={220, 30, 105, 40}
hctx0/sched_tags:wake_batch=8
hctx0/sched_tags:wake_index=0
hctx0/sched_tags:ws_active=0
hctx0/sched_tags:ws={
hctx0/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx0/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx0/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx0/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx0/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx0/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx0/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx0/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx0/sched_tags:}
hctx0/sched_tags:round_robin=0
hctx0/sched_tags:min_shallow_depth=192
hctx0/queued:700
hctx0/dispatched:       0   0
hctx0/dispatched:       1   617
hctx0/dispatched:       2   0
hctx0/dispatched:       4   0
hctx0/dispatched:       8   0
hctx0/dispatched:      16   0
hctx0/dispatched:      32+  0
hctx0/run:670
hctx0/dispatch_busy:0
hctx0/type:default
hctx0/active:0
hctx1/state:
hctx1/flags:alloc_policy=FIFO SHOULD_MERGE
hctx1/ctx_map:00000000: 00
hctx1/tags:nr_tags=1023
hctx1/tags:nr_reserved_tags=0
hctx1/tags:active_queues=0
hctx1/tags:
hctx1/tags:bitmap_tags:
hctx1/tags:depth=1023
hctx1/tags:busy=0
hctx1/tags:cleared=29
hctx1/tags:bits_per_word=64
hctx1/tags:map_nr=16
hctx1/tags:alloc_hint={865, 860, 824, 140}
hctx1/tags:wake_batch=8
hctx1/tags:wake_index=0
hctx1/tags:ws_active=0
hctx1/tags:ws={
hctx1/tags: {.wait_cnt=8, .wait=inactive},
hctx1/tags: {.wait_cnt=8, .wait=inactive},
hctx1/tags: {.wait_cnt=8, .wait=inactive},
hctx1/tags: {.wait_cnt=8, .wait=inactive},
hctx1/tags: {.wait_cnt=8, .wait=inactive},
hctx1/tags: {.wait_cnt=8, .wait=inactive},
hctx1/tags: {.wait_cnt=8, .wait=inactive},
hctx1/tags: {.wait_cnt=8, .wait=inactive},
hctx1/tags:}
hctx1/tags:round_robin=0
hctx1/tags:min_shallow_depth=4294967295
hctx1/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
hctx1/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
hctx1/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
hctx1/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
hctx1/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
hctx1/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
hctx1/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
hctx1/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
hctx1/io_poll:considered=0
hctx1/io_poll:invoked=0
hctx1/io_poll:success=0
hctx1/sched_tags:nr_tags=256
hctx1/sched_tags:nr_reserved_tags=0
hctx1/sched_tags:active_queues=0
hctx1/sched_tags:
hctx1/sched_tags:bitmap_tags:
hctx1/sched_tags:depth=256
hctx1/sched_tags:busy=0
hctx1/sched_tags:cleared=29
hctx1/sched_tags:bits_per_word=64
hctx1/sched_tags:map_nr=4
hctx1/sched_tags:alloc_hint={25, 94, 77, 111}
hctx1/sched_tags:wake_batch=8
hctx1/sched_tags:wake_index=0
hctx1/sched_tags:ws_active=0
hctx1/sched_tags:ws={
hctx1/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx1/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx1/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx1/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx1/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx1/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx1/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx1/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx1/sched_tags:}
hctx1/sched_tags:round_robin=0
hctx1/sched_tags:min_shallow_depth=192
hctx1/run:3949
hctx1/active:0
hctx1/queued:3691
hctx1/dispatch_busy:0
hctx1/dispatched:       0   0
hctx1/dispatched:       1   3678
hctx1/dispatched:       2   0
hctx1/dispatched:       4   0
hctx1/dispatched:       8   0
hctx1/dispatched:      16   0
hctx1/dispatched:      32+  0
hctx1/type:default
hctx1/cpu1/dispatched:3690 1
hctx2/queued:4312
hctx2/state:
hctx2/flags:alloc_policy=FIFO SHOULD_MERGE
hctx2/dispatched:       0   0
hctx2/dispatched:       1   4296
hctx2/dispatched:       2   2
hctx2/dispatched:       4   0
hctx2/dispatched:       8   1
hctx2/dispatched:      16   0
hctx2/dispatched:      32+  0
hctx2/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
hctx2/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
hctx2/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
hctx2/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
hctx2/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
hctx2/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
hctx2/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
hctx2/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
hctx2/ctx_map:00000000: 00
hctx2/tags:nr_tags=1023
hctx2/tags:nr_reserved_tags=0
hctx2/tags:active_queues=0
hctx2/tags:
hctx2/tags:bitmap_tags:
hctx2/tags:depth=1023
hctx2/tags:busy=0
hctx2/tags:cleared=22
hctx2/tags:bits_per_word=64
hctx2/tags:map_nr=16
hctx2/tags:alloc_hint={487, 117, 149, 543}
hctx2/tags:wake_batch=8
hctx2/tags:wake_index=0
hctx2/tags:ws_active=0
hctx2/tags:ws={
hctx2/tags: {.wait_cnt=8, .wait=inactive},
hctx2/tags: {.wait_cnt=8, .wait=inactive},
hctx2/tags: {.wait_cnt=8, .wait=inactive},
hctx2/tags: {.wait_cnt=8, .wait=inactive},
hctx2/tags: {.wait_cnt=8, .wait=inactive},
hctx2/tags: {.wait_cnt=8, .wait=inactive},
hctx2/tags: {.wait_cnt=8, .wait=inactive},
hctx2/tags: {.wait_cnt=8, .wait=inactive},
hctx2/tags:}
hctx2/tags:round_robin=0
hctx2/tags:min_shallow_depth=4294967295
hctx2/run:4655
hctx2/active:0
hctx2/sched_tags:nr_tags=256
hctx2/sched_tags:nr_reserved_tags=0
hctx2/sched_tags:active_queues=0
hctx2/sched_tags:
hctx2/sched_tags:bitmap_tags:
hctx2/sched_tags:depth=256
hctx2/sched_tags:busy=0
hctx2/sched_tags:cleared=57
hctx2/sched_tags:bits_per_word=64
hctx2/sched_tags:map_nr=4
hctx2/sched_tags:alloc_hint={204, 18, 84, 249}
hctx2/sched_tags:wake_batch=8
hctx2/sched_tags:wake_index=0
hctx2/sched_tags:ws_active=0
hctx2/sched_tags:ws={
hctx2/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx2/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx2/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx2/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx2/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx2/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx2/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx2/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx2/sched_tags:}
hctx2/sched_tags:round_robin=0
hctx2/sched_tags:min_shallow_depth=192
hctx2/io_poll:considered=0
hctx2/io_poll:invoked=0
hctx2/io_poll:success=0
hctx2/type:default
hctx2/sched_tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
hctx2/sched_tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
hctx2/dispatch_busy:0
hctx1/cpu1/merged:0
hctx1/cpu1/completed:3690 1
hctx2/cpu2/dispatched:4295 17
hctx2/cpu2/completed:4295 17
hctx2/cpu2/merged:0
hctx3/active:0
hctx3/dispatched:       0   0
hctx3/dispatched:       1   2660
hctx3/dispatched:       2   0
hctx3/dispatched:       4   0
hctx3/dispatched:       8   0
hctx3/dispatched:      16   0
hctx3/dispatched:      32+  0
sched/owned_by_driver:0 0 0
hctx3/sched_tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
hctx3/sched_tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
hctx3/state:
sched/batching:1
hctx3/ctx_map:00000000: 00
hctx3/flags:alloc_policy=FIFO SHOULD_MERGE
sched/starved:0
hctx3/tags:nr_tags=1023
hctx3/tags:nr_reserved_tags=0
hctx3/tags:active_queues=0
hctx3/tags:
hctx3/tags:bitmap_tags:
hctx3/tags:depth=1023
hctx3/tags:busy=0
hctx3/tags:cleared=16
hctx3/tags:bits_per_word=64
hctx3/tags:map_nr=16
hctx3/tags:alloc_hint={22, 494, 538, 271}
hctx3/tags:wake_batch=8
hctx3/tags:wake_index=0
hctx3/tags:ws_active=0
hctx3/tags:ws={
hctx3/tags: {.wait_cnt=8, .wait=inactive},
hctx3/tags: {.wait_cnt=8, .wait=inactive},
hctx3/tags: {.wait_cnt=8, .wait=inactive},
hctx3/tags: {.wait_cnt=8, .wait=inactive},
hctx3/tags: {.wait_cnt=8, .wait=inactive},
hctx3/tags: {.wait_cnt=8, .wait=inactive},
hctx3/tags: {.wait_cnt=8, .wait=inactive},
hctx3/tags: {.wait_cnt=8, .wait=inactive},
hctx3/tags:}
hctx3/tags:round_robin=0
hctx3/tags:min_shallow_depth=4294967295
sched/async_depth:192
hctx3/tags_bitmap:00000000: 0000 0000 0000 0000 0000 0000 0000 0000
hctx3/tags_bitmap:00000010: 0000 0000 0000 0000 0000 0000 0000 0000
hctx3/tags_bitmap:00000020: 0000 0000 0000 0000 0000 0000 0000 0000
hctx3/tags_bitmap:00000030: 0000 0000 0000 0000 0000 0000 0000 0000
hctx3/tags_bitmap:00000040: 0000 0000 0000 0000 0000 0000 0000 0000
hctx3/tags_bitmap:00000050: 0000 0000 0000 0000 0000 0000 0000 0000
hctx3/tags_bitmap:00000060: 0000 0000 0000 0000 0000 0000 0000 0000
hctx3/tags_bitmap:00000070: 0000 0000 0000 0000 0000 0000 0000 0000
sched/queued:0 0 0
hctx3/io_poll:considered=0
hctx3/io_poll:invoked=0
hctx3/io_poll:success=0
hctx3/queued:2660
hctx3/sched_tags:nr_tags=256
hctx3/sched_tags:nr_reserved_tags=0
hctx3/sched_tags:active_queues=0
hctx3/sched_tags:
hctx3/sched_tags:bitmap_tags:
hctx3/sched_tags:depth=256
hctx3/sched_tags:busy=0
hctx3/sched_tags:cleared=16
hctx3/sched_tags:bits_per_word=64
hctx3/sched_tags:map_nr=4
hctx3/sched_tags:alloc_hint={104, 142, 94, 177}
hctx3/sched_tags:wake_batch=8
hctx3/sched_tags:wake_index=0
hctx3/sched_tags:ws_active=0
hctx3/sched_tags:ws={
hctx3/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx3/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx3/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx3/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx3/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx3/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx3/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx3/sched_tags:   {.wait_cnt=8, .wait=inactive},
hctx3/sched_tags:}
hctx3/sched_tags:round_robin=0
hctx3/sched_tags:min_shallow_depth=192
hctx3/dispatch_busy:0
rqos/wbt/curr_win_nsec:100000000
rqos/wbt/id:0
hctx3/run:2874
rqos/wbt/min_lat_nsec:2000000
rqos/wbt/enabled:1
rqos/wbt/wb_normal:8
rqos/wbt/inflight:0: inflight 0
rqos/wbt/inflight:1: inflight 0
rqos/wbt/inflight:2: inflight 0
rqos/wbt/unknown_cnt:0
hctx3/type:default
rqos/wbt/wb_background:4
hctx3/cpu3/merged:0
hctx3/cpu3/dispatched:2655 5
hctx3/cpu3/completed:2655 5
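
For what it's worth, every hctx in the dump above shows tags:busy=0 and all wbt inflight counters are 0, i.e. the block layer has nothing outstanding, which points away from a request stuck in the driver. A small helper of my own (not part of any tool) to scan such a dump for nonzero busy/inflight counters:

```shell
#!/bin/sh
# count_stuck: read a block debugfs dump (as produced by
# `rg "" /sys/kernel/debug/block/<dev>/`) on stdin and print how many
# tags:busy= and wbt inflight counters are nonzero. 0 means the block
# layer reports no outstanding requests.
count_stuck() {
  awk '
    /tags:busy=/ { v = $0; sub(/.*busy=/, "", v); if (v + 0 != 0) stuck++ }
    / inflight / { if ($NF + 0 != 0) stuck++ }
    END { print stuck + 0 }
  '
}
```

Matching on the substring means both the hardware tags and the scheduler tags (sched_tags:busy=) are covered.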
MatiasBjorling commented 2 years ago

@morbidrsa @naota @damien-lemoal

Do you know if ZNS should work now? Which branch should Dennis test?

morbidrsa commented 2 years ago

@MatiasBjorling @MaisenbacherD I think it is the same problem @kawasaki is hitting on null_blk with xfstests' generic/127, so it's not ZNS specific but something different.

It looks like page writeback is stuck somewhere (or the writeback bit is not getting cleared).

morbidrsa commented 2 years ago

@MaisenbacherD I think I do have a temporary workaround for you:

echo 0 > /sys/fs/btrfs/$FSID/bg_reclaim_threshold

Hope it helps, fingers crossed
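
For reference, $FSID here is the btrfs filesystem UUID. One way to apply the workaround for a given mount point (a sketch; assumes util-linux findmnt is available, sysfs path as in current kernels):

```shell
# Disable the zoned background-reclaim threshold for the filesystem
# mounted at $MNT. This is a temporary workaround only; revert it
# (or remount) once a proper fix lands.
MNT=/mnt/btrfs
FSID=$(findmnt -no UUID "$MNT")
echo 0 | sudo tee "/sys/fs/btrfs/$FSID/bg_reclaim_threshold"
```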

morbidrsa commented 2 years ago

Or even better, can you try with this patch:

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index c917867a4261..9cdef5e8f6b7 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1915,6 +1915,7 @@ int btrfs_zone_finish(struct btrfs_block_group *block_group)
        block_group->alloc_offset = block_group->zone_capacity;
        block_group->free_space_ctl->free_space = 0;
        btrfs_clear_treelog_bg(block_group);
+       btrfs_clear_data_reloc_bg(block_group);
        spin_unlock(&block_group->lock);

        ret = blkdev_zone_mgmt(device->bdev, REQ_OP_ZONE_FINISH,
@@ -1997,6 +1998,7 @@ void btrfs_zone_finish_endio(struct btrfs_fs_info *fs_info, u64 logical, u64 len
        ASSERT(block_group->alloc_offset == block_group->zone_capacity);
        ASSERT(block_group->free_space_ctl->free_space == 0);
        btrfs_clear_treelog_bg(block_group);
+       btrfs_clear_data_reloc_bg(block_group);
        spin_unlock(&block_group->lock);

        map = block_group->physical_map;
morbidrsa commented 2 years ago

This hunk is also needed:

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a79001caffae..212c94c9ce97 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3798,9 +3798,8 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
         */
        spin_lock(&fs_info->relocation_bg_lock);
        data_reloc_bytenr = fs_info->data_reloc_bg;
-       if (data_reloc_bytenr &&
-           ((ffe_ctl->for_data_reloc && bytenr != data_reloc_bytenr) ||
-            (!ffe_ctl->for_data_reloc && bytenr == data_reloc_bytenr)))
+       if ((ffe_ctl->for_data_reloc && bytenr != data_reloc_bytenr) ||
+           (!ffe_ctl->for_data_reloc && bytenr == data_reloc_bytenr))
                skip = true;
        spin_unlock(&fs_info->relocation_bg_lock);
        if (skip)

Full verification with fstests is still ongoing on my side.

MaisenbacherD commented 2 years ago

@morbidrsa Thank you for the suggestions! I applied your two patches on top of https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next at commit 279373dee83ee91b65a30c10792ab8c96114a201.

The patches (as well as setting bg_reclaim_threshold to 0) do not fix the issue. I am getting a hung_task with a slightly different trace:

[  726.328648] INFO: task rocksdb:high0:11085 blocked for more than 241 seconds.
[  726.329839]       Not tainted 5.16.0-rc1+ #1
[  726.330484] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  726.331603] task:rocksdb:high0   state:D stack:    0 pid:11085 ppid: 11082 flags:0x00000000
[  726.331608] Call Trace:
[  726.331611]  <TASK>
[  726.331614]  __schedule+0x2e5/0x9d0
[  726.331622]  schedule+0x58/0xd0
[  726.331626]  io_schedule+0x3f/0x70
[  726.331629]  __folio_lock+0x125/0x200
[  726.331634]  ? find_get_entries+0x1bc/0x240
[  726.331638]  ? filemap_invalidate_unlock_two+0x40/0x40
[  726.331642]  truncate_inode_pages_range+0x5b2/0x770
[  726.331649]  truncate_inode_pages_final+0x44/0x50
[  726.331653]  btrfs_evict_inode+0x67/0x480
[  726.331658]  evict+0xd0/0x180
[  726.331661]  iput+0x13f/0x200
[  726.331664]  do_unlinkat+0x1c0/0x2b0
[  726.331668]  __x64_sys_unlink+0x23/0x30
[  726.331670]  do_syscall_64+0x3b/0xc0
[  726.331674]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  726.331677] RIP: 0033:0x7fb9490a171b
[  726.331681] RSP: 002b:00007fb943ffac68 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
[  726.331684] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb9490a171b
[  726.331686] RDX: 00007fb943ffb040 RSI: 000055a6bbe6ec20 RDI: 00007fb94400d300
[  726.331687] RBP: 00007fb943ffad00 R08: 0000000000000000 R09: 0000000000000000
[  726.331688] R10: 0000000000000031 R11: 0000000000000246 R12: 00007fb943ffb000
[  726.331690] R13: 00007fb943ffb040 R14: 0000000000000000 R15: 00007fb943ffd260
[  726.331693]  </TASK>
damien-lemoal commented 2 years ago

Before the introduction of folios, the hang was on wait_on_page_bit_common; now it is on __folio_lock. This probably means it is the same deadlock pattern: one of the pages to be evicted is marked for writeback (locked) but never unlocked. Is this a race with rebalance? The extents of the file being deleted are moved to another zone, but the extent pages are never unlocked once the write finishes? That pattern can only happen if some of the file's extents have not yet been written out but are still being looked at by rebalance... Or is rebalance ignoring unwritten extents with invalid pages, but not unlocking the pages? Some race of this sort seems the likely root cause.

morbidrsa commented 2 years ago

Balance just uses the normal buffered I/O path (it literally calls into readpages and writepages), so no extra deadlock there. And since Dennis deactivated auto-reclaim, I suspect it is not the culprit this time.

But I have no idea either.

morbidrsa commented 2 years ago

Can you check current Linus' master with https://lore.kernel.org/linux-btrfs/20211213034338.949507-1-naohiro.aota@wdc.com/ applied?

MaisenbacherD commented 2 years ago

I have checked out Linus' master at a7904a538933c525096ca2ccde1e60d0ee62c08e and applied Naohiro's patch. I can't tell whether the hung_task is resolved, because I am now getting a "No space left on device".

I did a blkzone report and it looks like the maximum number of open zones has been reached.

I also get a "No space left on device" with a simple fio workload on a 4 TB ZNS drive with Btrfs:

sudo fio --filename=/mnt/btrfs/testrun/testfile --size=800G --direct=1 --rw=randwrite --bs=64K --iodepth=1 --verify=crc32c --name=test

However, after doing a blkzone report, there are just 5 zones implicitly open; 152 zones are full and 3533 zones are empty.
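
Zone-state counts like these can be tallied straight from the report; a sketch (my own awk helper, assuming the util-linux blkzone report format with its zcond: fields, where 1=empty, 2/3=implicitly/explicitly open, 14=full):

```shell
#!/bin/sh
# zone_summary: read `blkzone report /dev/<dev>` output on stdin and
# count zones by condition code.
zone_summary() {
  awk -F'zcond:' '
    /zcond:/ { split($2, a, "("); c[a[1] + 0]++ }
    END { printf "empty=%d open=%d full=%d\n", c[1], c[2] + c[3], c[14] }
  '
}
```

Comparing the open count against the drive's max_open_zones limit (reported by e.g. nvme zns id-ns) shows whether the limit is actually being hit.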

I have not had time to investigate deeper than that yet. This could also be a discussion for another thread.