naota / linux

Linux kernel source tree

bandwidth degradation on sequential write on a file #62

Open naota opened 2 years ago

naota commented 2 years ago

The bandwidth decreases while running the following fio command.

fio --filename=${MNT}/testfile --direct=1 \
        --rw=write --bs=256k \
        --ioengine=libaio --iodepth=1 \
        --fallocate=none \
        --write_bw_log=bw --write_lat_log=lat --write_iops_log=iops \
        --log_avg_msec=1000 \
        --numjobs=1 --group_reporting --name=fio-seq-write \
        --size=400GiB

At first, it's around 910 MiB/s, but in the end, it decreases to 420 MiB/s.

naota commented 2 years ago

The patch below improves the final bandwidth to 860 MiB/s. However, it modifies find_free_extent() itself, which is too intrusive for the regular (non-zoned) allocator. The check should be contained in do_allocation_zoned() instead.

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6aa92f84f465..a49196fc755a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4319,7 +4319,8 @@ static noinline int find_free_extent(struct btrfs_root *root,
        struct btrfs_block_group *bg_ret;

        /* If the block group is read-only, we can skip it entirely. */
-       if (unlikely(block_group->ro)) {
+       if (unlikely(block_group->ro) ||
+           block_group->alloc_offset == block_group->zone_capacity) {
            if (ffe_ctl->for_treelog)
                btrfs_clear_treelog_bg(block_group);
            if (ffe_ctl->for_data_reloc)

naota commented 2 years ago

So, the potential fix is like this.

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6aa92f84f465..1c566f31ff89 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3774,6 +3774,14 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,

    ASSERT(btrfs_is_zoned(block_group->fs_info));

+   if (block_group->alloc_offset == block_group->zone_capacity) {
+       if (ffe_ctl->for_treelog)
+           btrfs_clear_treelog_bg(block_group);
+       if (ffe_ctl->for_data_reloc)
+           btrfs_clear_data_reloc_bg(block_group);
+       return 1;
+   }
+
    /*
     * Do not allow non-tree-log blocks in the dedicated tree-log block
     * group, and vice versa.

However, the fact that this patch helps at all means that the given hint_bytes is not leading us to a usable (non-full) block group.

naota commented 2 years ago

The hint for a file extent is set from here.

https://github.com/kdave/btrfs-devel/blob/master/fs/btrfs/inode.c#L1077-L1088

When writing to a non-preallocated file, the hint is set to the logical address of the beginning of the file. When the file is huge, the block group that hint points to is far away from the nearest non-full block group.

As a result, find_free_extent() needs to iterate over the filled BGs to reach a non-full BG from which it can allocate an extent. That also causes the performance degradation.
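
To illustrate the scan cost described above, here is a minimal user-space sketch. It is not the btrfs code: `struct toy_bg`, `toy_bg_index_for_hint()` and `toy_alloc()` are hypothetical names, block groups are assumed to be laid out back to back with their length equal to `zone_capacity`, and the search is a plain linear walk starting at the block group the hint points to. It only shows that a hint pinned at the start of the file makes each allocation visit every already-full block group first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct btrfs_block_group. */
struct toy_bg {
	uint64_t start;         /* logical start address of the block group */
	uint64_t alloc_offset;  /* bytes already allocated (write pointer) */
	uint64_t zone_capacity; /* usable bytes; also the BG length in this toy */
};

/* Index of the block group containing logical address `hint`, or 0 if none. */
static size_t toy_bg_index_for_hint(const struct toy_bg *bgs, size_t nr,
				    uint64_t hint)
{
	for (size_t i = 0; i < nr; i++) {
		if (hint >= bgs[i].start &&
		    hint < bgs[i].start + bgs[i].zone_capacity)
			return i;
	}
	return 0;
}

/*
 * Allocate `len` bytes from the first block group with enough room,
 * starting the walk at the block group selected by `hint`.
 * Returns the number of block groups visited (including the one that
 * satisfied the request), or 0 if no block group has room.
 */
static size_t toy_alloc(struct toy_bg *bgs, size_t nr, uint64_t hint,
			uint64_t len, uint64_t *out_addr)
{
	assert(bgs != NULL && nr > 0);

	size_t first = toy_bg_index_for_hint(bgs, nr, hint);

	for (size_t n = 0; n < nr; n++) {
		struct toy_bg *bg = &bgs[(first + n) % nr];

		/* Full (or too full) block group: skip it, but it still costs a visit. */
		if (bg->zone_capacity - bg->alloc_offset < len)
			continue;

		*out_addr = bg->start + bg->alloc_offset;
		bg->alloc_offset += len;
		return n + 1;
	}
	return 0;
}
```

Calling `toy_alloc()` repeatedly with `hint` fixed at the address of the first extent makes the return value grow by one each time a block group fills up, whereas passing the address returned by the previous call as the next hint keeps it at 1 until that block group is full. Whether a similar hint update is the right fix for the real allocator is a separate question that this sketch does not answer.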