openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

feature: reporting bdi congested state #7222

Open fcicq opened 6 years ago

fcicq commented 6 years ago

System information

Type                  Version/Name
Distribution Name     *
Distribution Version  *
Linux Kernel          4.2+
Architecture          *
ZFS Version           *
SPL Version           *

Describe the problem you're observing

Linux kernel 4.2 introduced CONFIG_CGROUP_WRITEBACK; the new BDI_CAP_CGROUP_WRITEBACK capability flag can be seen in https://github.com/torvalds/linux/commit/89e9b9e07a390c50980d10aa37a04631db5a23ab

Setting sb->s_bdi->capabilities = BDI_CAP_CGROUP_WRITEBACK is missing from zfs_domount() in zfs_vfsops.c, after zpl_bdi_setup().

To match the functionality of the Linux bdi (backing_dev_info), it is clear from wb_congested() that we must either maintain the congested state ourselves via set_bdi_congested()/clear_bdi_congested() (with BLK_RW_SYNC/BLK_RW_ASYNC), or implement a congested_fn like dm/md do:

static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
{
    struct backing_dev_info *bdi = wb->bdi;

    if (bdi->congested_fn)
        return bdi->congested_fn(bdi->congested_data, cong_bits);
    return wb->congested->state & cong_bits;
}
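
For reference, a minimal sketch of the first alternative on a 4.x-era kernel: a small helper (hypothetical, not something in the tree) that the write path could call whenever its backlog state changes, flipping only the async bit the way nfs does.

#include <linux/fs.h>
#include <linux/backing-dev.h>

/*
 * Sketch only, assuming a 4.x kernel where set_bdi_congested() and
 * clear_bdi_congested() are still available.  The caller decides what
 * "backlogged" means; this helper just keeps the bdi state in sync.
 */
static void zfs_update_bdi_congestion(struct super_block *sb, bool backlogged)
{
        struct backing_dev_info *bdi = sb->s_bdi;

        if (backlogged)
                set_bdi_congested(bdi, BLK_RW_ASYNC);
        else
                clear_bdi_congested(bdi, BLK_RW_ASYNC);
}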

One possible method would be to compare the queue depth with vdev_queue_class_max_active() for ZIO_PRIORITY_SYNC_WRITE and ZIO_PRIORITY_ASYNC_WRITE. So I think vdev_queue_class_to_issue() could do the bit flips for me: if it returns ZIO_PRIORITY_NUM_QUEUEABLE, mark the bdi as congested. Could some io_done path (when p == ZIO_PRIORITY_ASYNC_WRITE) undo it? Only supporting BLK_RW_ASYNC would also be acceptable (like nfs does).
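
Roughly, as a sketch only (vdev_queue_write_congested() is a hypothetical helper, not an actual patch; it assumes the vq_class/vqc_active layout and the vdev_queue_class_max_active(spa, priority) signature as they existed at the time):

/*
 * Sketch only, would live in vdev_queue.c: treat the queue as congested
 * when either write class has reached its max_active limit.  The caller
 * would then flip the bdi bits, e.g. via a helper like the
 * zfs_update_bdi_congestion() sketch above.
 */
static boolean_t vdev_queue_write_congested(vdev_queue_t *vq)
{
        spa_t *spa = vq->vq_vdev->vdev_spa;

        return (vq->vq_class[ZIO_PRIORITY_SYNC_WRITE].vqc_active >=
            vdev_queue_class_max_active(spa, ZIO_PRIORITY_SYNC_WRITE) ||
            vq->vq_class[ZIO_PRIORITY_ASYNC_WRITE].vqc_active >=
            vdev_queue_class_max_active(spa, ZIO_PRIORITY_ASYNC_WRITE));
}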

I'm also working on getting #1952 resolved, in case anyone is interested.

Describe how to reproduce the problem

N/A

Include any warning/errors/backtraces from the system logs

N/A

behlendorf commented 6 years ago

Using bdi->congested_fn looks like a nice way to implement this. But I don't think vdev_queue_class_to_issue() is exactly what you want, since it will only return ZIO_PRIORITY_NUM_QUEUEABLE when all of the queues are full. My suggestion would be to implement a new function in this layer which defines what congestion is. Thanks for working on this!

fcicq commented 6 years ago

@behlendorf Is this correct? For both sync/async, return txg_stalled(spa->spa_dsl_pool); (and spa would be zsb->z_os->os_spa?)

behlendorf commented 6 years ago

@fcicq I'd recommend dsl_pool_need_dirty_delay() instead for async/sync writes; it will kick in when the dmu starts to throttle new transactions because the dirty limit has been exceeded.

fcicq commented 6 years ago

@behlendorf thanks, that is a simplified check... so both checks should be preserved, even though txg_kick() will never be called on this path.

static int zfs_congested_fn(void *congested_data, int bdi_bits)
{
        spa_t *spa = congested_data;
        dsl_pool_t *dp = spa->spa_dsl_pool;

        /* Report congestion while the dmu is throttling dirty data or a txg is stalled. */
        return (dsl_pool_need_dirty_delay(dp) || txg_stalled(dp));
}

and in zfs_domount() do:
+       sb->s_bdi->congested_fn = zfs_congested_fn;
+       sb->s_bdi->congested_data = zsb->z_os->os_spa;
+       sb->s_bdi->capabilities = BDI_CAP_CGROUP_WRITEBACK;
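
Since congested_fn is not present on every kernel, a configure guard would probably be needed too; HAVE_BDI_CONGESTED_FN below is only a placeholder name for such a check, not an existing one:

        /* Placeholder compat guard; a real autoconf check would be needed. */
#if defined(HAVE_BDI_CONGESTED_FN)
        sb->s_bdi->congested_fn = zfs_congested_fn;
        sb->s_bdi->congested_data = zsb->z_os->os_spa;
        sb->s_bdi->capabilities = BDI_CAP_CGROUP_WRITEBACK;
#endif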

Setting sb->s_iflags |= SB_I_CGROUPWB; is for later (enabled for dev); that is another issue though. I need wbc_account_io(), but it is only exported via EXPORT_SYMBOL_GPL :(

Now this one is resolved, but I can't do a pull request at the moment... sorry.