ZFS io throttling (blkio cgroup or smth)

mechmind commented 10 years ago

We are using lxc and zfs in our dev/qa environment farm and it works perfectly until cron kicks up some io-heavy (read and write) tasks in multiple containers. So we decided to throttle containers io using blkio cgroup capabilities. But something wrong with numbers from cgroup stats:

# cat blkio.io_serviced
8:0 Read 58971
8:0 Write 0
8:0 Sync 0
8:0 Async 58971
8:0 Total 58971
8:16 Read 69965
8:16 Write 0
8:16 Sync 0
8:16 Async 69965
8:16 Total 69965
Total 128936

# cat blkio.io_service_time
8:0 Read 217698281338
8:0 Write 0
8:0 Sync 0
8:0 Async 217698281338
8:0 Total 217698281338
8:16 Read 244414049344
8:16 Write 0
8:16 Sync 0
8:16 Async 244414049344
8:16 Total 244414049344
Total 462112330682

# cat blkio.io_service_bytes
8:0 Read 1615888384
8:0 Write 0
8:0 Sync 0
8:0 Async 1615888384
8:0 Total 1615888384
8:16 Read 1653839872
8:16 Write 0
8:16 Sync 0
8:16 Async 1653839872
8:16 Total 1653839872
Total 3269728256

So there are only async read requests and no writes at all. Same stats for blkio.throttle.io_*. Does it mean that zfs just bypass blkio qos subsystem? Is there any way to use native linux io throttling with zfs?

On other hand, in Solaris such limits are implemented using Solaris Zones. If there any way to emulate that behavior, it also can be a solution.

behlendorf commented 10 years ago

This might be addressed in master. Commit e3dc14b86182a82d99faaa5979846750d937160e added the required code to fix per-process I/O accounting. We haven't had a chance to seriously investigate integrating cgroups but this might be enough to get things working. In theory we shouldn't have to do much on the zfs to get this working.

mechmind commented 10 years ago

I tried patch from e3dc14b86182a82d99faaa5979846750d937160e and now i can see process stats in /proc/pid/io, but cgroup numbers remain the same.

Further investigation lead me into kernel sources and lkml archives. There are two mechanisms for io throttling - one for buffered reads/writes (relies on pagecache) and another for direct io (implemented somewhere around actual writes to device).

Since zfs does not use linux pagecache (correct me if i'm wrong), that option can't be used as is. On other hand, accounting on block device level does not make any sense - zfs has its own "page cache", raid, arc, l2arc, even dedicated kernel threads for io (correct me twice if i'm wrong) etc.

So far i have no idea how to properly implement cgroup integration. Any points?

behlendorf commented 10 years ago

You're right on both points. We have plans to more tightly integrate the page cache and ARC which I believe will resolve this but until that happens it sounds like I/O limiting cant be done through cgroups.

seletskiy commented 10 years ago

Is there any progress on this issue?

bdkoepke commented 9 years ago

What about this answer: https://askubuntu.com/questions/577098/cgroups-disk-io-throttling-with-zfs?

As per "Whole Disks vs Partitions" here: http://open-zfs.org/wiki/Performance_tuning#Compression, apparently zfs will set the Linux disk scheduler to noop since zfs has it's own scheduler. By enabling the cfq Linux scheduler you end up with redundant scheduling but blkio should work?

Is this assumption correct? I'm actually planning on migrating from FreeBSD for this exact feature (some jails sporadically use excess IO and that causes contention with the rest of the jails...).

@behlendorf in particular...

behlendorf commented 9 years ago

This should work in the latest ZFS release 0.6.5.2 but I personally don't use containers so there may be some caveats. @DeHackEd may have some real world insight here.

DeHackEd commented 9 years ago

iotop and blkio have no relation to each other. For example IO from a cifs (Samba) mount will show the process doing IO waiting (the percentage column) and the kernel thread doing "Disk read" when clearly blkio isn't involved at all.

Using 'cfq' allows the weighting functionality of the blkio scheduler to distribute IO to a disk in much the same way you can use cpu.shares for CPU usage, or just using "nice" but a whole control group is prioritized as a whole unit rather than per-process. This is what is encouraged by the askubuntu thread. Using other schedulers just means the weight parameter is non-functional. Other blkio parameters allow simple bytes/second or IOPS to be throttled just like you would rate-limit on a router's interface and operate regardless, so that option remains. The statistics you pasted show what these throttles could operate on.

The issue comes down to the fact that the actual write IO is done by the z_wr_iss kernel threads and not by any user processes. Scrubs, read-aheads and maybe other I/O are similarly handled by kernel threads. Having been forked from [kthreadd] they'll almost certainly be in the root control group. (They can be moved, and it's one reason I use spl_taskq_thread_dynamic=0). The ZIL might be an exception here.

Does that make sense?

seletskiy commented 9 years ago

@behlendorf: Will it make sense, if we will propose a patchset, that will introduce IO accounting PIDwise via netlink kernel facility? We're done with prototype for our own local company use (we're in urge of accounting and throttling IO right now in our container infrastructure). The main idea is to send into netlink socket, which can be accessed from user-space, information about read and writes which are made from various PIDs.

If it's suitable for inclusion into upstream, we're glad to prepare PR.

behlendorf commented 8 years ago

@seletskiy is the existing /proc/pid/io interface not working or otherwise unsuitable for this purpose? Before we invent something new let's determine why exactly we can't take advantage of the existing interfaces for this. But if that fails yes we can definitely look at pulling in the netlink solution.

seletskiy commented 8 years ago

@behlendorf: It works, but the main problem with this solution, that if you want to grab all PIDs and sort them by the caused IO load, you need to walk and do thousands open/read/close cycles for virtual /proc/pid/io files. inotify will not help.

Consider our typical case: host with 2k processes and 3k threads running across containers, and we need to collect IO load caused by that processes/threads as fast as possible for freezing processes which are making troubles. Given that, naïve solution which is prowling through /proc/pid/io interface and reading IO statistics causing 1 CPU core being loaded for 200ms out of every second. It's not good solution in any way, because 3k threads is not computation limit for given host.

behlendorf commented 8 years ago

@seletskiy I see, so it's just too heavy weight a solution. Why don't you open a pull request when your ready with the proposed netlink solution and we can look it over and see if it's something we can reasonably include.

ghost commented 8 years ago

@behlendorf, @seletskiy https://github.com/zfsonlinux/zfs/pull/4684

fcicq commented 6 years ago

Anyone still interested in this issue?

it is possible to port the scheduler by joyent which use zoneid to use blkcg instead, but it will become a one layer scheduler if succeed.

idea: in order to save the state in blkcg, use blkcg_policy_register() and activate it. this is available since kernel 4.2 at https://github.com/torvalds/linux/commit/e48453c386f39ca9ea29e9df6efef78f56746af0 , 4.3 fixed something else, so I think minimum should be 4.4. Note that BLKCG_MAX_POLS is currently 3, which allows cfq, bfq, blk-throttle to run concurrently, if it is 2 or less(before the bfq inclusion), recompiling kernel to increase this value (to 3 or 4) will be required.

use blkcg_to_cpd? to get the data pointer.

the state in current zfs_zone.c is zone_persist_t, the definition can be found in sys/zone.h. to get the blkcg I guess it is css_to_blkcg(task_css(current, io_cgrp_id)). (aka the later removed task_blkcg(), io_cgrpid is the same as blkio_cgrp_id in older version?) this can be used for zfs_zone_zio_init() as the zoneid.

https://github.com/joyent/illumos-joyent/blob/master/usr/src/uts/common/fs/zfs/zfs_zone.c https://github.com/joyent/illumos-joyent/blob/master/usr/src/uts/common/sys/zone.h

finally, use blkcg policy to change zfs-io-priority, it will be similar with cfq weight.

add: zone_walk will become list_for_each_entry(blkcg, &all_blkcgs, all_blkcgs_node) {}, sadly all_blkcgs is not an exported symbol. using blkcg_root should be ok, but how to traverse?

fcicq commented 6 years ago

Ah. the situation is harder than I thought. I dont know where to mark the request.

for example, there is wbc in the zfs_writepage(). if (wbc->wb), we should save wbc->wb->blkcg_css to somewhere... and it must be handled to zio_create. is that possible?

stale[bot] commented 4 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

openzfs / zfs

ZFS io throttling (blkio cgroup or smth) #1952