scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0

io_queues: add max_band and max_iops configuration per io-queue #817

Closed vladzcloudius closed 2 years ago

vladzcloudius commented 4 years ago

HEAD: 6973080cd1045675af890f155b0fcd7308b4277c

Description

Our io-queue design heavily resembles the systemd.resource-control interface: https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html. However, it clearly provides only a few of systemd's many configuration knobs.

We probably don't need to implement everything at once but adding some when we need them (and we do!) definitely makes sense.

Our shares are parallel to IOWeight, and I suggest adding knobs parallel to IOXxxBandwidthMax and IOXxxIOPSMax.

This would allow limiting the corresponding io-queue to specific max bandwidth/IOPS values, thereby ensuring the required QoS for the other classes.
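
To make the proposal concrete, here is a minimal illustrative sketch of what such per-class knobs could look like. It is plain standard C++; the struct and field names are hypothetical, not existing Seastar identifiers.

```cpp
// Hypothetical sketch only: per-class limits parallel to systemd's
// IOReadBandwidthMax / IOReadIOPSMax, alongside the existing shares.
// None of these names are real Seastar identifiers.
#include <cstdint>
#include <optional>
#include <string>

struct io_class_limits {
    std::string name;                       // e.g. "compaction", "query"
    uint32_t shares;                        // existing proportional weight
    std::optional<uint64_t> max_bandwidth;  // bytes/sec; unset means unlimited
    std::optional<uint64_t> max_iops;       // requests/sec; unset means unlimited
};

int main() {
    // Example: cap the compaction class, leave the query class uncapped.
    io_class_limits compaction{"compaction", 100, 100ull * 1024 * 1024 /* ~100 MB/s */, 2000};
    io_class_limits query{"query", 1000, std::nullopt, std::nullopt};
    (void)compaction;
    (void)query;
    return 0;
}
```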

dorlaor commented 3 years ago

Max bandwidth per IO class shouldn't be hard to implement and makes sense. It can be a smaller fix until the new IO scheduler lands.

dorlaor commented 3 years ago

@xemul what do you think about this as an interim solution?

slivne commented 3 years ago

@asias how far is this suggestion from what you sent a few weeks ago?

xemul commented 3 years ago

@dorlaor, from a tech perspective, if there's a non-reclaimable or hard-to-reclaim resource, then limiting one or several of its consumers is equivalent to providing guarantees to them (or to the others). Sometimes limiting is the only way to provide such guarantees. As far as IO is concerned, this is exactly the case: IO is consumed in non-reclaimable portions (requests), so if we limit one (or all) IO groups, we provide guarantees to the others (or to everybody). Shares cannot provide this at all.

dorlaor commented 3 years ago

Not sure I understand; everything is related. If we limit compaction and do not limit requests, we'll blow up the storage. However, I think that Vlad knows that and wants some semi-manual control, since there are bugs or weird cases where we need to intervene immediately, and such control offers immediate help. Vlad, is this aligned with your proposal?

asias commented 3 years ago

@asias how far is this suggestion from what you sent a few weeks ago?

My patch supports limiting both max bandwidth and max IOPS. I think it is feature complete. There is one concern from Pavel that it will make cancellation of a request more complicated, and the cancellation work is still in progress. The limiter work has been stuck since Dec 9, 2020.

I think it is best not to make the bandwidth limiter depend on the cancellation work.

1) The limiter interface for the user will not change with/without the cancellation work.
2) The users of the limiter, e.g., streaming/repair, will not cancel requests at all. The classes for query IO that use cancellation support will run the limiter.
3) The request will be cancelled when it leaves the limiter module. This is called passive cancellation.
4) We can backport to older versions with fewer dependencies.

asias commented 3 years ago

Btw, on Dec 23, 2020, Benny told me that he and Shlomi wanted to decide whether we need the io limiter at all with the new scheduler.

dorlaor commented 3 years ago

Btw, on Dec 23, 2020, Benny told me that he and Shlomi wanted to decide whether we need the io limiter at all with the new scheduler.

In a perfect world we may not need it but realistically I think it should be a good safety measure to have, even when we have a better scheduler

xemul commented 3 years ago

@dorlaor, the initial motivation for the rate/bandwidth limiter was to handle the impact background IO (repair, compaction) has on the "real-time" IO (user requests). And my point was that if you compare two cases -- steady user IO vs. sporadic user IO -- the proportional scheduler cannot provide the same latencies in the latter case as it does in the former. We have to limit the background IO anyhow, and the question is whether the limitation knob should be user-visible and mean "MB/sec + IOPS/sec" or be fully internal to the io-queue code.
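
For illustration, here is a self-contained token-bucket sketch of how a "MB/sec + IOPS/sec" cap could gate a class's requests. It is plain standard C++ with made-up names, not the actual io-queue code; a real implementation would keep refused requests queued in their class rather than polling.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// One bucket per limited dimension (bytes and requests), refilled by wall-clock time.
class token_bucket {
    double _rate;   // tokens added per second
    double _burst;  // bucket capacity
    double _tokens; // currently available tokens
    std::chrono::steady_clock::time_point _last;
public:
    token_bucket(double rate, double burst)
        : _rate(rate), _burst(burst), _tokens(burst)
        , _last(std::chrono::steady_clock::now()) {}

    void refill() {
        auto now = std::chrono::steady_clock::now();
        std::chrono::duration<double> elapsed = now - _last;
        _last = now;
        _tokens = std::min(_burst, _tokens + elapsed.count() * _rate);
    }
    bool has(double cost) const { return _tokens >= cost; }
    void take(double cost) { _tokens -= cost; }
};

// Per-class limiter combining a bandwidth cap and an IOPS cap.
struct class_limiter {
    token_bucket bytes; // e.g. {100.0 * 1024 * 1024, 128 * 1024} for ~100 MB/s
    token_bucket iops;  // e.g. {2000.0, 64} for ~2000 requests/s

    // True if the request may be dispatched now; otherwise the caller keeps it
    // queued and retries later. Both buckets are checked before either is
    // charged, so a refusal does not leak tokens.
    bool may_dispatch(uint64_t request_bytes) {
        bytes.refill();
        iops.refill();
        if (!bytes.has(double(request_bytes)) || !iops.has(1.0)) {
            return false;
        }
        bytes.take(double(request_bytes));
        iops.take(1.0);
        return true;
    }
};
```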

@asias , the IO-cancellation patchset is ready for merging, btw.

vladzcloudius commented 3 years ago

Not sure I understand; everything is related.

@dorlaor I think @xemul and I are completely on the same page as to what the problem is and how it can possibly be resolved. And the point is that the I/O scheduler as it is today cannot possibly provide a solution when a bandwidth/latency guarantee is required for some I/O classes, because it only provides shares-based arbitration.

If we limit compaction and do not limit requests, we'll blow up the storage. However, I think that Vlad knows that and wants some semi-manual control, since there are bugs or weird cases where we need to intervene immediately, and such control offers immediate help. Vlad, is this aligned with your proposal?

Those cases are not that "weird" and/or rare. What we need is exactly the kind of guarantee for the query class that we can't possibly provide today.

The fact that NVMe is a low-latency and relatively low-concurrency device is the thing that "saves Scylla's day" today - otherwise we would be seeing the issue in question at every installation.

vladzcloudius commented 3 years ago

We have to limit the background IO anyhow, and the question is whether the limitation knob should be user-visible and mean "MB/sec + IOPS/sec" or be fully internal to the io-queue code.

@xemul I suggest we start with the user-visible version and work on some internal automatic io-queue algorithm in the meantime. Once the latter becomes stable and robust enough, we can always remove the user-visible knob.

xemul commented 3 years ago

@vladzcloudius , sure thing, but user-visible stuff is hard to hide back. However, if by "user-visible" you mean "not documented and only configurable by scylla code" then I concur.

asias commented 3 years ago

@vladzcloudius , sure thing, but user-visible stuff is hard to hide back. However, if by "user-visible" you mean "not documented and only configurable by scylla code" then I concur.

Mark the interface experimental, subject to change in the future.

vladzcloudius commented 3 years ago

@vladzcloudius , sure thing, but user-visible stuff is hard to hide back. However, if by "user-visible" you mean "not documented and only configurable by scylla code" then I concur.

@xemul Why is it hard? You simply declare that you don't support this parameter anymore and start ignoring its values. But you should, of course, provide a better alternative when you do that. Which requires our "automagical" stuff to be at least as good as the manual tuning for every possible use case.

xemul commented 3 years ago

@vladzcloudius , sure thing, but user-visible stuff is hard to hide back. However, if by "user-visible" you mean "not documented and only configurable by scylla code" then I concur.

@xemul Why is it hard? You simply declare that you don't support this parameter anymore and start ignoring its values.

... and then users start complaining that their old configuration doesn't behave the old way any longer?

But you should, of course, provide a better alternative when you do that. Which requires our "automagical" stuff to be at least as good as the manual tuning for every possible use case.

Yeah, but the original task is to "improve the impact compaction/repair has on the regular workload". Right now we're solving it with an IO-rate-limiter knob, but later we may come up with something else. And if someone uses the rate-limiter knob to literally rate-limit the IO, and we then stop supporting it, that someone will be surprised.

vladzcloudius commented 3 years ago

@vladzcloudius , sure thing, but user-visible stuff is hard to hide back. However, if by "user-visible" you mean "not documented and only configurable by scylla code" then I concur.

@xemul Why is it hard? You simply declare that you don't support this parameter anymore and start ignoring its values.

... and then users start complaining that their old configuration doesn't behave the old way any longer?

We will deal with that, as long as we have a better alternative. This has already happened before in the I/O scheduler domain (we ended support for the background_writer_scheduling_quota and auto_adjust_flush_quota parameters). That should be the least of your concerns... ;)

But you should, of course, provide a better alternative when you do that. Which requires our "automagical" stuff to be at least as good as the manual tuning for every possible use case.

Yeah, but the original task is to "improve the impact compaction/repair has on the regular workload". Right now we're solving it with an IO-rate-limiter knob, but later we may come up with something else. And if someone uses the rate-limiter knob to literally rate-limit the IO, and we then stop supporting it, that someone will be surprised.

As long as this is a good "surprise" everybody should be happy.

bhalevy commented 3 years ago

My preference would be to have a relative limit rather than an absolute one that needs to be configured per node.

Maybe we can simply use the priority class shares to define the ratio, maybe adding a flag to use it for limiting iops/throughput, or add optional shares for the limiter associated with the priority class.

vladzcloudius commented 3 years ago

My preference would be to have a relative limit rather than an absolute one that needs to be configured per node.

Maybe we can simply use the priority class shares to define the ratio, maybe adding a flag to use it for limiting iops/throughput, or add optional shares for the limiter associated with the priority class.

Our current implementation of the I/O scheduler is exactly a "relative limit" - we need an absolute limit, as has been described in detail above, @bhalevy.

bhalevy commented 3 years ago

Configuring the limiter using a ratio can be translated internally to an absolute value per node/shard and be updated dynamically based on the effective throughput/iops delivered by the disk.

What configuration mechanism do you propose? How about heterogeneous clusters where different nodes may have different I/O configuration?

vladzcloudius commented 3 years ago

Configuring the limiter using a ratio can be translated internally to an absolute value per node/shard and be updated dynamically based on the effective throughput/iops delivered by the disk.

This proposal is going to be exposed to a resonance effect if what you mean is to measure the effective disk performance in real time and re-adjust the effective max throughput of a class accordingly. If you refer to a value written in io_properties.yaml as the "effective throughput/iops delivered by the disk", then what you suggest is technically the same as what I'm suggesting.

What configuration mechanism do you propose?

My proposal is described in the opening message: have a per-node configurable that defines a maximum bandwidth per I/O class. Similar to what we have for configuring static shares for certain I/O classes today.

How about heterogeneous clusters where different nodes may have different I/O configuration?

bhalevy commented 3 years ago

My proposal is described in the opening message: have a per-node configurable that defines a maximum bandwidth per I/O class. Similar to what we have for configuring static shares for certain I/O classes today.

I think this may work smoothly and automatically when installing a new cluster but it's going to be a nightmare to change and to apply to newly installed nodes.

I think we need to take an example from the workload prioritisation work that @eliransin did and provide an easy way to centrally configure the limit, e.g. limit the maintenance priority class to 20% of the bandwidth or iops. Then each node/shard can apply this ratio to the static throughput/iops numbers it was loaded with (*).

(*) When we know how to do it dynamically while damping oscillations, we could potentially readjust it from time to time, but that's not required to begin with.
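
As a rough illustration of that translation, assuming made-up numbers of the kind recorded per node in io_properties.yaml (the values and names below are hypothetical):

```cpp
// Rough sketch (made-up values, not Scylla/Seastar API): turning a centrally
// configured ratio such as "maintenance <= 20%" into the absolute node-wide
// caps derived from the static numbers a node was loaded with.
#include <cstdint>
#include <iostream>

int main() {
    const uint64_t disk_write_bandwidth = 2'000'000'000; // bytes/s, measured once per node
    const uint64_t disk_write_iops = 500'000;            // requests/s, measured once per node
    const double maintenance_ratio = 0.20;               // centrally configured ratio

    const auto bw_cap = static_cast<uint64_t>(disk_write_bandwidth * maintenance_ratio);
    const auto iops_cap = static_cast<uint64_t>(disk_write_iops * maintenance_ratio);
    std::cout << "maintenance class caps: " << bw_cap << " bytes/s, "
              << iops_cap << " IOPS\n";
    return 0;
}
```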

vladzcloudius commented 3 years ago

My proposal is described in the opening message: have a per-node configurable that defines a maximum bandwidth per I/O class. Similar to what we have for configuring static shares for certain I/O classes today.

I think this may work smoothly and automatically when installing a new cluster but it's going to be a nightmare to change and to apply to newly installed nodes.

The same automation that installs the new cluster should (and does) deal with installing new nodes. So, that's not a problem.

I think we need to take an example from the workload prioritisation work that @eliransin did and provide an easy way to centrally configure the limit, e.g. limit the maintenance priority class to 20% of the bandwidth or iops. Then each node/shard can apply this ratio to the static throughput/iops numbers it was loaded with (*).

Using virtual tables (or the like) for modifying scylla parameters is an improvement that would be useful for any parameter currently configured via scylla.yaml. I suggest not mixing these things together and opening a separate GH issue about that.

(*) When we know how to do it dynamically while damping oscillations, we could potentially readjust it from time to time, but that's not required to begin with.

xemul commented 3 years ago

Should the limit be per-shard or node-wide (well, NUMA-node wide maybe)? @vladzcloudius , @bhalevy what do you think?

bhalevy commented 3 years ago

Should the limit be per-shard or node-wide (well, NUMA-node wide maybe)? @vladzcloudius , @bhalevy what do you think?

There are two questions. First, how is it defined? I think it's pretty clear that it should be defined for the whole node.

Second, how is it enforced? Ideally it would be good to enforce it in a distributed way across shards, but how hard is that, and would it hurt performance? For compaction and repair, the I/O demands should even out across the shards on average, but there could be pathological cases. For example, @nyh just mentioned, in an email thread with a user about how to deal with running out of space during compaction, that a workaround could be to not compact in parallel on all shards but rather to reduce the parallelism and serialize compactions across shards to reduce the temporary space requirements. We may want to do so automatically in the future, given a disk-space controller, and then the I/O demands won't be symmetrical.

The bottom line IMO is that if enforcing a global limit would be hard / risky, we can start simple and just divide the limit evenly between the shards.
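
A minimal sketch of that simple starting point: splitting a node-wide cap evenly among shards so each shard can enforce its slice locally. The helper and type names are hypothetical, plain C++.

```cpp
// Hypothetical helper: divide node-wide caps evenly among shards, mirroring
// how the current I/O scheduler splits disk capacity, so that each shard
// enforces its slice without cross-shard coordination.
#include <cstdint>
#include <vector>

struct shard_limit {
    uint64_t bandwidth; // bytes/s this shard may issue for the class
    uint64_t iops;      // requests/s this shard may issue for the class
};

std::vector<shard_limit> split_evenly(uint64_t node_bw, uint64_t node_iops, unsigned nr_shards) {
    std::vector<shard_limit> per_shard(nr_shards);
    for (auto& s : per_shard) {
        s.bandwidth = node_bw / nr_shards;
        s.iops = node_iops / nr_shards;
    }
    return per_shard;
}
```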

vladzcloudius commented 3 years ago

Should the limit be per-shard or node-wide (well, NUMA-node wide maybe)? @vladzcloudius, @bhalevy what do you think?

There are two questions. First, how is it defined? I think it's pretty clear that it should be defined for the whole node.

Agree.

Second, how is it enforced? Ideally it would be good to enforce it in a distributed way across shards, but how hard is that, and would it hurt performance? For compaction and repair, the I/O demands should even out across the shards on average, but there could be pathological cases. For example, @nyh just mentioned, in an email thread with a user about how to deal with running out of space during compaction, that a workaround could be to not compact in parallel on all shards but rather to reduce the parallelism and serialize compactions across shards to reduce the temporary space requirements. We may want to do so automatically in the future, given a disk-space controller, and then the I/O demands won't be symmetrical.

The bottom line IMO is that if enforcing a global limit would be hard / risky, we can start simple and just divide the limit evenly between the shards.

Agree again. Let's start with the simple implementation that mimics what the current I/O scheduler does - divides the corresponding capability evenly among present shards. And then we can improve it incrementally.

slivne commented 3 years ago

@avikivity, can you please provide your input? We want to implement this.

avikivity commented 3 years ago

I'm not a huge fan, but also not against it. Certainly there is no Seastar reason against it.