Open xemul opened 2 years ago
@mykaul @avikivity this is "field-urgent".
In gist, what happens is this: the I/O queue on a specific shard ("shard A") can get long due to a temporary overload of that shard (something heavy was going on, like a memtable flush or a stall). But even after that overload is gone, if other shards also do I/O (e.g. compactions), the long I/O queue on shard A stays long the whole time, because the new I/O scheduler allocates an equal amount of I/O budget to every shard that needs to do I/O at the moment.
As a result, the long I/O queue on shard A keeps causing high I/O queue latency, which translates into high read latency.
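To illustrate the effect, here is a toy model (not seastar's actual scheduler code; the shard count, backlog and per-tick budget are made up). With an equal per-shard split of the group's dispatch budget, the backlogged shard drains only 1/nr_shards of the bandwidth per tick while the other shards keep consuming their shares, so its queue stays long:

```cpp
// Toy model only -- NOT seastar code. Shard 0 starts with a 1000-request
// backlog; shards 1..3 keep submitting a steady load (e.g. compaction).
// The group can dispatch 100 requests per tick, split equally between
// all shards that currently want I/O.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<long> queue = {1000, 0, 0, 0};   // per-shard pending requests
    const long group_budget_per_tick = 100;      // hypothetical disk capacity

    for (int tick = 1; tick <= 10; ++tick) {
        long share = group_budget_per_tick / long(queue.size());  // equal split
        for (std::size_t s = 0; s < queue.size(); ++s) {
            if (s != 0) {
                queue[s] += share;                  // steady new work on other shards
            }
            queue[s] -= std::min(queue[s], share);  // dispatch up to this shard's share
        }
        std::printf("tick %2d: shard 0 backlog = %ld\n", tick, queue[0]);
    }
    // Shard 0 drains only 25 requests/tick, so its queue (and its queue
    // latency) stays high long after the original overload is gone.
    return 0;
}
```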
This is a regression compared to the old I/O scheduler (2021.1) behavior in the same situation.
This means that the new I/O scheduler solved some problems but created new ones.
We need to give the fix for this issue the highest priority.
@tomer-sandler @dorlaor @harel-z FYI
(deleted comment mentioning customers)
How complex is it to fix this?
I have a patch that has two problems.
Please post it; we can use it as a base for brainstorming.
What's the latest status of this issue? (checking whether it'll make it into Scylla 5.4)
It's at lower priority, because there's a "workaround": one needs to configure more io-groups than seastar auto-detects, which makes the groups smaller and thus reduces the per-group imbalance. Avi thinks this should be the default behavior.
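To make the workaround concrete, here is a back-of-the-envelope sketch (assumptions: the seastar option for this is `--num-io-groups`, check the exact flag name for your build; the shard count and percentages are arbitrary). More groups means fewer shards per group, so a backlogged shard's guaranteed share of its group's bandwidth grows:

```cpp
// Back-of-the-envelope arithmetic for the workaround: more io-groups means
// fewer shards per group, so the equal split gives a single backlogged shard
// a larger fraction of its group's bandwidth. Numbers below are arbitrary.
#include <cstdio>

int main() {
    const int nr_shards = 64;                   // e.g. a large instance
    const int group_counts[] = {1, 2, 4, 8};    // values one might pass via --num-io-groups (assumed flag)
    for (int nr_io_groups : group_counts) {
        int shards_per_group = (nr_shards + nr_io_groups - 1) / nr_io_groups;  // ceil
        double worst_share = 100.0 / shards_per_group;  // % of group bandwidth per shard
        std::printf("io-groups=%d -> %d shards/group -> a backlogged shard gets >= %.1f%% of its group\n",
                    nr_io_groups, shards_per_group, worst_share);
    }
    return 0;
}
```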
@xemul so are we making this the default behavior?
I lean towards it, but I have no good idea how to calculate/estimate what number of shards per group is good enough.
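Just as a starting point for brainstorming, a hypothetical heuristic (nothing like this exists in seastar; the function name and numbers are made up) could turn the question around and derive the group count from a ceiling on shards per group:

```cpp
// Hypothetical heuristic, for discussion only: cap the group size so that a
// single shard's equal share of its group never drops below some fraction of
// the group's bandwidth. Nothing here reflects actual seastar defaults.
#include <algorithm>
#include <cstdio>

// Assumed inputs: total shard count and the maximum tolerated shards per group.
static int pick_nr_io_groups(int nr_shards, int max_shards_per_group) {
    int groups = (nr_shards + max_shards_per_group - 1) / max_shards_per_group;  // ceil
    return std::max(groups, 1);
}

int main() {
    // E.g. tolerating at most 8 shards per group, so under the equal split a
    // backlogged shard always gets at least ~1/8 of its group's bandwidth:
    const int shard_counts[] = {2, 8, 16, 64};
    for (int nr_shards : shard_counts) {
        std::printf("shards=%d -> io-groups=%d\n", nr_shards, pick_nr_io_groups(nr_shards, 8));
    }
    return 0;
}
```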
As seen on an i3.large node (2 cores) in scylladb/scylla#10704