Open xemul opened 2 years ago
@mykaul @avikivity this is "field-urgent".
In gist, what happens is this: the I/O queue on a specific shard ("shard A") can get long due to a temporary overload of that shard (something heavy was going on, like a memtable flush or a stall). But even after that overload is gone, if other shards also do I/O (e.g. compactions), the long I/O queue on shard A stays long the whole time, because the new I/O scheduler allocates an equal amount of I/O budget to every shard that needs to do I/O at the moment.
As a result, the long I/O queue on shard A keeps causing high I/O queue latency, which translates into high read latency.
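To illustrate the effect, here is a toy model (not seastar's actual scheduler code; the shard count, backlog and per-tick budget are made up). With an equal per-shard split of the group's dispatch budget, the backlogged shard drains only 1/nr_shards of the bandwidth per tick while the other shards keep consuming their shares, so its queue stays long:

```cpp
// Toy model only -- NOT seastar code. Shard 0 starts with a 1000-request
// backlog; shards 1..3 keep submitting a steady load (e.g. compaction).
// The group can dispatch 100 requests per tick, split equally between
// all shards that currently want I/O.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<long> queue = {1000, 0, 0, 0};   // per-shard pending requests
    const long group_budget_per_tick = 100;      // hypothetical disk capacity

    for (int tick = 1; tick <= 10; ++tick) {
        long share = group_budget_per_tick / long(queue.size());  // equal split
        for (std::size_t s = 0; s < queue.size(); ++s) {
            if (s != 0) {
                queue[s] += share;                  // steady new work on other shards
            }
            queue[s] -= std::min(queue[s], share);  // dispatch up to this shard's share
        }
        std::printf("tick %2d: shard 0 backlog = %ld\n", tick, queue[0]);
    }
    // Shard 0 drains only 25 requests/tick, so its queue (and its queue
    // latency) stays high long after the original overload is gone.
    return 0;
}
```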
This is a regression compared to the old I/O scheduler (2021.1) behavior in the same situation.
This means that the new I/O scheduler solved some problems but created new ones.
We need to give the fix for this issue the highest priority.
@tomer-sandler @dorlaor @harel-z FYI
(deleted comment mentioning customers)
How complex is it to fix this?
I have a patch that has two problems.
Please post it; we can use it as a base for brainstorming.
What's the latest status of this issue? (checking whether it'll make it into Scylla 5.4)
It's at lower priority, because there's a "workaround": one needs to configure more io-groups than seastar auto-detects, which makes the groups smaller and thus reduces the per-group imbalance. Avi thinks this should be the default behavior.
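To make the workaround concrete, here is a back-of-the-envelope sketch (assumptions: the seastar option for this is `--num-io-groups`, check the exact flag name for your build; the shard count and percentages are arbitrary). More groups means fewer shards per group, so a backlogged shard's guaranteed share of its group's bandwidth grows:

```cpp
// Back-of-the-envelope arithmetic for the workaround: more io-groups means
// fewer shards per group, so the equal split gives a single backlogged shard
// a larger fraction of its group's bandwidth. Numbers below are arbitrary.
#include <cstdio>

int main() {
    const int nr_shards = 64;                   // e.g. a large instance
    const int group_counts[] = {1, 2, 4, 8};    // values one might pass via --num-io-groups (assumed flag)
    for (int nr_io_groups : group_counts) {
        int shards_per_group = (nr_shards + nr_io_groups - 1) / nr_io_groups;  // ceil
        double worst_share = 100.0 / shards_per_group;  // % of group bandwidth per shard
        std::printf("io-groups=%d -> %d shards/group -> a backlogged shard gets >= %.1f%% of its group\n",
                    nr_io_groups, shards_per_group, worst_share);
    }
    return 0;
}
```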
@xemul so are we making this the default behavior?
I lean towards it, but I have no good idea how to calculate/estimate what number of shards per group is good enough.
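Just as a starting point for brainstorming, a hypothetical heuristic (nothing like this exists in seastar; the function name and numbers are made up) could turn the question around and derive the group count from a ceiling on shards per group:

```cpp
// Hypothetical heuristic, for discussion only: cap the group size so that a
// single shard's equal share of its group never drops below some fraction of
// the group's bandwidth. Nothing here reflects actual seastar defaults.
#include <algorithm>
#include <cstdio>

// Assumed inputs: total shard count and the maximum tolerated shards per group.
static int pick_nr_io_groups(int nr_shards, int max_shards_per_group) {
    int groups = (nr_shards + max_shards_per_group - 1) / max_shards_per_group;  // ceil
    return std::max(groups, 1);
}

int main() {
    // E.g. tolerating at most 8 shards per group, so under the equal split a
    // backlogged shard always gets at least ~1/8 of its group's bandwidth:
    const int shard_counts[] = {2, 8, 16, 64};
    for (int nr_shards : shard_counts) {
        std::printf("shards=%d -> io-groups=%d\n", nr_shards, pick_nr_io_groups(nr_shards, 8));
    }
    return 0;
}
```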
As seen on an i3.large node (2 cores) in scylladb/scylla#10704