scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0

Cross-shard preemption is needed #1430

Open xemul opened 1 year ago

xemul commented 1 year ago

As seen in, e.g., scylladb/scylla#12562, a high-prio class with a low-concurrency workload is in trouble. Because such a class submits requests only rarely, by the time its next request arrives the other shards have already queued plenty of IO of their own.

As a result, the high-prio request lands in the middle of the global token queue and only has a chance to get dispatched after all the preceding IO from other shards completes.
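
A toy model of the effect (illustrative only, not the actual Seastar scheduler; the class names follow the repro config below):

```cpp
// Toy model: the cross-shard token queue is served in arrival order, so a
// late request from a high-share class still waits behind everything the
// other shards queued earlier.
#include <cstdio>
#include <deque>
#include <string>

struct entry {
    unsigned shard;
    std::string cls;
    unsigned shares; // per-class shares; higher should mean "served sooner"
};

int main() {
    std::deque<entry> global_queue;

    // Shards 0..2 keep the low-share bulk writer busy and have already
    // grabbed their positions in the queue.
    for (unsigned shard = 0; shard < 3; shard++) {
        for (int i = 0; i < 2; i++) {
            global_queue.push_back({shard, "big_writes", 80});
        }
    }

    // Shard 3's latency-sensitive class wakes up late and can only append
    // to the tail, despite its much higher share count.
    global_queue.push_back({3, "latency_reads", 1000});

    // Arrival-order dispatch: the high-share request goes out last.
    while (!global_queue.empty()) {
        entry e = global_queue.front();
        global_queue.pop_front();
        std::printf("dispatch: shard=%u class=%s shares=%u\n",
                    e.shard, e.cls.c_str(), e.shares);
    }
    return 0;
}
```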

This can be reproduced with io_tester: `io_tester --conf jobs.yaml --storage /dev/null --duration 5`, with jobs.yaml being:

```yaml
- name: big_writes
  shards: all
  type: seqwrite
  data_size: 1GB
  shard_info:
    rps: 8000
    parallelism: 2
    reqsize: 128kB
    shares: 80

- name: latency_reads
  shards: all
  type: randread
  data_size: 1GB
  shard_info:
    rps: 250
    parallelism: 1
    reqsize: 512
    shares: 1000
  options:
    pause_distribution: poisson
```

avikivity commented 1 year ago

I think reducing the group sizes will help, and also tightening the scheduler's idea of what is an idle class.

xemul commented 1 year ago

> I think reducing the group sizes will help,

Agree

> and also tightening the scheduler's idea of what is an idle class.

Maybe tightening the preemption criteria? Because the "idle class" notion is as simple as "zero requests in it".

...

While writing the above question I got an idea. We could inject "empty" requests into the global token queue for classes that have been idle only briefly (i.e. had a request dispatched recently but don't yet have a new one queued). Then, when a real request shows up, it could "preempt" that empty one and thus get closer to being dispatched.
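
Roughly, as a toy sketch (illustrative only, not a patch against the real fair queue; all the names here are made up):

```cpp
#include <cstdio>
#include <deque>
#include <optional>
#include <string>

struct slot {
    std::string cls;
    bool placeholder; // true: a reserved spot with nothing to dispatch yet
};

struct toy_queue {
    std::deque<slot> q;

    // A class that went idle only recently keeps reserving positions so a
    // future request does not have to start from the tail.
    void reserve(const std::string& cls) {
        q.push_back({cls, true});
    }

    // A real request first tries to take over the earliest placeholder of
    // its class ("preempting" the empty request); otherwise it queues at
    // the tail as usual.
    void submit(const std::string& cls) {
        for (auto& s : q) {
            if (s.placeholder && s.cls == cls) {
                s.placeholder = false;
                return;
            }
        }
        q.push_back({cls, false});
    }

    // Dispatch in arrival order, skipping placeholders that never
    // materialized into real requests.
    std::optional<std::string> dispatch() {
        while (!q.empty()) {
            slot s = q.front();
            q.pop_front();
            if (!s.placeholder) {
                return s.cls;
            }
        }
        return std::nullopt;
    }
};

int main() {
    toy_queue tq;
    tq.reserve("latency_reads"); // the class idled briefly, keep a spot for it
    tq.submit("big_writes");
    tq.submit("big_writes");
    tq.submit("latency_reads");  // lands in the reserved spot, ahead of the writes
    while (auto cls = tq.dispatch()) {
        std::printf("dispatch: %s\n", cls->c_str());
    }
    return 0;
}
```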

avikivity commented 1 year ago

> > I think reducing the group sizes will help,

> Agree

> > and also tightening the scheduler's idea of what is an idle class.

> Maybe tightening the preemption criteria? Because the "idle class" notion is as simple as "zero requests in it".

Zero requests for some time?
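
Something like this, perhaps (a toy illustration; the field names and the grace period are assumptions, not Seastar code):

```cpp
// One way to express "zero requests for some time" instead of just
// "zero requests in it".
#include <chrono>
#include <cstdio>

struct toy_class {
    using clock = std::chrono::steady_clock;
    static constexpr auto idle_grace = std::chrono::microseconds(500); // assumed threshold

    unsigned queued = 0;               // requests currently waiting in the class
    clock::time_point last_dispatch{}; // when the class last had a request dispatched

    // Idle only if nothing is queued *and* nothing was dispatched recently.
    bool is_idle(clock::time_point now) const {
        return queued == 0 && now - last_dispatch > idle_grace;
    }
};

int main() {
    toy_class c;
    c.last_dispatch = toy_class::clock::now(); // just dispatched something
    std::printf("idle right after dispatch: %d\n", c.is_idle(toy_class::clock::now()));
    return 0;
}
```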

> ...

> While writing the above question I got an idea. We could inject "empty" requests into the global token queue for classes that have been idle only briefly (i.e. had a request dispatched recently but don't yet have a new one queued). Then, when a real request shows up, it could "preempt" that empty one and thus get closer to being dispatched.

Aha. A sort of reservation.