Follow-ups for different disk speeds issue

redpanda-data / redpanda

Follow-ups for different disk speeds issue #13824

Open travisdowns opened 1 year ago

travisdowns commented 1 year ago

A grab bag of follow-ups for the slowdown caused by different disk speeds. These can be broken out into their own issues:

Metrics around reactor polling performance, e.g., poll-to-poll interval
More metrics around the io queue, e.g., how full the token bucket is, whether the bucket got to zero (i.e., "throttling happened"), and how many times requests were ineligible for exec because a ticket could not be acquired, or at least something that lets us distinguish "time in the io queue is because it's actually holding back requests" from "time in the io queue is due to other factors, because requests are exec'd as soon as they are processed"
More metrics around the reactor io sink, e.g., queue time/occupancy of the sink
Metrics or self-profiling around syscalls (or maybe attach strace for 1s in the debug bundle?)
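
To make the io queue item concrete, here is a minimal sketch of a token bucket that records the signals asked for: fill level, how often the bucket reached zero, and how often a request was held back for lack of tokens. It is a toy model, not Seastar's io_queue; the class, field, and method names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InstrumentedTokenBucket:
    """Toy token bucket with throttling counters (hypothetical names,
    not Seastar's actual io_queue implementation)."""
    capacity: float          # maximum number of tokens the bucket can hold
    refill_per_sec: float    # tokens added per second of elapsed time
    tokens: float = 0.0      # current fill level
    hit_zero: int = 0        # times a dispatch drained the bucket to empty
    deferred: int = 0        # dispatch attempts refused for lack of tokens
    dispatched: int = 0      # requests that were allowed through

    def refill(self, elapsed_sec: float) -> None:
        """Add tokens for the elapsed time, capped at capacity."""
        self.tokens = min(self.capacity,
                          self.tokens + self.refill_per_sec * elapsed_sec)

    def try_dispatch(self, cost: float) -> bool:
        """Try to dispatch a request costing `cost` tokens."""
        if cost > self.tokens:
            self.deferred += 1   # the queue is actually holding this request back
            return False
        self.tokens -= cost
        self.dispatched += 1
        if self.tokens == 0:
            self.hit_zero += 1   # "throttling happened" marker
        return True

    def fill_fraction(self) -> float:
        """How full the bucket is, from 0.0 (empty) to 1.0 (full)."""
        return self.tokens / self.capacity


bucket = InstrumentedTokenBucket(capacity=4.0, refill_per_sec=2.0, tokens=4.0)
results = [bucket.try_dispatch(2.0) for _ in range(3)]  # third attempt is refused
bucket.refill(1.0)                                       # one second adds 2 tokens
results.append(bucket.try_dispatch(2.0))
print(results, bucket.deferred, bucket.hit_zero, bucket.fill_fraction())
```

A nonzero `deferred` count is the direct "requests were held back" signal; if it stays at zero while queue time is high, the time is being spent elsewhere.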

JIRA Link: CORE-1477

StephanDollberg commented 1 year ago

Created https://github.com/redpanda-data/seastar/pull/82 for more metrics on the io side.

Still need something for reactor poll-to-poll.
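
As a rough illustration of what a poll-to-poll metric could record, the sketch below tracks the gap between consecutive polls and keeps a count, a mean, a maximum, and the number of gaps above a threshold. The class and its fields are hypothetical and not part of the Seastar reactor; timestamps are passed in explicitly (integer microseconds here) so the example is deterministic, whereas a real loop might feed it a monotonic clock reading.

```python
from typing import Optional


class PollIntervalTracker:
    """Records the gap between consecutive polls of an event loop.
    Hypothetical helper for illustration only."""

    def __init__(self, slow_threshold_us: int) -> None:
        self.slow_threshold_us = slow_threshold_us
        self.last_poll_us: Optional[int] = None  # timestamp of the previous poll
        self.count = 0         # number of intervals observed
        self.total_us = 0      # sum of all intervals
        self.max_us = 0        # longest interval seen
        self.slow_polls = 0    # intervals longer than the threshold

    def on_poll(self, now_us: int) -> None:
        """Call once per poll with the current timestamp in microseconds."""
        if self.last_poll_us is not None:
            interval = now_us - self.last_poll_us
            self.count += 1
            self.total_us += interval
            self.max_us = max(self.max_us, interval)
            if interval > self.slow_threshold_us:
                self.slow_polls += 1
        self.last_poll_us = now_us

    def mean_us(self) -> float:
        """Average poll-to-poll interval, or 0.0 before two polls are seen."""
        return self.total_us / self.count if self.count else 0.0


tracker = PollIntervalTracker(slow_threshold_us=10_000)
# Polls at 0 ms, 1 ms, 2 ms, then a 50 ms stall, then 1 ms later.
for ts in [0, 1_000, 2_000, 52_000, 53_000]:
    tracker.on_poll(ts)
print(tracker.count, tracker.max_us, tracker.slow_polls, tracker.mean_us())
```

Exporting the maximum and the slow-poll count alongside the mean matters because a single long stall barely moves the average.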

Metrics or self-profiling around syscalls (or maybe attach strace for 1s in the debug bundle?)

Not sure about this one. While strace is probably available everywhere, I don't think it's up to the task performance-wise. I just tried it on a lightly loaded cluster, and attaching it for just one second already resulted in leadership transfers and latency spikes. Hence I'm not sure this is something we want to do in a debug situation.

StephanDollberg commented 9 months ago

More metrics around the io queue, e.g., how full the token bucket is, whether the bucket got to zero (i.e., "throttling happened"), and how many times requests were ineligible for exec because a ticket could not be acquired, or at least something that lets us distinguish "time in the io queue is because it's actually holding back requests" from "time in the io queue is due to other factors, because requests are exec'd as soon as they are processed"

Looking at the metrics again, I think that in theory at least the io_queue should throttle when rate(vectorized_io_queue_consumption) approaches 1. I guess that doesn't differentiate between the disk-feedback-based and refill-based types of throttling, but the former doesn't exist anymore. It's still not a hard "got rejected" metric, so if you are just at the edge it might not be very precise.

travisdowns commented 9 months ago

should throttle when rate(vectorized_io_queue_consumption) approaches 1

Right, but the disadvantage here is that it's averaged over the metrics sampling interval. So a rate close to 1 is strong evidence that throttling is occurring, but the converse is not true: the rate could be 0.1 and significant throttling could still occur within the interval (say, 5 seconds).

Still, it's a useful guideline that will identify some types of throttling.
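
The averaging effect can be shown with a small worked example. It assumes, as the comments above imply, that the consumption counter advances by one unit per second when the queue is saturated; the 5-second sampling interval, the 100 ms sub-intervals, and the 500 ms burst are made-up numbers for illustration.

```python
# Model one 5-second sampling interval as 50 sub-intervals of 100 ms each.
# Each entry is the consumption accrued in that sub-interval, in milliseconds
# of saturated capacity (100 means fully saturated for the whole 100 ms).
SUB_INTERVAL_MS = 100
consumed_ms = [100] * 5 + [0] * 45   # saturated for 500 ms, then idle

interval_ms = SUB_INTERVAL_MS * len(consumed_ms)

# What rate() over the whole sampling interval would report.
average_rate = sum(consumed_ms) / interval_ms

# What actually happened at finer granularity.
peak_rate = max(c / SUB_INTERVAL_MS for c in consumed_ms)
saturated_ms = sum(SUB_INTERVAL_MS for c in consumed_ms if c == SUB_INTERVAL_MS)

print(f"average rate over {interval_ms} ms: {average_rate}")
print(f"peak rate in any {SUB_INTERVAL_MS} ms window: {peak_rate}")
print(f"time spent saturated: {saturated_ms} ms")
```

The averaged rate is 0.1, yet the queue was fully saturated for half a second, which is the case a hard "bucket hit zero" or "request deferred" counter would catch.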