travisdowns opened this issue 1 year ago
Created https://github.com/redpanda-data/seastar/pull/82 for more metrics on the io side.
Still need something for reactor poll-to-poll.
Metrics or self-profiling around syscalls (or maybe attach strace for 1s in the debug bundle?)
Not sure about this one. While strace is probably available everywhere, I don't think it's up to the task performance-wise. I just tried it on a lightly loaded cluster and even one second resulted in leadership transfers and latency spikes. So I'm not sure this is something we want to do in a debug situation.
More metrics around the io queue, e.g., how full the token bucket is, whether the bucket hit zero (i.e., "throttling happened"), and how many times requests were ineligible for execution because a ticket could not be acquired. At a minimum, something that lets us distinguish "time in the io queue is because it's actually holding back requests" from "time in the io queue is due to other factors, and requests are executed as soon as they are processed".
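To make that last distinction answerable from a dashboard, here's a rough sketch of the kind of queries we could run if such metrics existed. All metric names below are hypothetical, not ones Seastar or Redpanda exports today:

```
# Hypothetical gauge: tokens currently available in the io_queue token bucket.
# If it pins at zero at any point in the window, the bucket ran dry ("throttling happened").
min_over_time(vectorized_io_queue_tokens_available[1m]) == 0

# Hypothetical counter: requests that could not be dispatched because a ticket
# could not be acquired. Any positive rate means the queue is actually holding
# requests back, as opposed to requests that execute as soon as they are processed.
rate(vectorized_io_queue_requests_blocked_total[1m]) > 0
```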
Looking at the metrics again, I think in theory at least the io_queue should throttle when rate(vectorized_io_queue_consumption) approaches 1. Though I guess that doesn't differentiate between the disk-feedback and refill-based types; however, the former doesn't exist anymore. It's still not a hard "got rejected" metric, so if you are just at the edge it might not be super precise.
> should throttle when rate(vectorized_io_queue_consumption) approaches 1
Right, but the disadvantage here is that it's averaged over the metrics sampling interval. So a rate close to 1 is strong evidence that throttling is occurring, but the converse is not true: the rate could be 0.1 yet significant throttling could still have occurred within the interval (say, 5 seconds).
Still, it's a useful guideline that will identify some types of throttling.
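For reference, a sketch of the check being discussed, plus one way to soften the averaging problem by looking at the worst short-window rate instead of a single long average. The exact metric name, labels, and thresholds may differ from what Redpanda actually exports, so treat this as an illustration rather than a tested alert:

```
# Strong evidence of io_queue throttling: token consumption rate close to 1.
rate(vectorized_io_queue_consumption[1m]) > 0.95

# Softens the averaging effect: the worst 1m-window rate seen over the last 10m,
# sampled every 30s via a subquery, rather than one average over a long interval.
max_over_time(rate(vectorized_io_queue_consumption[1m])[10m:30s]) > 0.95
```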
A grab bag of follow-ups for slowdown due to different disk speeds; these can be broken out into their own issues.
JIRA Link: CORE-1477