thelastpickle / tlp-stress

A workload-centric stress tool for Apache Cassandra. Designed for simplicity, no math degree required.
http://thelastpickle.com/tlp-stress/

Coordinated Omission: Improve latency measurement and reporting? #131

Open giltene opened 3 years ago

giltene commented 3 years ago

I've been digging into tlp-stress recently (cool tool!) and wanted to know if there is interest in improving the latency reporting mechanism to better report on externally observable latency.

As it currently stands, the measurement technique used appears to be susceptible to the Coordinated Omission problem (a term I coined a while back, and a problem prevalent in many load generators and monitoring tools). The mechanism by which it shows up here (as in most load generators that have not fixed it yet) is a coordinated stall between the server being measured and the issuing of operations on the load generator side. A stall on the server side will effectively stall the measurements in each load generator thread, since a thread will not issue a new request until the one in flight either succeeds or fails. When a thread does issue the next request, it sets the start time to the time the request was actually issued (a time the server was able to delay by stalling), rather than the time the request would have been issued had the coordination (and the stall of the load generator thread) not happened.

The resulting effect is that the latencies measured (and the stats they reflect) can end up discounting most of the server stall effects, reporting stats (like p99) that appear to show significantly lower latency than actual clients would experience. In effect, the reported latency stats describe something much closer to the service time within the server, and often very far from the externally visible response time that a non-coordinating client would see. The gap between the two can often be multiple orders of magnitude, especially at higher %'iles, and especially when non-trivial JVM pauses or other server stalls are exhibited.
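To make the pattern concrete, here is a minimal, deliberately simplified (synchronous) sketch of the coordinating measurement loop; the names are illustrative and this is not tlp-stress code:

```kotlin
import com.google.common.util.concurrent.RateLimiter

// Simplified sketch of the coordinating pattern, not actual tlp-stress code.
// `execute` stands in for sending one statement to Cassandra, `record` for the histogram.
fun coordinatedLoop(rate: RateLimiter, execute: () -> Unit, record: (latencyNanos: Long) -> Unit) {
    while (true) {
        rate.acquire()
        // The start time is taken only once the previous request has completed,
        // so a server stall also delays the next measurement's start time and
        // the stall largely disappears from the recorded latencies.
        val start = System.nanoTime()
        execute()
        record(System.nanoTime() - start)
    }
}
```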

For a detailed example of what the problem is and how different latency reports can be once it is corrected, look at the README for wrk2. The wrk2 repo was originally intended simply as an example of how to fix coordinated omission in load generators (I used it for that purpose in a workshop I gave on the subject), but it seems to have unintentionally grown in use. I'd much rather have this sort of thing fixed (or addressed from the start) in the mainstream version of each load generator (as we've managed to do with e.g. cassandra-stress, YCSB, Gatling, and RadarGun).

A simple litmus test I use, which immediately shows whether or not coordinated omission is in play, is the "what is the reported latency when the actual throughput is well below the load generator's attempted throughput" test. When a server is unable to actually satisfy the incoming load, and the load is not coordinating with the server, a queue of pending but not yet handled requests grows over time. Each of these pending requests has a start time determined by when it was attempted, not by when the server was willing to accept it, so externally experienced latencies (across all %'iles) grow linearly with time; the longer the test runs, the longer the latencies are (typically growing in a clean linear line). Whenever a load generator reports latencies that do not grow over time under such over-saturated-server conditions, it is exhibiting Coordinated Omission.

That omission mechanism is usually not limited to continually saturated conditions, and is likely affecting reporting even under non-saturated conditions (e.g. during the occasional multi-second or sub-second stalls that servers may exhibit under loads where tens of thousands of operations per second are being modeled). tlp-stress certainly exhibits this tell-tale behavior. And the coordination is pretty easy to spot in the code. It is basically here and here.
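As a rough back-of-the-envelope illustration of that linear growth (my own example numbers, not taken from any particular run): if the generator attempts a rate of λ_a operations per second while the server can only sustain λ_s < λ_a, the backlog and the corrected latency of a newly attempted request grow roughly as

```latex
\text{backlog}(t) \approx (\lambda_a - \lambda_s)\,t
\qquad\Rightarrow\qquad
\text{latency}(t) \approx \frac{\text{backlog}(t)}{\lambda_s}
                  = \frac{\lambda_a - \lambda_s}{\lambda_s}\,t
```

For example, attempting 10,000 ops/s against a server sustaining 8,000 ops/s adds about 2,000 pending operations per second of runtime, i.e. roughly 0.25 s of additional corrected latency for every second the test keeps running, whereas a coordinating generator keeps reporting something close to the flat per-request service time.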

There are many simple ways around this CO problem, and I'd like to collaborate on isolating and fixing it within tlp-stress to improve its latency reporting without [necessarily] changing what it actually does from a stress-generation point of view. Having fixed (or influenced others to fix or get right from the start) similar issues in multiple other load generators, I've learned to find simple ways to tweak existing generators to gather the right information, and I'd like to explore possible options to do that in the current code base.

One of the simplest ways to fix this may be to modify the start time measurement so that it reflects the intended/planned start time for the operation under the rate-limited model, rather than the time the operation actually got sent to the server once the server was ready to take it. This can be done by e.g. having two threads for each profile runner: a generator thread that enqueues pending operations, each with its own start time, onto a [very deep] queue on a schedule driven by the rate limiter, and a consumer thread that pulls pending operations from that queue as fast as it can and actually executes them, leaving the callback code that runs on completion of each operation unmodified.
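A minimal sketch of that two-thread arrangement, assuming Guava's RateLimiter and a plain blocking queue; the names and types are placeholders rather than tlp-stress APIs, and the consumer is shown as blocking for brevity:

```kotlin
import com.google.common.util.concurrent.RateLimiter
import java.util.concurrent.LinkedBlockingQueue
import kotlin.concurrent.thread

// Placeholder type: an operation stamped with the time it was *supposed* to start.
data class PendingOp(val intendedStartNanos: Long)

fun runProfile(rate: RateLimiter, execute: () -> Unit, record: (latencyNanos: Long) -> Unit) {
    val queue = LinkedBlockingQueue<PendingOp>()

    // Generator thread: enqueues operations on the rate limiter's schedule only,
    // never waiting on the server, and stamps each with its intended start time.
    thread {
        while (true) {
            rate.acquire()
            queue.put(PendingOp(System.nanoTime()))
        }
    }

    // Consumer thread: pulls and executes as fast as it can. Latency is measured
    // from the intended start time, so time spent queued behind a server stall
    // shows up in the reported numbers instead of being silently omitted.
    thread {
        while (true) {
            val op = queue.take()
            execute()
            record(System.nanoTime() - op.intendedStartNanos)
        }
    }
}
```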

adejanovski commented 3 years ago

Hi @giltene,

I've been digging into tlp-stress recently (cool tool!)

Thanks! We can thank @rustyrazorblade for building it from scratch.

You're totally right about tlp-stress being subject to coordinated omission. I've pushed a draft PR which implements a statement generator that honors the rate limit, starts the latency timers, and then posts the statements to a queue. Consumer threads run the statements against Cassandra with a configurable concurrency level to control the number of async requests. I've added a timeout beyond which requests that have been sitting in the queue for too long are dropped; otherwise stress tests were crashing with OOM when Cassandra couldn't keep up with the requested rate. Such dropped statements are counted as errors in the output metrics.
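For anyone following along, the dropping behavior might look roughly like this (a sketch with placeholder names and types, not the actual PR code):

```kotlin
import java.util.concurrent.LinkedBlockingQueue
import java.util.concurrent.TimeUnit

// Placeholder type: a statement stamped with the time its latency timer was started.
data class QueuedStatement(val startNanos: Long)

class StatementConsumer(
    private val queue: LinkedBlockingQueue<QueuedStatement>,
    private val queueTimeoutNanos: Long,
    private val execute: () -> Unit,
    private val recordLatency: (Long) -> Unit,
    private val recordError: () -> Unit
) : Runnable {
    override fun run() {
        while (true) {
            val op = queue.poll(1, TimeUnit.SECONDS) ?: continue
            if (System.nanoTime() - op.startNanos > queueTimeoutNanos) {
                // Sat in the queue too long: drop it and count it as an error,
                // instead of letting the backlog grow until the JVM runs out of heap.
                recordError()
                continue
            }
            execute()
            recordLatency(System.nanoTime() - op.startNanos)
        }
    }
}
```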

I still need to figure out how to handle unthrottled stress tests, most probably by pausing statement generation when the queue reaches a certain size. Or maybe we could implement some kind of backpressure based on p99 latencies, reducing the throughput when we go over a configurable threshold. This could also help run stress tests that answer the question "what throughput can I get without going over ***ms?".
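One simple way to get the "pause generation when the queue reaches a certain size" behavior, just as a sketch of the idea (reusing the QueuedStatement placeholder from the sketch above) rather than anything implemented:

```kotlin
import java.util.concurrent.ArrayBlockingQueue

// Hypothetical setting: maximum number of statements allowed to be pending at once.
const val MAX_PENDING = 10_000

val pending = ArrayBlockingQueue<QueuedStatement>(MAX_PENDING)

// On the generator side, put() blocks once MAX_PENDING statements are queued,
// which naturally pauses unthrottled statement generation until consumers catch up.
fun enqueue(statement: QueuedStatement) = pending.put(statement)
```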

Your thoughts on this implementation would be greatly appreciated.

julienlau commented 3 years ago

Any plan to release this as is? Maybe adding the client queue size as an input parameter and dropping requests when the queue is full (to avoid OOM) would be OK in a first version? In a later version, a supervision algorithm based on queue size and/or timeout could be added to implement an automatic "max throughput without client queuing" determination?

rustyrazorblade commented 3 years ago

Hey folks - I meant to reply to this and got sidetracked. I'm about to switch jobs and will be able to work on this tool again, and one of the first things I'd like to do is address this issue. I'll have a PR up soon-ish.