grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io

"sliding window" thresholds #2379

LaserPhaser commented 2 years ago

Feature Description

The current implementation of the threshold mechanism works only on values aggregated over the whole run. For example, when I set "autostop" at a 5% error rate, it only triggers once 5% of all requests in the whole run have failed. But degradations usually happen once RPS gets really high, and if you ramp up to that RPS step by step, you have to wait for quite some time: you can be seeing, say, 100% errors over the last 1 minute while the error rate for the whole run is still only 10%.

As a numeric example, say we run the following configuration:

- 10 RPS for 1 minute: 600 requests in total, all 200 OK
- 20 RPS for 1 minute: 1,200 requests in total, all 200 OK
- 30 RPS for 1 minute: 1,800 requests in total, all 200 OK
- 50 RPS for 1 minute: 3,000 requests in total, all 200 OK
- 60 RPS for 1 minute: 3,600 requests in total, but the system crashes during the last 10 seconds, so 3,000 requests succeed and 600 fail

So in total we have 600 + 1,200 + 1,800 + 3,000 + 3,000 = 9,600 "200 OK" responses and 600 "500" failures.

Those 600 errors are only about 5.9% of the total.

But over the last 10 seconds, the error rate is 100%.

Suggested Solution (optional)

My suggestion is to add "sliding windows" for thresholds. For example, I could be interested in "error rate" only for the last 1 minute or even 10 seconds. Something like:

export const options = {
  thresholds: {
    http_req_failed: ['rate<0.01[1m]'], // http errors should be less than 1% for the last 1m
    http_req_duration: ['p(95)<200[10m]'], // 95% of requests should be below 200ms for the last 10min
  },
};

Already existing or connected issues / PRs (optional)

No response

na-- commented 2 years ago

This is somewhat of a duplicate of https://github.com/grafana/k6/issues/1136, but it's much better explained (:blush:) and the other issue has become more of a catch-all that just collects various semi-related threshold improvement ideas, so I'll leave both open for now...

Implementing this efficiently will be quite complicated though. Sliding time windows are probably easy and efficient to implement for Counter metrics, but not so much for Trend ones, since percentile thresholds need the underlying samples rather than a simple per-bucket sum... And I have no idea how HDR histograms (https://github.com/grafana/k6/issues/763) will work with them :confused: The syntax might also end up different from what you propose - there are other issues with the current threshold syntax and we might adopt a v2 syntax that resembles something like PromQL, for example... :man_shrugging: Still, it's definitely a very valid use case we need to address, so thank you for opening such a detailed issue.

For now, as a workaround in some situations, you can approach the problem from the opposite direction... Instead of setting thresholds for time windows, you can set thresholds on specific tags (sub-metrics) and use the recently introduced ability to manually set VU-wide custom metric tags through the vu.tags property from k6/execution. You can set different tag values based on the current test execution time; for example, here's how you can tag metrics based on the stage the script is currently in: https://github.com/grafana/k6/issues/796#issuecomment-959396841 It's not the same and it's much less flexible than sliding time windows, but it's a viable workaround for some simpler cases.
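A minimal sketch of that workaround might look like this (the stage durations, the "stage" tag values, and the target URL are illustrative assumptions):

import http from 'k6/http';
import exec from 'k6/execution';

export const options = {
  stages: [
    { duration: '4m', target: 50 }, // ramp-up
    { duration: '1m', target: 60 }, // peak load
  ],
  thresholds: {
    // Sub-metric thresholds: only samples carrying the matching
    // "stage" tag value count toward each threshold.
    'http_req_failed{stage:ramp}': ['rate<0.05'],
    'http_req_failed{stage:peak}': ['rate<0.01'],
  },
};

export default function () {
  // Tag everything this VU emits from now on, based on elapsed test time.
  const elapsed = Date.now() - exec.scenario.startTime;
  exec.vu.tags['stage'] = elapsed < 4 * 60 * 1000 ? 'ramp' : 'peak';
  http.get('https://test.k6.io'); // hypothetical target URL
}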

LaserPhaser commented 2 years ago

@na-- maybe we can use https://pkg.go.dev/github.com/RussellLuo/slidingwindow#section-readme for example? I think I can implement a sliding window for "rate" with this library as Proof of Concept of the feature.
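For illustration, the core of such a PoC for a Rate metric could be a simple ring of time buckets. Here is a rough sketch of the idea in plain JavaScript (the class name, bucket size, and window length are illustrative assumptions - this is not k6 internals or that library's API):

// Bucketed sliding-window failure rate: a ring of numBuckets counters,
// each covering one bucketMs-sized slice of the window.
class SlidingWindowRate {
  constructor(windowMs, bucketMs) {
    this.bucketMs = bucketMs;
    this.numBuckets = Math.ceil(windowMs / bucketMs);
    this.buckets = Array.from({ length: this.numBuckets }, () => ({
      start: -1, total: 0, fails: 0,
    }));
  }

  _bucket(now) {
    const start = now - (now % this.bucketMs);
    const b = this.buckets[(start / this.bucketMs) % this.numBuckets];
    if (b.start !== start) {
      // This slot held counts that fell out of the window; recycle it.
      b.start = start;
      b.total = 0;
      b.fails = 0;
    }
    return b;
  }

  add(failed, now = Date.now()) {
    const b = this._bucket(now);
    b.total += 1;
    if (failed) b.fails += 1;
  }

  // Failure rate over (roughly) the last windowMs milliseconds.
  rate(now = Date.now()) {
    let total = 0;
    let fails = 0;
    for (const b of this.buckets) {
      if (b.start >= 0 && now - b.start < this.numBuckets * this.bucketMs) {
        total += b.total;
        fails += b.fails;
      }
    }
    return total === 0 ? 0 : fails / total;
  }
}

// Usage: failure rate over the last minute, in 10-second buckets.
const errRate = new SlidingWindowRate(60000, 10000);
errRate.add(true);
errRate.add(false);
console.log(errRate.rate()); // 0.5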

na-- commented 2 years ago

> maybe we can use https://pkg.go.dev/github.com/RussellLuo/slidingwindow#section-readme for example?

I am not sure this specific library could actually be used to calculate the sliding window thresholds for a Rate metric, it seems more like a rate-limiter implementation :confused: Maybe some of its internals can be reused, I don't know, but it doesn't matter all that much for now - that's probably the smallest potential problem I can see with this proposal. I don't want to dissuade you from trying to implement something like this, but there are a lot of issues and current in-progress work that surrounds these parts of k6 and that will probably prevent us from merging any such contribution soon, if ever... :disappointed:

We are currently in the midst of some pretty big threshold refactoring (see https://github.com/grafana/k6/pull/2356 and the connected issues, cc @oleiade), as the first step towards better thresholds. The problem is, we are still not sure about what steps 2, 3 and so on look like yet. We just know that there are plenty of deficiencies with the current thresholds, both in their capabilities and in their syntax, but we don't know exactly what the end goal looks like yet. For example, the syntax v2 might be PromQL-like, it might be something like what you propose (though rate[1m]<0.01 is probably better than rate<0.01[1m] :thinking: ), it might be something completely different :man_shrugging:

Somewhat connected to the above, we are also in the middle of refactoring how we handle metrics and metric samples. Recently we introduced a metrics registry (https://github.com/grafana/k6/issues/1832) and likely upcoming changes include the tracking of distinct time series (https://github.com/grafana/k6/issues/1831), user control of which metrics and sub-metrics k6 actually emits (https://github.com/grafana/k6/issues/1321), and refactoring in how we store metrics in-memory, likely including transitioning to something like HDR histograms (https://github.com/grafana/k6/issues/763) for Trend metrics.

Finally, thresholds in k6 run are evaluated somewhat differently than thresholds in k6 cloud / distributed tests, since there you have multiple streams of metrics to crunch. So, even if the local implementation looks easy, the cloud/distributed execution needs its own evaluation and/or validation.

All of these things might introduce different tradeoffs and affect how we implement "sliding window" thresholds, and vice-versa. So, it's currently difficult to gauge if any one-off changes like the one you propose in this issue will be in the direction we want to go or in some different direction that ties our hands... :disappointed:

srperf commented 1 year ago

I would do this with a custom metric: create a threshold against it, and add values to it as they are generated during the execution. Every time we move from one time window to the next, increase the values fed into the metric by an order of magnitude and scale the threshold as well. That way the previous windows' values are no longer significant enough for the threshold to take them into account. That's one idea for this situation.

The other idea I can think of is to add the ability to reset custom metrics while keeping the threshold against that metric.
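For reference, the basic pattern both ideas build on - a custom metric with its own threshold, fed during execution - looks like this minimal sketch (the metric name and target URL are illustrative assumptions; the per-window scaling/reset step has no built-in API today and is not shown):

import http from 'k6/http';
import { Rate } from 'k6/metrics';

// Custom Rate metric, tracked separately from the built-in http_req_failed.
const windowedErrors = new Rate('windowed_errors');

export const options = {
  thresholds: {
    // The threshold is defined against the custom metric.
    windowed_errors: ['rate<0.05'],
  },
};

export default function () {
  const res = http.get('https://test.k6.io'); // hypothetical target URL
  // Feed the custom metric as values are generated.
  windowedErrors.add(res.status >= 400);
}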