grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io
GNU Affero General Public License v3.0

Significantly high memory usage on 0.41.0? #2762

Closed: 19shubham11 closed this issue 1 year ago

19shubham11 commented 1 year ago

Brief summary

We've been using k6 for a couple of months and have developed a test suite that gives us around 200k RPS using 5 worker machines on GCP, each running the given suite. We use the ramping-vus executor, and this is how our config looks:

    export const options = {
        scenarios: {
            // scenario name elided in the original report
            ramping_vus_scenario: {
                executor: 'ramping-vus',
                gracefulStop: '1m',
                startVUs: 0,
                stages: [
                    { duration: '5m', target: 3000 },
                    { duration: '5m', target: 5000 },
                    { duration: '5m', target: 8000 },
                    { duration: '5m', target: 10000 },
                    { duration: '5m', target: 12000 },
                    { duration: '5m', target: 15000 },
                    { duration: '10m', target: 15000 },
                ],
                gracefulRampDown: '5m',
            },
        },
    };

The tests generally run for about 45 minutes and reach a maximum of 15,000 VUs.

Until yesterday we were running k6 version 0.40.0, and upon updating to 0.41.0, memory usage went up dramatically. We have a memory limit of 100GB on the instance, and 0.41.0 reached 85% memory usage within 10 minutes of the test executing. I reverted back to 0.40.0 and memory usage stayed down at ~10% for the entire 45 minutes.

Is this a known issue, or related to something introduced in the newer version, or maybe something got deprecated and I need to adjust the setup somehow? Happy to provide more details if needed.

k6 version

0.41.0

OS

Debian GNU/Linux 11 (bullseye)

Docker version and image (if applicable)

No response

Steps to reproduce the problem

Run the ramping-vus suite above on k6 0.41.0.

Expected behaviour

Memory usage comparable to 0.40.0.

Actual behaviour

Significantly high memory usage on 0.41.0.

na-- commented 1 year ago

Can you share something about what your script actually does? Does it generate a lot of metrics with unique/high-cardinality tags? For example, do you have a ton of unique URLs or something like that?

If so, the high memory usage might be because of this change in k6 v0.41.0 and you may be able to ameliorate it by using URL grouping: https://k6.io/docs/using-k6/http-requests/#url-grouping
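For illustration, a minimal sketch of manual name-tag grouping; the /posts/{id} endpoint and the random id below are hypothetical stand-ins, not from the reporter's script:

    import http from 'k6/http';

    export default function () {
        const id = Math.floor(Math.random() * 10000) + 1;
        // Without a name tag, every distinct id produces a distinct URL,
        // and (since v0.41.0) a distinct time series. Setting the same
        // name for all ids groups them into a single series:
        http.get(`https://example.com/posts/${id}`, {
            tags: { name: 'https://example.com/posts/${id}' },
        });
    }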

If not, then please share any other details about your script to help us diagnose what the issue might be.

19shubham11 commented 1 year ago

Oh interesting. Yes, the script actually has around 12 unique URLs, and a few with a path param that changes based on the previous response. They are mostly CRUD operations, called in sequence over and over (as you can see, it maxes out at 15k VUs).

So if I understand correctly based on https://k6.io/docs/using-k6/http-requests/#url-grouping, it will be generating unique metrics per URL? (i.e. users/1 and users/2 would be treated differently?)

I'm assuming a URL like /users/{:id} called 10k times will create 10k new metrics in the newer version? Any way to disable this?

na-- commented 1 year ago

> So if I understand correctly based on https://k6.io/docs/using-k6/http-requests/#url-grouping, it will be generating unique metrics per URL? (i.e. users/1 and users/2 would be treated differently?)
>
> I'm assuming a URL like /users/{:id} called 10k times will create 10k new metrics in the newer version? Any way to disable this?

Yes. Or, rather, it will create 10k (or more, if you have other differences in tags) time series.

This is probably the problem. Try using the http.url helper or manually setting the name tag for these requests, and you should see your memory usage decrease significantly. Memory usage (with a reasonable number of time series) and the garbage collection CPU overhead should actually be lower than on v0.40.0 :crossed_fingers:
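As a sketch, the http.url helper applied to the same hypothetical endpoint as above:

    import http from 'k6/http';

    export default function () {
        const id = Math.floor(Math.random() * 10000) + 1;
        // http.url is a template-literal tag: the request goes to the
        // interpolated URL, while the 'name' tag is set from the template
        // text, so all ids share one time series.
        http.get(http.url`https://example.com/posts/${id}`);
    }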

19shubham11 commented 1 year ago

Alright, thanks! I'll try adding the name tag to all of the paths and report back.

On an unrelated note, is it expected to have minor-ish "breaking" changes in normal releases? We just download the latest version, so we were unaware of the 0.41 release until today.

(I haven't really read the full release policy myself, so feel free to ignore this; plus, it's not a breaking change anyway, just a performance dip IMO. Anyway, it's a great tool and I have loved using it so far.)

19shubham11 commented 1 year ago

Added name tags to the params and can confirm memory usage is no longer shooting up, thanks!

na-- commented 1 year ago

Awesome :tada: Can you provide a rough estimate of how many unique URLs your script was hitting? Even 10k-15k unique URLs (and so, time series) shouldn't have caused such a huge increase in memory usage, according to our tests.

19shubham11 commented 1 year ago

10-15k was just an example I gave 😅 so, for some actual numbers: endpoints with path params (unique URLs) would be called around 10k/sec, so ~10,000 × 60 × 45 ≈ 27M for the full 45-minute test that we run. But since on 0.41.0 we almost went OOM after around 10 minutes, that would be ~10,000 × 60 × 10 ≈ 6M. So I'm assuming ~6M unique time series, and they kept adding up.

na-- commented 1 year ago

Ah, yeah, that would certainly do it :sweat_smile:

Now that we can actually track the number of unique time series, we will probably add some sort of warning if some threshold is exceeded, e.g. 100k? :thinking: We'll need to do some benchmarking.

19shubham11 commented 1 year ago

Yeah, I think that would be great. Logs might be hard to follow sometimes, but it would be something. (It's also not easy to hit those numbers on a local setup with limited CPU/memory, from what I learnt.)

Is there a possibility to disable these time series metrics entirely (or is something planned for the future)? I'm not really using them all that extensively; we just rely on the Prometheus metrics on the server side to validate our results, not on the load-test client.

na-- commented 1 year ago

> Is there a possibility to disable these time series metrics entirely (or is something planned for the future)?

Unfortunately you can't disable them and we probably won't add such a feature in the future, sorry :disappointed:

It's not ideal, and it is a problem for some existing tests like yours, but on the other hand, a whole bunch of core things that now work on top of the time series functionality are (or can be) way more efficient than before, and we also need time series for certain other features to be possible to implement at all.

And yeah, unfortunately, if there are millions of unique URLs in your test, you'd need to adjust your script slightly and add the name tag to group them, but it's a viable workaround. You needed to do that URL grouping with name even before, if you wanted to export your metrics to the k6 Cloud or InfluxDB, or basically any other output besides CSV and JSON, or if you needed to set thresholds on the metrics from these requests (see the sketch below). With the http.url helper it's not even that big of an overhead; it's just a template literal with a few extra characters. It sucks, but most other tools that deal with metrics also have cardinality restrictions, precisely for reasons similar to why we now need them... :disappointed:
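A minimal sketch of such a threshold, assuming requests were grouped under a hypothetical name tag 'PostsItem':

    export const options = {
        thresholds: {
            // Applies to the http_req_duration sub-metric for all requests
            // tagged name=PostsItem, i.e. every /posts/{id} call at once:
            'http_req_duration{name:PostsItem}': ['p(95)<500'],
        },
    };

    // Requests would be tagged correspondingly:
    // http.get(`https://example.com/posts/${id}`, { tags: { name: 'PostsItem' } });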

And for non-URL unique tags, we intend to have a JS API that supports high-cardinality metric metadata in the future, i.e. basically something like tags that don't result in new time series being created for different values. Right now that part is internal only (i.e. usable from Go code in the core and in xk6 extensions).

19shubham11 commented 1 year ago

Yeah, totally makes sense. Thanks for the clarification and the quick help on this issue as well :)

na-- commented 1 year ago

I'll close this issue, since I opened https://github.com/grafana/k6-docs/issues/883, https://github.com/grafana/k6/issues/2765 and https://github.com/grafana/k6/issues/2766 for the various things we touched on here :sweat_smile: