grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io
GNU Affero General Public License v3.0

Make special handling of `check` metrics in StatsD output configurable #2470

Closed: na-- closed this issue 1 week ago

na-- commented 2 years ago

The statsd output (also used for New Relic and Datadog, among others) currently has special handling of `check` metric samples:

https://github.com/grafana/k6/blob/59a8883676edf4e28edb39561c33ebd941e05944/output/statsd/output.go#L79-L87
https://github.com/grafana/k6/blob/59a8883676edf4e28edb39561c33ebd941e05944/output/statsd/output.go#L94-L100

It transforms the metric names to `check.<check-name>.pass` or `check.<check-name>.fail`. I don't remember why this was done, probably because StatsD doesn't seem to support anything like our Rate metric (so we transform it to two Count ones here), and because we didn't want the checks to be grouped together (sort of a localized workaround for https://github.com/grafana/k6/issues/1321). In any case, it definitely needs to be better documented (https://github.com/grafana/k6-docs/issues/603) and ideally also made configurable (this issue) :thinking:
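For readers without the linked source at hand, the transformation amounts to roughly the following (a minimal sketch of the behavior described above, not the actual output/statsd/output.go code; the `Sample` type is a simplified stand-in):

```go
package main

import "fmt"

// Sample is a simplified stand-in for a k6 metric sample.
type Sample struct {
	Metric string            // metric name, e.g. "check"
	Value  float64           // 1 for a passing check, 0 for a failing one
	Tags   map[string]string // sample tags, including the check's name
}

// checkCounterName sketches the special case: a sample of the built-in
// `check` Rate metric is emitted as one of two Count metrics,
// check.<check-name>.pass or check.<check-name>.fail.
func checkCounterName(s Sample) string {
	if s.Metric != "check" {
		return s.Metric
	}
	if s.Value > 0 {
		return "check." + s.Tags["check"] + ".pass"
	}
	return "check." + s.Tags["check"] + ".fail"
}

func main() {
	fmt.Println(checkCounterName(Sample{
		Metric: "check",
		Value:  1,
		Tags:   map[string]string{"check": "status_is_200"},
	})) // check.status_is_200.pass
}
```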

For example, we can add a new `K6_STATSD_EXPAND_RATE_ON_TAGS` option to the statsd output, with the default value of `check`. Then:
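The example that originally followed is not preserved here, but one plausible reading of the proposal (the semantics below are an assumption, not a confirmed design) is that the option would hold a list of tag names, and any Rate sample carrying one of those tags would get the same pass/fail expansion:

```go
// Hypothetical sketch of the proposed option (assumed semantics, not a
// confirmed design), reusing the Sample type from the sketch above:
// expandOnTags would be parsed from K6_STATSD_EXPAND_RATE_ON_TAGS, and
// with its default value of ["check"] this reproduces today's behavior.
func expandedRateName(s Sample, expandOnTags []string) string {
	for _, tag := range expandOnTags {
		if v, ok := s.Tags[tag]; ok {
			if s.Value > 0 {
				return s.Metric + "." + v + ".pass"
			}
			return s.Metric + "." + v + ".fail"
		}
	}
	return s.Metric // no configured tag present: name unchanged
}
```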

Connected issues and forum threads:

papatelst commented 1 year ago

I am running k6 through TeamCity on Docker and sending data to Datadog using StatsD.

There are two issues with the `http_req_failed` metric:

  1. It's reported as a random, incorrect number. The k6 build logs report it as a percentage, while Datadog shows it as an unrelated number.
  2. Sometimes it is not reported at all, seemingly at random. This is a critical bug, as a missing report means we can miss catching a bug during releases.

I opened Datadog support ticket 954788, and they told me to open a ticket with k6 support. Please fix the issue below. Here is what the Datadog team found:

After looking into this further regarding the "http_req_failed" metric, please see our findings below:

k6 documentation does state that `http_req_failed` is a rate ("A metric that tracks the percentage of added values that are non-zero."). However, it is also k6 that is submitting the metric to the DD Agent as a counter:

`k6.http_req_failed:0|c|#scenario:contacts,proto:HTTP/2.0,tls_version:tls1.3,method:GET`

If the metric is incorrect, we believe this is because k6 is sending it incorrectly. As the integration is managed by k6, I'd suggest bringing this up with them as well (support@k6.io). See integrations-extras/CODEOWNERS at master · DataDog/integrations-extras.
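For reference, the quoted line is a DogStatsD datagram of the form `<name>:<value>|<type>|#<tag>:<value>,...`: metric `k6.http_req_failed`, value `0`, type `c` (counter), followed by the sample's tags. The format has no rate or percentage type, which is why k6 has to pick some counter-based representation.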

mstoykov commented 1 year ago

Thanks for your feedback @papatelst :bow:

It's reported as a random, incorrect number.

The thing is, the number isn't random: it is the number of positive `http_req_failed` values, i.e. the number of times an HTTP request has failed. You are missing the other part, which is the number of times it didn't fail.

If it is not reported at all, it means there were zero failures.

I guess you can correlate this with `http_reqs`, so `http_req_failed/http_reqs` will give you the rate of failed requests. This depends on you not having requests with a null `responseCallback`.

I opened Datadog support ticket 954788

This seems to be a private ticket that I cannot access. I don't know whether we need to, or should, either :shrug:

It seems to me like we need to implement the `check` hack for the rest of the Rate metrics, but maybe `pass` and `fail` are the wrong "suffixes".

I also think we should make this the default, but that will require it to be implemented first, plus at least one release with warnings around the change. So maybe let's focus on the implementation part first.

papatelst commented 1 year ago

@mstoykov Here is an example: `http_req_failed` in the k6 build log shows 59.88%, whereas the Datadog UI shows 1.09. What is the relation between these two numbers? The k6 build log shows 67 requests failed out of 167 (`http_reqs`), which should be a percentage of 67/167 = 40.11%. Not sure why it shows 59.88%? [screenshots: k6 build log and Datadog UI] Can k6 send a percentage value to Datadog for display?

There is also another issue where `k6.http_req_duration.avg` in the Datadog UI is close to, but never the same as, the k6 build log; k6 is not sending certain metrics to Datadog. The feature request is for k6 to send the following example metric to Datadog: `k6.http_req_duration.avg:13930.0`, in milliseconds. [screenshot from the Datadog ticket response]

mstoykov commented 1 year ago

The k6 build log shows 67 requests failed out of 167 (`http_reqs`), which should be a percentage of 67/167 = 40.11%. Not sure why it shows 59.88%?

You are misinterpreting the results: it is 100 failed and 67 not failed, so 100/167 = 59.88%. Unfortunately this is a pretty common mistake, as having double negatives is not a great idea; there is already an issue with more info, #2306.

The 1.09 is confusing to me as well. Given that it is an average (by the way, never use averages ;)), I can only guess that it is the average of the value over some period. So Datadog received, say, 1, 10, 0, 0, 0, 3, averaged them, and got that number. Maybe ask them how they aggregate; at this point in time k6 does not aggregate when writing to the statsd output, which is the same one Datadog uses. You should be able to see over what period that avg is taken in the query; it seems like you added this one in particular.

Can k6 send a percentage value to Datadog for display?

StatsD, the protocol, which is what Datadog receives, does not support percentages. The way to do them is basically what k6 does specifically for the `check` metric: we send one metric that is the positive values of the rate and one that is the negative (or the total number), and then use Datadog to do the math, as in `pass/(pass+fail)`, for example.

Which is basically what this issue is about: adding the ability for the other internal k6 Rate metrics to be sent to Datadog in this format.
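As a concrete illustration with made-up numbers: if a run produced the counters `check.my_check.pass = 150` and `check.my_check.fail = 50`, a Datadog-side formula of `pass / (pass + fail)` would give 150 / 200 = 0.75, i.e. a 75% pass rate.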

I also just found this community forum thread, where stopping the Datadog agent too fast means it doesn't send the last few metrics.

I also ended up actually building the dashboard and trying to populate it, and it looks something like this: [dashboard screenshot] (this is on top of the dashboard I had before, which seems to be the one that k6 contributed). The `http_req_failed` graph settings look like this: [graph settings screenshot] Notes:

  1. In the first display I use a formula where I calculate `http_req_failed/http_reqs`, as I mentioned before, but do not show the original values.
  2. A second display shows the two metrics (watch out to use max instead of avg, as avg is something Datadog calculates), set to a different Y-axis. This is so the rate is actually visible, since it will be between 0 and 1, and you usually have way more requests than that.

As you can also see, if I don't average over the metrics, `http_req_failed` is a whole integer (as it should be).

Their response on `http_req_duration.avg` makes no sense to me, because the datadog-agent seems to send avg on its own, and I have it in my graph right now. Maybe they just didn't see it, as I also have problems with their UI not showing all the metrics until I wait 5-10 seconds :shrug:. Or maybe you have it configured differently, or maybe I misunderstood what they are saying :shrug:

Also, that avg is taken over the time the Datadog agent aggregates/buffers before sending to Datadog (from what I understand); again, k6 currently does not aggregate samples on its side for the statsd output, but the Datadog agent definitely does. So `http_req_duration.avg` is the average over the Datadog agent's buffering window. You then get a bunch of those values in Datadog, and if you average over them you will (or, more accurately, are very likely to) get a different value than averaging over all the raw samples, as k6 does. And that seems to be what you are doing, given the screenshot.
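A tiny self-contained illustration of why an average of per-buffer averages differs from the average over all raw samples (the numbers are made up):

```go
package main

import "fmt"

// avg returns the arithmetic mean of xs.
func avg(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	// Made-up http_req_duration samples, split into the two flush
	// windows an agent might buffer before sending.
	bucketA := []float64{100, 200, 300} // 3 samples, avg = 200
	bucketB := []float64{1000}          // 1 sample,  avg = 1000

	all := append(append([]float64{}, bucketA...), bucketB...)

	fmt.Println(avg(all))                                   // 400: average over all raw samples
	fmt.Println(avg([]float64{avg(bucketA), avg(bucketB)})) // 600: average of the per-bucket averages
}
```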

I do remember there were some problems with high volumes and Docker, so, whether you are using Docker or not, you can try decreasing `K6_STATSD_PUSH_INTERVAL` to "100ms" or even "10ms" :shrug: to see if the numbers come closer, but that might not be the problem.
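(For reference, that option is set as an environment variable on the k6 process, e.g. `K6_STATSD_PUSH_INTERVAL=100ms k6 run script.js`, where `script.js` stands in for your test script.)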

Please, if you want to report different issues, open new ones. This one is about how Rate is handled when we need to send it to statsd/Datadog. Discussing unrelated issues here is likely to muddy this one and make it harder to discuss the particular thing at hand. It also makes it harder to keep track of what we have and haven't answered or looked into.

Again thank you for reporting this :bow:

papatelst commented 1 year ago

@mstoykov 1.09 is not an average of multiple runs; it is the single value of a single run. In my Datadog UI snapshot you can see that Value is the 5th column and Avg is the 1st column, and both have the same values, as I am highlighting only one run in my snapshot. Are you saying that once this issue is resolved, I should see 100/167 = 0.59 in the Datadog graph? If not, what is this ticket trying to do, and how can I see 0.59 in the Datadog UI instead of an unrelated number like 1.09?

Should I open a separate GitHub issue for "There is also another issue where k6.http_req_duration.avg in the Datadog UI is close to, but never the same as, the k6 build log. k6 is not sending certain metrics to Datadog."? @Vmihajlovic told me that this issue should address both support tickets (http_req_failed and k6.http_req_duration.avg) I opened with k6.

mstoykov commented 1 year ago

It is the single value of a single run. In my Datadog UI snapshot you can see that Value is the 5th column and Avg is the 1st column, and both have the same values, as I am highlighting only one run in my snapshot.

No, it clearly is the average of the values; k6 does not calculate the average, it just sends the values. If you look carefully you will see that there is avg: at the beginning of the metrics, which means that Datadog is calculating the average. If you use max by (or, I guess, sum by in the case of counter metrics) instead of avg by in the formula here: [screenshot of the query editor]

(I specifically left one of them as an average to show the difference.) Then you get: [screenshot] As you can see, the one starting with avg: is way smaller than the one with sum:, and it is the average of the points over some interval that Datadog comes up with (I couldn't find it in their docs).

The sum by, on the other hand, matches what I got locally yesterday.

If you look, all the max/min/sum values for the metrics not starting with avg: are integers, while the one starting with avg: is all floats. The failure rate on top is what you want, which, as I mentioned before, is calculated as: [screenshot of the formula] At least until we fix this issue, after which there will likely be `http_req_failed.pass` and `http_req_failed.failed`, or `http_req_failed.all`.
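(For anyone reproducing this: in Datadog's metric query syntax the aggregator is the query's prefix, so the switch described above is roughly from `avg:k6.http_req_failed{*}` to `sum:k6.http_req_failed{*}` or `max:k6.http_req_failed{*}`; the exact query strings depend on how the dashboard is set up.)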

Should I open a separate GitHub issue for "There is also another issue where k6.http_req_duration.avg in the Datadog UI is close to, but never the same as, the k6 build log. k6 is not sending certain metrics to Datadog."? @Vmihajlovic told me that this issue should address both support tickets (http_req_failed and k6.http_req_duration.avg) I opened with k6.

Yes, let's open a new issue.

LeonAdato commented 1 week ago

Per @olegbespalov and @javaducky, this issue should probably be part of the StatsD project. Feel free to transfer it here:

https://github.com/LeonAdato/xk6-output-statsd

olegbespalov commented 1 week ago

Closing in favor of https://github.com/LeonAdato/xk6-output-statsd/issues/28