grafana / k6

A modern load testing tool, using Go and JavaScript - https://k6.io
GNU Affero General Public License v3.0

Statsd seems to be dropping metrics probably because of UDP #2044

Closed: mstoykov closed this issue 1 week ago

mstoykov commented 3 years ago

Discussed in a forum thread.

My last comment confirms that:

  1. running between Docker containers is worse
  2. fixing the MTU in Docker helps
  3. lowering the push interval (effectively always pushing smaller packets) helps more and effectively fixes it in my case, although this shouldn't be needed, as the library should already be doing the same :shrug:
  4. updating datadog-go makes it slightly worse :facepalm:
  5. k6 v0.29.0 (a random older version I tried) is even worse :facepalm:

Possible actions:

  1. lower the push interval - this seems like a good, fast solution; given that we are sending metrics as UDP packets, which have a small size limit, k6 not pushing more often doesn't really help anything IMO (see the configuration example below this list)
  2. investigate this further - it might not be UDP-related at all, in which case switching statsd libraries to one that also supports TCP might not even help
  3. investigate datadog-go and the aggregation support in datadog-go ... maybe it will help :shrug:
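
For reference, the push interval and datagram batching of the statsd output can be tuned without code changes; a minimal sketch of the relevant environment variables (assuming the documented K6_STATSD_* options; the values are only illustrative), in the same docker-compose style used later in this thread:

K6_STATSD_ADDR: "statsd-host:8125"
K6_STATSD_PUSH_INTERVAL: "100ms"
K6_STATSD_BUFFER_SIZE: "20"

The default push interval is 1s; a smaller value sends the same metrics in more frequent, less bursty flushes.
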
shekharshukla commented 2 years ago

Is there any update on this?

mstoykov commented 2 years ago

If there is an update, we will post it on the issue. Do you want to work on it?

rodrigotheodoropsv commented 1 year ago

I'm having the same problem with that :(

K6 output: (screenshot)

DataDog: (screenshot)

Idosegal commented 1 year ago

Any updates about this issue? I'm facing the same behavior.

DanielCastronovo commented 1 year ago

Same here

rodrigotheodoropsv commented 1 year ago

I've just changed where I run the datadog-agent and changed some OS settings, and it worked for me. Now I run the agent on the same machine where I run k6, so the metrics go from k6 to the datadog-agent over the loopback interface, and the agent forwards them to DataDog normally.
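
Concretely, that amounts to pointing the statsd output at the agent over the loopback interface, for example (illustrative, assuming the default DogStatsD port):

K6_STATSD_ADDR: "127.0.0.1:8125"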

bgallagher88 commented 1 year ago

@rodrigotheodoropsv - That's great that you got it working. Can you tell me what settings were changed to get it to work?

rodrigotheodoropsv commented 1 year ago

We use a VM with Ubuntu, but I don't know the version or more details about it, because the DevOps team takes care of that.

PS: There are a lot of settings here that we use for other things as well.

The settings are:

net.core.somaxconn = 65535
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_max_syn_backlog = 20000
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_fin_timeout = 2
net.ipv4.tcp_slow_start_after_idle = 0
vm.swappiness = 1
vm.overcommit_memory = 1
net.ipv4.tcp_wmem = 4096 12582912 16777216
net.ipv4.tcp_rmem = 4096 12582912 16777216
net.ipv4.tcp_retries2 = 5
net.ipv4.tcp_tw_reuse = 1
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
vm.max_map_count = 999999
fs.nr_open = 100000000
fs.file-max = 100000000
kernel.pid_max = 3999999
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_probes = 4
net.core.rmem_default = 33554432
net.core.wmem_default = 33554432
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.core.netdev_max_backlog = 2500
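
Of these, the socket receive buffer settings (net.core.rmem_default / net.core.rmem_max) and net.core.netdev_max_backlog are arguably the ones most directly related to UDP drops on the receiving host; as an example, they can be applied at runtime with sysctl (values taken from the list above):

sysctl -w net.core.rmem_default=33554432
sysctl -w net.core.rmem_max=33554432
sysctl -w net.core.netdev_max_backlog=2500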

Idosegal commented 1 year ago

Has anyone experienced the same in a containerized environment?

kullanici0606 commented 1 year ago

I tried to investigate the issue, and my initial hunch is that the problem is not related to UDP itself. It seems that the statsd side (statsd_exporter here) cannot handle the incoming packets fast enough.

I first used netcat to verify that the packets are delivered to the statsd host (neither tcpdump nor a package manager is available on the statsd-exporter image).

I changed the statsd address to port 9127 in the k6 container definition:

K6_STATSD_ADDR: "statsdex:9127"

I also added a short sleep to the k6 command so that it waits until nc has started on statsd-exporter.

On statsd-exporter, I started nc: nc -l -u -p 9127 > packets.txt

After the test ended, the request count reported by k6 matched the "k6.http_reqs" count in packets.txt:

$ grep -c k6.http_reqs packets.txt 
13786

k6_1 | http_reqs......................: 13786 229.537003/s

On the other hand, when I use statsd as the output address, its own UDP receive metrics report fewer packets than the ones captured by tcpdump on the k6 host. UDP packets in the pcap (udp.dstport == 9125): 21535

$ curl -s http://localhost:9102/metrics | grep udp
# HELP statsd_exporter_udp_packets_total The total number of StatsD packets received over UDP.
# TYPE statsd_exporter_udp_packets_total counter
statsd_exporter_udp_packets_total 21272

When I checked the statsd_exporter code (https://github.com/prometheus/statsd_exporter/blob/871e2d8df1da51d4eed27738f4bc079271a09c61/pkg/listener/listener.go#L54-L82), it seems that instead of processing the packet in a separate goroutine/thread, it first parses the packet into events and then sends them to a channel, all in the UDP-handling goroutine/thread. The parse operation can take a long time depending on the metrics delivered, and this may cause packet drops.
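
A minimal sketch of that idea in Go (not statsd_exporter's actual code; the channel size and handlePacket are placeholders, just to illustrate moving parsing off the read loop):

package main

import (
	"log"
	"net"
)

// handlePacket stands in for parsing the statsd lines and forwarding events.
func handlePacket(p []byte) {
	_ = p
}

func main() {
	addr, err := net.ResolveUDPAddr("udp", ":9125")
	if err != nil {
		log.Fatal(err)
	}
	conn, err := net.ListenUDP("udp", addr)
	if err != nil {
		log.Fatal(err)
	}

	// A buffered channel decouples the read loop from the (slower) parsing.
	packets := make(chan []byte, 1024)

	// Parsing runs in its own goroutine, so a slow parse no longer blocks
	// the next ReadFromUDP call and lets the kernel's socket buffer fill up.
	go func() {
		for p := range packets {
			handlePacket(p)
		}
	}()

	buf := make([]byte, 65535)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			log.Println("read error:", err)
			continue
		}
		// Copy, because buf is reused on the next iteration.
		data := make([]byte, n)
		copy(data, buf[:n])
		select {
		case packets <- data:
		default:
			// Channel full: the drop happens here, visibly, instead of as
			// a silent kernel-level UDP drop; a counter could track these.
		}
	}
}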

It seems that k6 sends large UDP packets, for example:

k6.vus:15.000000|g k6.vus_max:50.000000|g k6.http_reqs:1|c|#name:https://test.k6.io/,tls_version:tls1.3,scenario:default,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1,expected_response:true k6.http_req_duration:146.138629|ms|#name:https://test.k6.io/,tls_version:tls1.3,proto:HTTP/1.1,scenario:default,expected_response:true,url:https://test.k6.io/,method:GET,status:200 k6.http_req_blocked:3677.270817|ms|#expected_response:true,scenario:default,name:https://test.k6.io/,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1,tls_version:tls1.3 k6.http_req_connecting:12.663258|ms|#status:200,expected_response:true,url:https://test.k6.io/,name:https://test.k6.io/,scenario:default,proto:HTTP/1.1,method:GET,tls_version:tls1.3 k6.http_req_tls_handshaking:3607.250707|ms|#expected_response:true,scenario:default,name:https://test.k6.io/,tls_version:tls1.3,url:https://test.k6.io/,method:GET,status:200,proto:HTTP/1.1 k6.http_req_sending:0.178607|ms|#url:https://test.k6.io/,proto:HTTP/1.1,expected_response:true,tls_version:tls1.3,method:GET,status:200,scenario:default,name:https://test.k6.io/ k6.http_req_waiting:145.692124|ms|#name:https://test.k6.io/,proto:HTTP/1.1,status:200,url:https://test.k6.io/,method:GET,tls_version:tls1.3,scenario:default,expected_response:true

This is my initial investigation; it may not be 100% correct though :)

kullanici0606 commented 1 year ago

Lowering the flush interval helps not because the packets are smaller (tcpdump shows similar packet sizes, i.e. around 1400 bytes), but because the packets are not sent in bursts when the interval is smaller.

When I compared network captures of the 1 s and 100 ms intervals, they show this (summarized):

1 second interval: 564 packets in 40 ms -> 14.1 packets/ms
100 ms interval: 23 packets in 5 ms -> 4.6 packets/ms

So a higher flush interval means more packets arrive at statsd_exporter in a shorter burst; if it cannot keep up, it drops packets.

Before going further, could you please tell me which direction to pursue?

The newer statsd (datadog-go) client has a feature called "ClientSideAggregation", which reduces the number of packets sent to statsd by aggregating them on the client side. However, k6's dependency policy explicitly states that updating the datadog-go dependency is not preferred, so I don't think going in that direction is feasible.
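
For context, this is roughly what the feature looks like in the newer client (a sketch assuming datadog-go v5's documented options; this is not how k6 currently wires its statsd output):

package main

import (
	"log"

	"github.com/DataDog/datadog-go/v5/statsd"
)

func main() {
	// Client-side aggregation folds repeated metrics (e.g. counter
	// increments with the same name and tags) together before they are
	// flushed, so far fewer UDP datagrams reach the server.
	client, err := statsd.New("127.0.0.1:8125",
		statsd.WithClientSideAggregation(),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	for i := 0; i < 1000; i++ {
		// Without aggregation each call can end up as its own statsd line.
		_ = client.Count("k6.http_reqs", 1, []string{"status:200"}, 1)
	}
}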

kullanici0606 commented 1 year ago

I missed the referenced pull request (https://github.com/alphagov/di-devplatform-performance/pull/122), which also lowers the flush interval.

I also tried changing the statsd_exporter logic to process UDP packets in a separate goroutine, which helped a lot (no packet drops occurred across several attempts):

https://github.com/kullanici0606/statsd_exporter/tree/enqueu_udp_packets

I think lowering the flush interval to 100ms, increasing the buffer size, and changing the statsd_exporter logic a little will be the easiest and best path to a solution.

kullanici0606 commented 11 months ago

The k6 version in the Dockerfile on the forum is v0.32.0; in v0.33.0 some tags were blocklisted/blocked by default so that the UDP packets are smaller:

https://github.com/grafana/k6/issues/2048

https://github.com/grafana/k6/blob/3655065227b09ee368c29fb3dde7650761e22f68/release%20notes/v0.33.0.md#breaking-changes

So upgrading k6 to v0.33.0 or newer will definitely help. To sum up, all of the following will reduce the probability of packet drops:

  1. upgrading k6 to v0.33.0 or newer (smaller UDP packets thanks to the tag blocklist)
  2. lowering the flush interval (e.g. to 100ms) and increasing the buffer size
  3. improving the UDP handling in statsd_exporter

@mstoykov Do you think this issue needs further investigation?

I tried implementing part of the client-side aggregation in k6 by decorating / wrapping the statsd client, but it seems that most of the metrics are trends / timing information, which cannot be aggregated easily. Counters are easy, but there is only one counter, so it only helped a little when I tried aggregating it (see the sketch after the link below).

https://github.com/grafana/k6/blob/3655065227b09ee368c29fb3dde7650761e22f68/metrics/builtin.go#L87-L95
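
A rough sketch of that decorator idea for the counter case (hypothetical code, not the actual k6 change; the statsdSender interface, key handling and flush cadence are simplified):

package aggregate

import (
	"sync"
	"time"
)

// statsdSender is a hypothetical stand-in for the subset of the statsd
// client that the wrapper needs.
type statsdSender interface {
	Count(name string, value int64, tags []string, rate float64) error
}

// AggregatingCounter folds many Count calls with the same key into a single
// statsd call per flush interval, reducing the number of lines sent.
type AggregatingCounter struct {
	mu     sync.Mutex
	counts map[string]int64
	client statsdSender
}

func New(client statsdSender, interval time.Duration) *AggregatingCounter {
	a := &AggregatingCounter{counts: make(map[string]int64), client: client}
	go func() {
		for range time.Tick(interval) {
			a.flush()
		}
	}()
	return a
}

// Add records an increment locally instead of sending it immediately.
func (a *AggregatingCounter) Add(key string, delta int64) {
	a.mu.Lock()
	a.counts[key] += delta
	a.mu.Unlock()
}

func (a *AggregatingCounter) flush() {
	a.mu.Lock()
	pending := a.counts
	a.counts = make(map[string]int64)
	a.mu.Unlock()
	for key, total := range pending {
		// Tags are folded into the key here for brevity; a real wrapper
		// would keep the tag set and pass it through.
		_ = a.client.Count(key, total, nil, 1)
	}
}

Trends / timings cannot be folded this way without losing the individual samples, which matches the observation above.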

mstoykov commented 11 months ago

@kullanici0606 This issue will likely be moved, along with all the other statsd issues, "soon" :tm: to https://github.com/LeonAdato/xk6-output-statsd as part of #2982.

I guess v0.33.0 helped a lot, but from what I remember when I last tested it, this will still happen with enough requests.

kullanici0606 commented 11 months ago

Yes, even with v0.33.0 the problem still occurs sometimes, so I also tried to make some improvements on the statsd_exporter side: https://github.com/prometheus/statsd_exporter/pull/511

@mstoykov should I stop working on this issue then?

mstoykov commented 11 months ago

should I stop working on this issue then?

I would recommend either waiting for the final move of issues or opening an issue with the new repo.

The k6 core team will in practice have no connection with this, and because of that I will not tell anyone what to do with the extension or how :)

wvikum commented 2 months ago

We have encountered a similar issue where there is a discrepancy of around 30,000 requests in Datadog metrics, which is quite frustrating. Do we have a proper solution or workaround for this issue?

(screenshot)

codebien commented 2 months ago

Hey @wvikum, in the past few weeks we released a new extension for OpenTelemetry: https://github.com/grafana/k6/issues/2557#issuecomment-2180165706.

We intend to merge it into the k6 core in the upcoming releases, and it will probably be the best solution for streaming metrics to a third-party metrics database.

Please check whether it works for you and your DataDog integration.

LeonAdato commented 1 week ago

Per @olegbespalov and @javaducky, this issue should probably be part of the StatsD project. Feel free to transfer it here:

https://github.com/LeonAdato/xk6-output-statsd

olegbespalov commented 1 week ago

Closing in favor of https://github.com/LeonAdato/xk6-output-statsd/issues/29