kamon-io / Kamon

Distributed Tracing, Metrics and Context Propagation for applications running on the JVM
https://kamon.io
Other
1.41k stars 328 forks source link

Datadog - Multiple events are sometimes concatenated (missing newline) #779

Open jgagnon1 opened 4 years ago

jgagnon1 commented 4 years ago

I noticed that statsd at times concatenates data received before processing it. I.e., if I send sequence of data like

example.filesInFolder:1|c
example.listFolderTR:4|ms
example.httpSendTR:8|ms
example.httpSend:1|c

statsd may process single concatenated line

example.filesInFolder:1|cexample.listFolderTR:4|msexample.httpSendTR:8|msexample.httpSend:1|c

Looks like there's correlation between how many lines are sent and frequency of the above behavior (the faster data are sent, the more often concatenation happens).

I think there might be a race condition possible at this line : https://github.com/kamon-io/Kamon/blob/master/reporters/kamon-datadog/src/main/scala/kamon/datadog/DatadogAgentReporter.scala#L194

But I haven't digged down to see if this method can be called by multiple threads.

dvgica commented 2 years ago

I'm able to reproduce this bug fairly easily. I have a counter metric that increments with each request received, and firing requests at it repeatedly via point-and-click in Insomnia triggers the issue. I'm capturing and printing out UDP packets with socat - udp4-listen:8125,reuseaddr,fork and there are plenty of cases like:

http.response:0|c|#host:fiver,service:flinflon-dvg,env:integration,instance:flinflon-dvg@fiver,resource_name:post_/partner/v3.0/cfar_offers,response_code:401,response_error_type:clienthttp.response:0|c|#host:fiver,service:flinflon-dvg,env:integration,instance:flinflon-dvg@fiver,resource_name:put_/partner/v3.0/sessions/{}/identify,response_code:204,response_error_type:none
http.response:0|c|#host:fiver,service:flinflon-dvg,env:integration,instance:flinflon-dvg@fiver,resource_name:post_/partner/v3.0/sessions,response_code:400,response_error_type:client

Note the first line has two metrics concatenated without a newline, while the third increment does have a newline.

In our production environment with 1000 RPM, it's causing about half of the counter increments to be dropped, I assume because dogstatsd doesn't know what to do with the concatenated data in the UDP packet.

Furthermore, I can also reproduce this issue using the vanilla statsd reporter. This is maybe not surprising since the two reporters are very similar. Example:

flinflon-dvg.fiver.executor_tasks_completed.akka_system.default.env.integration.name.akka_actor_internal-dispatcher.type.ForkJoinPool:0.0|cflinflon-dvg.fiver.http_response.env.integration.resource_name.post__partner_v3_0_sessions.response_code.400.response_error_type.client:0.0|c
flinflon-dvg.fiver.http_response.env.integration.........

Again, note the first two counters are concatenated without a newline separator, while the third starts with a newline.

It's a bit easier to reproduce when the kamon.metrics.tick-interval is set lower, like 1.second, because this causes more reports to be sent out with increments of 0. This lines up with @jgagnon1 's observation.

I'm not familiar with the code in question, but I might take a look to see what I can do. @ivantopo might have suggestions?

dvgica commented 2 years ago

I worked around this with my own naive PacketBuffer implementation, and it didn't actually help with my problem. So, something else is going on in my setup at least.

I also realized that my socat testing method is not definitive since you can't see packet boundaries. I'm going to follow up with some Wireshark investigation at some point.