jgagnon1 opened this issue 4 years ago
I'm able to reproduce this bug fairly easily. I have a counter metric that increments with each request received, and firing requests at it repeatedly via point-and-click in Insomnia triggers the issue. I'm capturing and printing out UDP packets with socat - udp4-listen:8125,reuseaddr,fork
and there are plenty of cases like:
http.response:0|c|#host:fiver,service:flinflon-dvg,env:integration,instance:flinflon-dvg@fiver,resource_name:post_/partner/v3.0/cfar_offers,response_code:401,response_error_type:clienthttp.response:0|c|#host:fiver,service:flinflon-dvg,env:integration,instance:flinflon-dvg@fiver,resource_name:put_/partner/v3.0/sessions/{}/identify,response_code:204,response_error_type:none
http.response:0|c|#host:fiver,service:flinflon-dvg,env:integration,instance:flinflon-dvg@fiver,resource_name:post_/partner/v3.0/sessions,response_code:400,response_error_type:client
Note that the first line contains two metrics concatenated without a newline separator, while the third increment is preceded by a newline.
In our production environment with 1000 RPM, it's causing about half of the counter increments to be dropped, I assume because dogstatsd doesn't know what to do with the concatenated data in the UDP packet.
Furthermore, I can also reproduce this issue using the vanilla statsd reporter, which is perhaps not surprising since the two reporters are very similar. Example:
flinflon-dvg.fiver.executor_tasks_completed.akka_system.default.env.integration.name.akka_actor_internal-dispatcher.type.ForkJoinPool:0.0|cflinflon-dvg.fiver.http_response.env.integration.resource_name.post__partner_v3_0_sessions.response_code.400.response_error_type.client:0.0|c
flinflon-dvg.fiver.http_response.env.integration.........
Again, note that the first two counters are concatenated without a newline separator, while the third starts on a new line.
It's a bit easier to reproduce when kamon.metrics.tick-interval is set lower, like 1 second, because this causes more reports to be sent out with increments of 0. This lines up with @jgagnon1's observation.
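For what it's worth, here is a minimal sketch of how I shorten the interval programmatically for the reproduction. This assumes Kamon 2.x, where the reference configuration key is kamon.metric.tick-interval; the exact key and API may differ for other versions.

```scala
import com.typesafe.config.ConfigFactory
import kamon.Kamon

object ShortTickRepro extends App {
  // Sketch only (assumes Kamon 2.x): lower the tick interval so that snapshots,
  // including zero-valued counter increments, get flushed frequently, which
  // makes the concatenation easier to observe.
  val config = ConfigFactory
    .parseString("kamon.metric.tick-interval = 1 second")
    .withFallback(ConfigFactory.load())

  Kamon.init(config)

  // ... exercise the app / fire requests here so counters get reported ...
}
```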
I'm not familiar with the code in question, but I might take a look to see what I can do. @ivantopo might have suggestions?
I worked around this with my own naive PacketBuffer implementation, but it didn't actually help with my problem, so something else is going on in my setup at least.
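For context, the workaround was roughly along these lines (a simplified sketch, not the exact code): buffer measurements, join them with explicit newline separators, and flush a single datagram before the payload grows past a maximum size.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetSocketAddress}
import java.nio.charset.StandardCharsets

// Simplified sketch of the naive PacketBuffer I tried (not the exact code).
// Measurements are joined with '\n' and flushed as one UDP datagram once the
// next measurement would push the payload past maxPayloadBytes.
class NaivePacketBuffer(maxPayloadBytes: Int, remote: InetSocketAddress) {
  private val socket = new DatagramSocket()
  private val buffer = new StringBuilder()

  def add(measurement: String): Unit = synchronized {
    if (buffer.nonEmpty && buffer.length + measurement.length + 1 > maxPayloadBytes)
      flush()
    if (buffer.nonEmpty) buffer.append('\n')
    buffer.append(measurement)
  }

  def flush(): Unit = synchronized {
    if (buffer.nonEmpty) {
      val payload = buffer.toString.getBytes(StandardCharsets.UTF_8)
      socket.send(new DatagramPacket(payload, payload.length, remote))
      buffer.clear()
    }
  }
}
```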
I also realized that my socat testing method is not definitive, since you can't see packet boundaries. I'm going to follow up with some Wireshark investigation at some point.
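In the meantime, a quick way to see actual datagram boundaries (which socat's merged output hides) is to read and print one packet at a time, e.g. a throwaway sketch like:

```scala
import java.net.{DatagramPacket, DatagramSocket}
import java.nio.charset.StandardCharsets

// Throwaway sketch: listen on the dogstatsd port and print each UDP datagram
// separately, so packet boundaries are visible.
object UdpDump extends App {
  val socket = new DatagramSocket(8125)
  val buf = new Array[Byte](65535)

  while (true) {
    val packet = new DatagramPacket(buf, buf.length)
    socket.receive(packet)
    val payload = new String(packet.getData, 0, packet.getLength, StandardCharsets.UTF_8)
    println(s"---- packet (${packet.getLength} bytes) ----")
    println(payload)
  }
}
```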
I noticed that statsd at times concatenates received data before processing it, i.e. if I send a sequence of separate metric lines, statsd may process them as a single concatenated line.
It looks like there's a correlation between how many lines are sent and how often the above behavior occurs (the faster data is sent, the more often the concatenation happens).
I think there might be a race condition at this line: https://github.com/kamon-io/Kamon/blob/master/reporters/kamon-datadog/src/main/scala/kamon/datadog/DatadogAgentReporter.scala#L194
But I haven't dug in to see whether this method can be called by multiple threads.
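If it can be called concurrently, unsynchronized appends to shared mutable state would explain this kind of interleaving. A contrived sketch (not the actual reporter code) of the failure mode I have in mind:

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Contrived sketch (not the actual reporter code): appending a measurement and
// its newline separator as two non-atomic steps on a shared StringBuilder means
// another thread can append in between, so two metrics can end up run together
// on one line while the separator lands somewhere else.
object RaceSketch extends App {
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  val shared = new StringBuilder()

  def report(metric: String): Unit = {
    shared.append(metric) // another thread may append here, before the '\n'
    shared.append('\n')
  }

  val writes = (1 to 1000).map(i => Future(report(s"http.response:$i|c")))
  Await.ready(Future.sequence(writes), Duration.Inf)

  println(shared.toString)
}
```

Wrapping the measurement-plus-separator append in a single synchronized block (or building each datagram in a per-flush local buffer) would avoid the interleaving, if that does turn out to be the cause.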