kamon-io / Kamon

Distributed Tracing, Metrics and Context Propagation for applications running on the JVM
https://kamon.io

Datadog's distribution integration? #612

Open fmonniot opened 4 years ago

fmonniot commented 4 years ago

Hello there,

First, thanks for creating Kamon. It's refreshing to have a tracing/metrics system that works out of the box for the Scala ecosystem :)

I am using the Datadog reporter at my company, and I was wondering if you folks had considered adding support for the distribution metric type to it. Our use case is computing p99 latencies while staying host-agnostic.

I took a quick look at the implementation of the Datadog agent reporter and it seems it already reports all values as-is to the agent. That being said, I'm not certain about the statistical properties of combining HdrHistogram with Datadog distributions.

Assuming it makes sense statistically, would that be something you would consider accepting as a contribution?

rafaelmagu commented 4 years ago

We too would like to see this implemented. Kamon metrics account for over 50% of our custom metrics usage in Datadog, and switching from pre-computed aggregations to distribution metrics would shed over 80% of that cost.

fmonniot commented 4 years ago

The main problem right now is that Datadog does not offer a way to feed it distributions of data directly. The distribution data type can only be ingested in two ways: sending raw data points, or exposing metrics in the OpenMetrics format and enabling a flag in the agent.
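
For reference, the "raw data points" path is just the regular DogStatsD wire protocol with the distribution metric type (`d`). A minimal sketch of sending a single value to a locally running agent; the metric name, tag, and agent address are placeholders for the example:

```scala
import java.net.{DatagramPacket, DatagramSocket, InetSocketAddress}
import java.nio.charset.StandardCharsets

object SendOneDistributionPoint extends App {
  // DogStatsD datagram: <name>:<value>|d|#<tags>, where "d" is the distribution metric type.
  val payload = "http.request.duration:42|d|#service:my-service".getBytes(StandardCharsets.UTF_8)
  val socket  = new DatagramSocket()
  socket.send(new DatagramPacket(payload, payload.length, new InetSocketAddress("127.0.0.1", 8125)))
  socket.close()
}
```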

Kamon reporters don't have raw data, so the first solution wouldn't work, and the OpenMetrics format (basically Prometheus) forces you to choose the distribution buckets up front. I have contacted Datadog support and they opened a feature request, but I don't know how long that will take.

As a temporary solution, I'm considering forking the Prometheus reporter and making the reported buckets the same as the ones in the HdrHistograms. I'm not completely sure about the performance implications of this approach, though…

Note: there was no response on this thread from the maintainers, but there is one on Gitter.

ivantopo commented 4 years ago

Hey folks!

Something just caught my attention here:

Kamon reporters don't have raw data so the first solution wouldn't work

I think there is a misunderstanding here. The Distribution instances that the reporters get have the exact same data that is recorded in the HDR Histograms on the instrumentation side, with the same precision (1% by default). The only differences between the HDR Histogram internal storage and Distributions are:

If it were necessary, a reporter could iterate over all buckets in a distribution and send all the individual measurements to the Datadog agent, which would end up being the same as if every single measurement had been sent directly to the agent when it was recorded (as all StatsD-like tools do).
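
As an illustration of that idea, here is a minimal sketch. It assumes a `kamon.metric.Distribution` whose buckets expose a `value` and a `frequency`, and a local agent listening for DogStatsD datagrams on 127.0.0.1:8125; the metric name and tag string are supplied by the caller:

```scala
import java.net.{DatagramPacket, DatagramSocket, InetSocketAddress}
import java.nio.charset.StandardCharsets

import kamon.metric.Distribution

object NaiveDistributionSender {

  // Naive illustration: replay every recorded measurement as an individual
  // DogStatsD "distribution" datagram (metric type "d"). A bucket holding a
  // value with frequency N produces N datagrams, i.e. one per measurement.
  def sendAsDatadogDistribution(metricName: String, dist: Distribution, tags: String = ""): Unit = {
    val socket = new DatagramSocket()
    val agent  = new InetSocketAddress("127.0.0.1", 8125) // local Datadog agent (DogStatsD port)

    try {
      dist.buckets.foreach { bucket =>
        val datagram  = s"$metricName:${bucket.value}|d$tags".getBytes(StandardCharsets.UTF_8)
        var remaining = bucket.frequency
        while (remaining > 0) {
          socket.send(new DatagramPacket(datagram, datagram.length, agent))
          remaining -= 1
        }
      }
    } finally socket.close()
  }
}
```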

Regarding the OpenMetrics format, isn't that basically the Prometheus format?

I am a bit skeptical about sending tons of individual measurements via UDP for two reasons. First, UDP is cheap, but it is not free: throwing a hose of measurements through UDP has noticeable effects on the network stack. That usually ends up causing my second reason for skepticism: data gets dropped, for sure. Even though it is possible to get a distribution on the other side, its accuracy will always be affected by the reliability of the network. This usually isn't a problem when sending a few hundred measurements per minute, but when things get busy it is more likely for packets to be dropped and for distributions to be less accurate.

All that being said, what are your thoughts on exposing Distributions via UDP vs. using the OpenMetrics format? And as Datadog users, what are your opinions on the integration?

I'm all in for helping out on this; I just lack the experience with the Datadog side to be able to push for it myself :confused:

fmonniot commented 4 years ago

Indeed, I completely misunderstood what is inside an HdrHistogram 🤦‍♂ That changes everything, because it means we could use the current Distribution API. There are still two things that might limit us here:

  1. Does the distribution contain all the records since application startup, or only those since the last reporting time? If it only contains the last period's data, then it is definitely possible to implement this naively; otherwise we would have to do some computation to trim down the results.
  2. This relates to your second point regarding UDP. I agree that sending all raw data points might work for a small metric count, but won't past a certain number of data points (a rough sizing sketch follows this list). That being said, I have no idea at what point packet drops start to really be a problem on a local UDP connection (the agent is running locally), or where you stand on offering options that might affect people in some cases.
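
To put the volume concern from point 2 in numbers: with the naive approach, the datagram count per flush is simply the sum of the bucket frequencies. A tiny sketch, under the same Distribution assumption as above:

```scala
import kamon.metric.Distribution

object DatagramVolume {
  // Rough sizing for the naive UDP approach: one datagram per recorded
  // measurement, so the per-flush datagram count is the total number of
  // recordings captured by the distribution in that reporting period.
  def datagramsPerFlush(dist: Distribution): Long =
    dist.buckets.map(_.frequency).sum
}
```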

Given the recent marketing Datadog has done around its DDSketch format, I wouldn't be entirely surprised if they eventually used it as a way to send distribution metrics in batches. That would solve most of our problems: the reporter would just have to create a sketch with all the data points and send that sketch to the agent/API. But that's the future :)

For now, I'm thinking the UDP strategy might be OK for some users. How would you feel about having so-called experimental flags in the reporter? Or maybe a fork of the reporter would be a better alternative until Datadog offers a better way (which might never come; I don't know how many people are asking for it)?
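
If the experimental-flag route were taken, it could be as simple as gating the new code path behind a boolean setting read from the reporter's configuration. A hypothetical sketch; the `kamon.datadog.experimental-raw-distributions` key is made up for illustration:

```scala
import com.typesafe.config.Config

object ExperimentalSettings {
  // Hypothetical opt-in switch: only replay raw measurements over UDP when
  // the user explicitly enables the experimental behaviour in their config.
  def rawDistributionsEnabled(config: Config): Boolean =
    config.getBoolean("kamon.datadog.experimental-raw-distributions")
}
```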

I'll give the UDP strategy a try next week (and see how it would translate to the HTTP API), and open a draft PR with what I have.

fmonniot commented 4 years ago

Forgot to chime in on the Prometheus format: I'm not entirely certain how it would work with my new understanding of what's inside an HdrHistogram… A naive way would be to create a bucket for each entry in the histogram, but that could create huge reports.
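
As a rough illustration of why that blows up, here is a sketch that renders one cumulative Prometheus `le` bucket per distinct value in a distribution (same assumed Distribution shape as above; the TYPE, `_sum`, and `_count` lines a real exporter would also emit are omitted):

```scala
import kamon.metric.Distribution

object PerEntryPrometheusBuckets {
  // Naive exposition: one cumulative `le` bucket per distinct value stored in
  // the distribution. With 1% precision HDR storage this can easily turn into
  // hundreds or thousands of bucket lines per metric/tag combination.
  def render(metricName: String, dist: Distribution): String = {
    val out        = new StringBuilder
    var cumulative = 0L
    dist.buckets.foreach { bucket =>
      cumulative += bucket.frequency
      out.append(s"""${metricName}_bucket{le="${bucket.value}"} $cumulative""").append('\n')
    }
    out.append(s"""${metricName}_bucket{le="+Inf"} $cumulative""").append('\n')
    out.toString
  }
}
```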

fmonniot commented 4 years ago

I received a response from the Datadog support team, and apparently there is no immediate plan to support reporting distributions as anything other than raw measurements…

As for the UDP part, it does stress the network stack and isn't something I am comfortable pushing to Kamon itself without testing it at scale. You can find a naive implementation in my fork (https://github.com/fmonniot/kamon-datadog/tree/distrib-udp-packets) if you're interested.