magro opened this issue 9 years ago
Hmm... This raises one of my pet peeves about statsd and graphite: the latency distribution data they currently forward and store is useless for viewing latency distributions, and leads to nonsensical charts and nonsensical human interpretation. You can see a bit of my ranting on the subject here: http://latencytipoftheday.blogspot.com/2014/06/latencytipoftheday-q-whats-wrong-with_21.html, and in my various "How NOT to measure latency" recorded talks, but I should probably do a specific blog entry about the statsd and graphite issues.
Reporting into statsd itself would probably be useless, as statsd does not take percentile inputs. However, reporting into graphite "as if jHiccup were statsd" is somewhat tempting: jHiccup could periodically report (e.g. every 5 or 10 seconds) on the same percentile levels that statsd reports on, making it possible for graphite to then store and view that information. The problem there is that periodic percentile reporting over short intervals tends to be useless in itself. Since no accumulated percentile math is possible across the increments, there is no way to report actual percentiles over reasonably interesting periods and for reasonably interesting percentiles (e.g. the 99%'ile for the past hour) based on short (e.g. 10 second) interval percentile reports.
The "right" thing would be for statsd and graphite to both be able to deal with (and store) interval histograms in their lossless counts form (as jHiccup does in it's .hlog file), such that mathematically valid aggregation of multiple intervals would be possible. [HdrHistigram and it's compressed encoding format would obviously be useful for this]. This would enable graphite to report on arbitrary percentile levels over arbitrary (and sensible) reporting intervals (e.g. the 99%'lie and/or 99.9%'lie over each hour). Basically, with that sort of information being propagated and stored, graphite would be able to produce jHiccup-like distribution charts. This would be useful for much more than jHiccup, as it would apply to any latency information logged into graphite. Without it (and as things stand right now) the percentile reporting on latencies currently logged into graphite is basically non-sensical/useless.
Tackling a new data type for statsd and graphite is a much bigger task than jHiccup. I very well may take it on when I feel like I don't get enough punishment for my current sins. But until I (or someone else) makes it possible to report lossless, count-based histogram data into those tools, I don't expect any useful information from jHiccup to make it in there...
Hey Gil, hope you are doing fine! I just wanted to share a few words in defense of StatsD...
During the last few days I invested some time in improving our StatsD/Graphite integration and dashboards, and I got into many of the little details of how those dashboards were lying to all of us, since we were using these tools without knowing exactly how they work and what output they produce in various situations.
StatsD is the part of the stack that I think is more useful: first, because it lets you talk to different metric backends at once, simplifying integrations; and second, and more important, because up to StatsD (aside from whatever could be lost when using UDP) the data remains lossless, since StatsD keeps all the latency measurements in a big list of numbers until it flushes to the backends. At that point StatsD has all the information, and the reason the data gets summarized is mainly that the backends don't support that potentially huge amount of data.
I do think that StatsD could definitely benefit from using a port of the HDR Histogram internally instead of keeping that huge list of numbers in memory, and of course I would like it to get a protocol upgrade to accept histogram data directly, but that bit you can currently work around by playing with the sampling value, as we do in Kamon.
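Just to illustrate the sampling workaround mentioned above, here is a rough sketch (not Kamon's actual code; the metric name and rate are made up) of the standard StatsD timer line format with a sample rate, where only a sampled fraction of the measurements is shipped over UDP and the `@<rate>` suffix tells StatsD how to scale the counts back up:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ThreadLocalRandom;

public class SampledStatsdTimer {
    private static final double SAMPLE_RATE = 0.1; // ship roughly 10% of the measurements

    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            InetAddress statsd = InetAddress.getByName("localhost");
            long latencyMs = 42; // a measured latency, for illustration only
            if (ThreadLocalRandom.current().nextDouble() < SAMPLE_RATE) {
                // Standard StatsD timer line: <name>:<value>|ms|@<sample rate>
                String line = "myapp.request.latency:" + latencyMs + "|ms|@" + SAMPLE_RATE;
                byte[] payload = line.getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(payload, payload.length, statsd, 8125));
            }
        }
    }
}
```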
Graphite, on the other side, is where everything gets messy. It has tons of functions and tools for working with data (nice), but the data it allows you to store is pretty useless input to those functions (bad). It gives you the chance to configure retention policies for several intervals of time (nice), but once it aggregates the data, everything other than counts, sums, min and max becomes useless; so, if you really care about your data, you will only have one retention value. We use 10 second periods for 7 days by default, with no aggregation (really bad). Even if you try to keep the stored data as clean as possible by not aggregating it, if you create a graph over a period large enough that the individual 10 second points are too many to show, Graphite will consolidate data points together, and that can only be done with sum, min, max or average (average by default)... so our precious percentiles can be distorted again (really, really bad). All these bad points, of course, add to the many shortcomings you already described here and in many other materials.
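For reference, the single-retention setup described above maps to Carbon/Whisper configuration roughly like the sketch below (the section names and pattern are made up for illustration). Note that with a single retention the storage-aggregation method never actually runs, but Graphite's render-time consolidation (the consolidateBy() behavior, averaging by default) still applies whenever a graph has more points than it can draw:

```
# storage-schemas.conf: a single retention, 10 second steps kept for 7 days
[kamon_timers]
pattern = ^kamon\.
retentions = 10s:7d

# storage-aggregation.conf: only relevant if a coarser retention is ever added
[kamon_timers]
pattern = ^kamon\.
xFilesFactor = 0.0
aggregationMethod = max
```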
In my opinion, the source of all evil here is Graphite, not StatsD, but Graphite is everywhere. The project is kind of active, but I hardly believe they will ever tackle these issues, and probably our best hope is to embrace newer tools that accept raw data (such as InfluxDB) or to create our own tools that play well with histograms from top to bottom.
Ivan,
While Graphite's current state is certainly "in the way", I think statsD has two things missing that would need to be fixed for anything upstream from it to be able to work well, and to me this means that statsD's current state is just as much "in the way": first, it only forwards summarized percentiles rather than lossless histogram data that could still be aggregated later; and second, it cannot realistically take in raw, per-measurement data at high rates.
jHiccup is a classic example of that second problem: it produces roughly 1000 latency data points per second. It is impractical to have each jHiccup agent send every data point to statsD, but sampling is not an option for a tool like jHiccup...
Gil,
I totally agree with your second point, but still would like to comment on the first: the reason why StatsD sends summarized percentiles is because that is the best it can do when Graphite is upstream. Graphite won't store more than 1 data point per interval; if you report more than one value per interval for a given metric, only the last value will prevail. E.g. if you have the finest storage scheme of 1 second in Graphite and you were to report all the jHiccup latencies every second directly to Graphite, you would lose all your latency measurements except for the last reported one.
I always thought that the main reason for putting StatsD in front of Graphite was that calculating percentiles or any kind of summaries over many metrics and/or large periods of time in Graphite was too expensive (apparently it is, anyway), and that people would prefer to set the percentiles they want to monitor up front and have StatsD accumulate, summarize and report those single data points. But after realizing that only 1 data point per metric, per interval, is stored in Graphite, StatsD started to make a lot more sense to me. Unless you have all of your metrics reporting tools flushing at the same interval as (or a bigger one than) the finest storage scheme in Graphite, and pray that no unlucky timing makes 2 packets arrive at Graphite within the same interval, you face the chance of losing (overwriting, to be more accurate) data. Or, you could use StatsD, and if something arrived at StatsD you will know for sure that it will be in Graphite, given that the flush interval and the finest storage scheme match.
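To make the overwrite concrete, here is a contrived sketch (a made-up metric name, assuming a local Carbon daemon on the standard plaintext port 2003 and a 1 second finest retention). Both lines are accepted, but Whisper keeps only one value per time slot, so only the second value survives:

```java
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class GraphiteOverwrite {
    public static void main(String[] args) throws Exception {
        long now = System.currentTimeMillis() / 1000L;
        // Carbon plaintext protocol: "<metric path> <value> <timestamp>\n"
        String lines =
                "jhiccup.latency 5 " + now + "\n"      // first value in the interval
              + "jhiccup.latency 900 " + now + "\n";   // second value silently replaces it
        try (Socket socket = new Socket("localhost", 2003);
             OutputStream out = socket.getOutputStream()) {
            out.write(lines.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```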
Then, of course, all the nonsense aggregation going on in Graphite kicks in and we are back to hell, as you described. I just wanted to make the point that StatsD does the best it can, given what Graphite can accept upstream.
Finally, I won't comment much on the histogram/buckets functionality provided by StatsD since, IMHO, I don't find it very useful... I'm always referring to the plain data stored with timer metrics in StatsD.
Ivan,
I see the statsD/Graphite issue as a chicken and egg problem. You can certainly say that statsD reports percentile summaries because Graphite will only keep one data point (presumably per percentile level). But you can also say that Graphite only deals with this data because nobody (statsD included) offers up anything better.
The way I look at resolving the problem is for the "one data point" that Graphite keeps per interval to be an HdrHistogram of the values recorded during that interval. That's a single, compact value that can be kept per reporting point, while retaining the ability to do useful math across multiple such data points (1 per interval). I see an HdrHistogram as a basic value type for this purpose...
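A minimal sketch of what that value type could look like with the existing org.HdrHistogram Java library (the recorded values and the Base64 wrapping are just for illustration; jHiccup's .hlog format stores a similar compressed, Base64-encoded histogram per interval): each interval becomes one compact blob that a store could keep as "the one data point", and blobs can later be decoded and added together for valid percentiles across intervals.

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import org.HdrHistogram.Histogram;

public class HistogramAsValue {
    public static void main(String[] args) throws Exception {
        Histogram interval = new Histogram(3600_000_000_000L, 3); // up to 1 hour in ns, 3 decimal digits
        interval.recordValue(1_200_000);   // 1.2 ms, in nanoseconds
        interval.recordValue(47_000_000);  // 47 ms
        interval.recordValue(950_000_000); // a 950 ms hiccup

        // Encode the whole interval into a single compact value:
        ByteBuffer buffer = ByteBuffer.allocate(interval.getNeededByteBufferCapacity());
        int length = interval.encodeIntoCompressedByteBuffer(buffer);
        buffer.rewind();
        byte[] blob = new byte[length];
        buffer.get(blob);
        String storedDataPoint = Base64.getEncoder().encodeToString(blob);
        System.out.println("stored value is " + storedDataPoint.length() + " chars");

        // Later: decode the stored value, merge it with other intervals, query percentiles:
        Histogram decoded = Histogram.decodeFromCompressedByteBuffer(
                ByteBuffer.wrap(Base64.getDecoder().decode(storedDataPoint)), 0);
        Histogram total = new Histogram(3600_000_000_000L, 3);
        total.add(decoded);
        System.out.println("99%'ile (ns): " + total.getValueAtPercentile(99.0));
    }
}
```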
I just wish I had the time to go change Graphite...
Having an HdrHistogram as the unit of storage would be the best possible choice, I totally agree on that!
I'm glad that this issue is being raised more often lately, or at least that's what I am perceiving. Making people aware of the problems they currently have and naively ignore is a good start toward shaping the monitoring tools of tomorrow.
If you ever get the chance to change Graphite, or to persuade its maintainers to make the move, rest assured that I will promote and use it where appropriate :).
Thanks for the discussion!
While searching for alternatives I found https://github.com/despegar/khronus, which uses hdrhistogram.
What do you think about it?
We're going to use statsd+graphite for monitoring, and I'd like to see data collected by jhiccup there as well. Would it be possible to send data to statsd/graphite instead of logging it to a file?