graphite-project / graphite-web

A highly scalable real-time graphing system
http://graphite.readthedocs.org/
Apache License 2.0

Spike erosion issues with default renderer #1770

Closed: nickstenning closed this issue 7 years ago

nickstenning commented 7 years ago

Hi there, I'm trying to work out what my options are for dealing with what seems to me to be a spike erosion (AKA peak erasure) issue with the default Graphite renderer. Here are two plots displayed on two different timescales. First, the past hour:

.../render?target=aliasByNode(stats.app-1.timers.gunicorn.request.duration.upper, 1, -1)&from=-1h

[plot: past hour]

and second, the past six hours:

.../render?target=aliasByNode(stats.app-1.timers.gunicorn.request.duration.upper, 1, -1)&from=-6h

[plot: past six hours]

I've taken the liberty of annotating the plots to show the issue I'm struggling with. Namely, that the spike visible in the 1h plot at about 10:00, with a value of ~570, is not visible on the 6h plot at all. Indeed, the scale of the 6h plot does not reflect the max/min values of the data actually stored by Graphite.

I've confirmed that this isn't an aggregation issue. Switching the render output to format=json, I can find the spike in both the from=-1h and from=-6h outputs:

...
[569.227, 1481623200],
...

For the sake of completeness, however, here are the relevant extracts from storage-schemas.conf and storage-aggregation.conf:

storage-schemas.conf

[default]
pattern = .*
retentions = 10s:8d,1m:31d,10m:1y,1h:5y
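# i.e. 10-second points kept for 8 days, then 1-minute for 31 days, 10-minute for 1 year, 1-hour for 5 years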

storage-aggregation.conf

[min]
pattern = \.lower$
xFilesFactor = 0.1
aggregationMethod = min

[max]
pattern = \.upper(_\d+)?$
xFilesFactor = 0.1
aggregationMethod = max

...

I'm guessing (and it is just a guess) that this is a side-effect of sampling (and possibly averaging) done by the renderer when the number of datapoints for a time range is too high to fit in a plot of a given size. This hypothesis seems to be supported by the observation that providing larger width parameters results in a different vertical scale. The plot displays more variance within the data as the width increases.
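
If that guess is right, the effect is easy to reproduce outside Graphite. Here is a minimal Python sketch (hypothetical code, not graphite-web's actual renderer; the four-points-per-pixel ratio is just an assumption for a roughly 600px-wide 6h plot of 10-second data):

def consolidate(values, points_per_pixel, func):
    # one consolidated value per pixel column (hypothetical helper)
    out = []
    for i in range(0, len(values), points_per_pixel):
        bucket = [v for v in values[i:i + points_per_pixel] if v is not None]
        out.append(func(bucket) if bucket else None)
    return out

def average(bucket):
    return sum(bucket) / len(bucket)

series = [100.0] * 2160            # 6 hours of 10-second points
series[1000] = 570.0               # a lone spike like the one in the 1h plot

print(max(consolidate(series, 4, average)))   # 217.5 -> the spike is eroded
print(max(consolidate(series, 4, max)))       # 570.0 -> the spike survives

The exact points-per-pixel ratio depends on the image width, but the averaging effect is the same.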

I don't think the default behaviour is inappropriate in general, but for certain metrics (such as those representing maximum or minimum values), it would be nice if there were a way to ensure that the sampling didn't "erase" peaks in the underlying data.

obfuscurity commented 7 years ago

Yes, my assumption (upon reading your tweet) is that this is due to the condensing of data necessary to fit a fixed number of points within the rendered image, and not "exactly the problem" that you alluded to when dealing with percentiles.

My suggestion would be to either widen your graph to a point where consolidation is unnecessary or use the consolidateBy function with the max or min arguments to preserve your peaks.
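
For example, reusing the metric path from the render calls above, something like this should keep the peak visible on the 6h plot:

.../render?target=consolidateBy(stats.app-1.timers.gunicorn.request.duration.upper, 'max')&from=-6h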

obfuscurity commented 7 years ago

If you're curious about the consolidation code, you can find the relevant bits in HEAD here.

nickstenning commented 7 years ago

Thank you for the link to consolidateBy -- that looks like exactly what I was looking for.

And btw, sorry if my tweets annoyed you. I definitely wasn't trying to point the finger here -- I was just confused about what Graphite's solution was to a real and general problem with plotting lots of points efficiently, and I wasn't able to find consolidateBy in the documentation. FWIW I'm pretty sure it is exactly the problem discussed by Heinrich in the post I linked -- the percentile aggregation in Circonus seems to solve the same problem as consolidateBy.

obfuscurity commented 7 years ago

Oh, I wasn't annoyed so much; I just get frustrated with conjecture... especially with a lack of details. Anyways, I'm glad that we narrowed down the problem and gave you a suitable fix. 👍

dgryski commented 7 years ago

I have plans to move carbonapi over to using https://github.com/dgryski/go-lttb as the default aggregation method.
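
For context, LTTB (Largest-Triangle-Three-Buckets) is a downsampling algorithm designed to keep visually significant points such as spikes. A rough Python sketch of the idea (an illustration only, not the go-lttb code itself):

def lttb(points, threshold):
    # points: list of (x, y) tuples in x order; threshold: points to keep (>= 3)
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)
    sampled = [points[0]]
    bucket_size = (n - 2) / (threshold - 2)
    for i in range(threshold - 2):
        start = int(i * bucket_size) + 1
        end = int((i + 1) * bucket_size) + 1
        # average of the *next* bucket (or the final point for the last bucket)
        nxt = points[end:min(int((i + 2) * bucket_size) + 1, n)] or [points[-1]]
        avg_x = sum(p[0] for p in nxt) / len(nxt)
        avg_y = sum(p[1] for p in nxt) / len(nxt)
        ax, ay = sampled[-1]
        # keep the point forming the largest triangle with the previously kept
        # point and the next bucket's average
        sampled.append(max(
            points[start:end],
            key=lambda p: abs((ax - avg_x) * (p[1] - ay) - (ax - p[0]) * (avg_y - ay)),
        ))
    sampled.append(points[-1])
    return sampled

On the series from the earlier sketch, something like lttb(list(enumerate(series)), 600) would retain the 570 spike, since that point forms by far the largest triangle in its bucket.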

Dieterbe commented 7 years ago

Note that consolidateBy only applies to runtime consolidation. If you're loading data from historical archives that have already been aggregated, you depend on what the store gives you. Whisper in particular supports only one aggregation method at a time, on a per-series basis, which may not be the one you're asking for with consolidateBy. In fact, you may get data that is off when you combine two different aggregation functions (one applied by Whisper at rollup time, one applied via consolidateBy at runtime). See http://dieter.plaetinck.be/post/25-graphite-grafana-statsd-gotchas/#runtime.consolidation
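
For example, with the retention posted above, a query like from=-60d falls past the 1-minute archive and should be served from the 10-minute one, which (per the storage-aggregation.conf above) is rolled up with max for .upper; the renderer will still average those points down to the image width unless the runtime function is set to match, e.g.:

.../render?target=consolidateBy(stats.app-1.timers.gunicorn.request.duration.upper, 'max')&from=-60d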

obfuscurity commented 7 years ago

@Dieterbe As @nickstenning made clear early on, he was experiencing datapoint consolidation within the same archive, not rollups.