maxDatapoints implicitly applied by metric-tank, but gap-filled by graphite/graphite-raintank

Dieterbe commented 8 years ago

With env-load, I created a bunch of endpoints, with all monitors set to 10s. When you request a graph in Grafana and set maxDataPoints explicitly to 800, you get metric-tank output like:

2015/12/11 05:03:56 [D] http.Get(): INCOMING REQ. targets: "115.adda3282516de5ce6727a4ce51b63512", maxDataPoints: ""
2015/12/11 05:03:56 [D] ===================================
2015/12/11 05:03:56 [D] HTTP Get()          115.adda3282516de5ce6727a4ce51b63512 1449723836 - 1449810236 (10 05:03:56 - 11 05:03:56) span:86399s. 80 <= points <= 800. AverageConsolidator
2015/12/11 05:03:56 [D] getTarget()         115.adda3282516de5ce6727a4ce51b63512 1449723836 - 1449810236 (10 05:03:56 - 11 05:03:56) span:86399s. 80 <= points <= 800. AverageConsolidator
2015/12/11 05:03:56 [D] type   interval   points
2015/12/11 05:03:56 [D] raw    10         8640   <-- chosen
2015/12/11 05:03:56 [D] runtimeConsolidation: true

i.e. it consolidates the 8640 raw points down to between 80 and 800 points by returning 1 point every 110s. This can be verified in the browser's network tab.
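The arithmetic behind the 110s step can be sketched as follows. This is only an illustration of the numbers in the log above, assuming the consolidation factor is the smallest whole number of raw points per output point that keeps the result at or under maxDataPoints; `consolidation_plan` is a hypothetical helper, not metric-tank's actual code.

```python
import math


def consolidation_plan(num_points, raw_interval, max_points):
    """Return (raw points per output point, output step in seconds, output point count).

    Assumption: the factor is the smallest integer that brings the
    series down to at most max_points.
    """
    if num_points <= max_points:
        # Already small enough, no runtime consolidation needed.
        return 1, raw_interval, num_points
    agg_num = math.ceil(num_points / max_points)
    return agg_num, agg_num * raw_interval, math.ceil(num_points / agg_num)


# Values from the log: 8640 raw points at 10s, at most 800 points wanted.
print(consolidation_plan(8640, 10, 800))  # (11, 110, 786)
```

With 11 raw points per output point, the step becomes 11 * 10s = 110s, which matches what the browser shows.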

However, when graphite-watcher requests 24h worth of data, it seems to get responses like:

...
:1449809048
:1449809058
0.0:1449809068
:1449809078
:1449809088
:1449809098
:1449809108
:1449809118
:1449809128
:1449809138
:1449809148
:1449809158
:1449809168
0.0:1449809178
:1449809188
:1449809198
:1449809208
:1449809218
:1449809228
:1449809238
:1449809248
:1449809258
:1449809268
:1449809278
0.0:1449809288
:1449809298
:1449809308
:1449809318
:1449809328
:1449809338
:1449809348
:1449809358
:1449809368
:1449809378
:1449809388
0.0:1449809398
:1449809408
:1449809418

i.e. a point every 110s, but also a null every 10s.

I think what happens is that maxDataPoints is not explicit in this case, so metric-tank just defaults to 800 as a "sensible default" and consolidates, but graphite is not aware of this and fills in the missing points.
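A minimal sketch of how this would produce the output above, assuming the gap filler lays the returned points onto a fixed grid at the 10s step it expects. `fill_gaps` is a hypothetical stand-in, not the actual graphite-raintank code; the timestamps are taken from the response shown earlier.

```python
def fill_gaps(points, start, end, step):
    """Place (value, timestamp) points on a fixed grid, using None where no point exists."""
    by_ts = {ts: val for val, ts in points}
    return [(by_ts.get(ts), ts) for ts in range(start, end + 1, step)]


# Three consolidated points, 110s apart, as returned by metric-tank.
consolidated = [(0.0, 1449809068), (0.0, 1449809178), (0.0, 1449809288)]

# The filler still believes the series has a 10s step.
filled = fill_gaps(consolidated, 1449809068, 1449809288, 10)
nulls = sum(1 for val, _ in filled if val is None)
print(len(filled), nulls)  # 23 20
```

Each pair of real points ends up separated by 10 nulls, which is the pattern graphite-watcher receives.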

I'm not sure how graphite even knows that the original data is supposed to be at 10s and starts honoring that; maybe it shouldn't do this. The interesting thing about metric-tank right now is that if it returns raw data, there might be gaps or uneven steps, because it just returns what it knows without any effort to clean things up. When it consolidates, it just consolidates every x points together, also without taking the spacing between them into account. If you have gaps, it can hence aggregate a larger section of time into one point, alongside smaller sections where there are no gaps. We might want to clean this up to be more even, or just do all "formatting/gap-filling" in metric-tank and remove this worry entirely from graphite.
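To illustrate the uneven spacing, here is a small sketch comparing count-based consolidation (every x points together, as described above) with a time-bucketed alternative. Both functions are hypothetical and the sample data is made up; the bucket labelling (end of bucket) is an arbitrary choice for the example.

```python
def consolidate_by_count(points, agg_num):
    """Average every agg_num consecutive points, ignoring their timestamps."""
    out = []
    for i in range(0, len(points), agg_num):
        chunk = points[i:i + agg_num]
        avg = sum(val for val, _ in chunk) / len(chunk)
        out.append((avg, chunk[-1][1]))  # label with the last timestamp in the chunk
    return out


def consolidate_by_time(points, start, step):
    """Average points into fixed-width buckets [start + k*step, start + (k+1)*step)."""
    buckets = {}
    for val, ts in points:
        buckets.setdefault((ts - start) // step, []).append(val)
    # Label each bucket with its end time; empty buckets are simply absent.
    return [(sum(vals) / len(vals), start + (k + 1) * step)
            for k, vals in sorted(buckets.items())]


# 10s data with a gap between ts=30 and ts=100 (made-up values).
raw = [(1.0, 10), (2.0, 20), (3.0, 30), (4.0, 100), (5.0, 110), (6.0, 120)]

by_count = consolidate_by_count(raw, 2)
by_time = consolidate_by_time(raw, 0, 20)
print(by_count)
print(by_time)
```

In the count-based result, the second output point averages values that are 70s apart while the others cover 10s. The time-bucketed variant keeps every output point on a 20s grid and leaves the gap visible as missing buckets.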

What follows from this is that if metric-tank's output can be messy, there's also no way for graphite to deduce what the step should be for any gap filling, so it has to rely on some external information (metric definition, awareness of the consolidation performed by metric-tank, ...)

so this will require some closer collaboration between the two. perhaps it should be up to graphite-api/raintank to set a default maxDataPoints, not metric-tank. this requires some more thought. if MT can be aware of what the resolution should be, it also seems to make sense for gap-filling/data cleanup to happen there, and remove this worry from graphite-api (if it's even possible to stop graphite-api from doing this?)

Dieterbe commented 8 years ago

@woodsaj any thoughts on this?

woodsaj commented 8 years ago

graphite-raintank assumes that the data returned is at the resolution set in the metricIndex stored in Elasticsearch. So it expects a point every 10 seconds and adds nulls for missing points.

https://github.com/raintank/graphite-raintank/blob/master/graphite_raintank.py#L196-L238

So, we need to a) have metric-tank add nulls for missing values,

or b) have metric-tank return the resolution of the points, so graphite-raintank can correctly add nulls where needed.

Dieterbe commented 8 years ago

I realized two things. Not a big issue for now, but later we should watch out for:

- Combining data with different steps.
- The step may change over time. We should track each change in ES as a sort of log instead of just the last known step, so that we have the right step when looking at historical data.
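The "log of step changes" idea could look roughly like this. It is only a sketch: the history layout, the `step_at` helper, and the sample timestamps are all assumptions for illustration, not an existing ES schema.

```python
import bisect


def step_at(history, ts):
    """Return the step in effect at time ts.

    history: list of (effective_from_ts, step) tuples sorted by timestamp
    (hypothetical layout for the change log).
    """
    starts = [start for start, _ in history]
    i = bisect.bisect_right(starts, ts) - 1
    if i < 0:
        raise ValueError("no step known at or before %d" % ts)
    return history[i][1]


# Made-up history: the series reported every 60s, then switched to 10s.
history = [(1449000000, 60), (1449700000, 10)]
print(step_at(history, 1449500000))  # 60
print(step_at(history, 1449809068))  # 10
```

A query over historical data would then use the step that was in effect for that time range rather than the last known one.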

I think ultimately moving over all this logic into MT is the right thing, but for now it seems like if MT just returns the step as part of the result set, overriding the step in graphite-api and keeping the filling logic there should be good enough.
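A rough sketch of that interim approach, assuming metric-tank's response carries the effective step alongside the datapoints. The `"step"` and `"datapoints"` field names and the `fill_series` helper are hypothetical; the real response format is whatever the PRs define.

```python
def fill_series(response, index_step, start, end):
    """Gap-fill a series, preferring the step reported by metric-tank.

    Falls back to the step from the metric index when the response
    does not carry one.
    """
    step = response.get("step") or index_step
    by_ts = {ts: val for val, ts in response["datapoints"]}
    return step, [(by_ts.get(ts), ts) for ts in range(start, end + 1, step)]


# Consolidated response: 110s between points, index still says 10s.
resp = {
    "step": 110,  # hypothetical field
    "datapoints": [(0.0, 1449809068), (0.0, 1449809178), (0.0, 1449809288)],
}
step, series = fill_series(resp, 10, 1449809068, 1449809288)
print(step, len(series))  # 110 3
```

With the reported step overriding the index step, the filling logic stays in graphite-api but no longer inserts nulls between consolidated points.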

Dieterbe commented 8 years ago

working on this.

Dieterbe commented 8 years ago

https://github.com/raintank/raintank-metric/pull/77 + https://github.com/raintank/graphite-raintank/pull/14 solves this issue for me and makes graphite-watcher 100% happy about the data (no null points, all exactly 1 interval distance between each point, etc)

Dieterbe commented 8 years ago

@woodsaj can you review both PRs please, especially the graphite one :)

woodsaj commented 8 years ago

These PRs have now both been merged into master.

Dieterbe commented 8 years ago

Now graphite-watcher is again getting points at 10s steps. metric_tank is running with --agg-settings 600:21600:2,7200:21600:2,21600:21600:2 as usual. I'll look into this.

Dieterbe commented 8 years ago

Looks like this is due to the graphite_raintank.py change somehow not making it into the docker image on my computer. Need to fiddle more with this.

raintank / graphite-api

maxDatapoints implicitly applied by metric-tank, but gap-filled by graphite/graphite-raintank #8