vimeo / graph-explorer

A graphite dashboard powered by structured metrics
http://vimeo.github.io/graph-explorer/
Apache License 2.0

a way to graph as a percentage of a total #121

Open X-Trade opened 10 years ago

X-Trade commented 10 years ago

Quite often we want to display a graph in the context of a total. e.g. percentage of 404 responses out of all responses.

A stacked graph achieves this visually but there are situations where this still does not help:

Take collectd's cpu usage. It reports jiffies as a counter (we had to tweak the stock plugin to get this to work at all). The problem here is that jiffies are not necessarily a constant number per second. If we stack all of those counters together, we get spikes above and below the natural maximum line.

So what I propose is a function 'percentage of' or 'percent by' that would work similarly to 'sum by'. Then we could query for e.g. 'stack collectd_plugin=cpu sum by core group by server percent by type'. For each graph, the values would be represented as a percentage of the total of all types, for each group.

Dieterbe commented 10 years ago

so percent by type, assuming you have three types (idle, sys, usr), would show 3 lines?

line 1: idle / (idle+sys+usr)
line 2: sys / (idle+sys+usr)
line 3: usr / (idle+sys+usr)

?
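To make the three lines above concrete, here is a minimal Python sketch of the arithmetic, assuming the series have already been turned into per-interval deltas. The function name `percent_by` and the sample numbers are hypothetical and not part of graph-explorer. Note that the totals differ per interval (100, then 120), which is the jiffies problem described earlier; dividing by the per-timestamp total keeps every stack at exactly 100%.

```python
from typing import Dict, List


def percent_by(series: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Return each series as a percentage of the per-timestamp total."""
    if not series:
        return {}
    length = len(next(iter(series.values())))
    # Total across all series at each timestamp.
    totals = [sum(values[i] for values in series.values()) for i in range(length)]
    return {
        name: [100.0 * v / t if t else 0.0 for v, t in zip(values, totals)]
        for name, values in series.items()
    }


# Hypothetical jiffy deltas for two intervals.
cpu = {"idle": [80.0, 60.0], "sys": [5.0, 30.0], "usr": [15.0, 30.0]}
shares = percent_by(cpu)
```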

X-Trade commented 10 years ago

Yeah, that sounds about right.

Dieterbe commented 10 years ago

oh i forgot that it would be converted to a percentage. is that what you meant with 'about right'? this seems like a really neat idea, and we should implement it. i don't think i have time for it anytime soon though.

bnkr commented 10 years ago

Do you think you could give a basic idea of what to do to make a pull request for this?

It looks to me like interpreting the user's "percent by" query would be in graph_explorer/test/test_query.py and the building of a graphite render/ parameter is in graphs.build_from_targets (build_from_targets doesn't seem to have a lot of coverage, though).

To be clear, I would eventually like to end up with something like:

target=asPercent(
  derivative(sumSeries(collectd.server.cpu-*.cpu-idle)),
  derivative(sumSeries(collectd.server.cpu-*.cpu-*))
)

Can you think of anything else I'd need to worry about?
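As a rough illustration, a target like the one above could be assembled by a small string helper. This is only a sketch: `as_percent_target` and its `counters` flag are hypothetical names, not existing graph-explorer code. `asPercent`, `derivative` and `sumSeries` are the Graphite functions already used in the target above.

```python
def as_percent_target(part: str, total: str, counters: bool = False) -> str:
    """Build a Graphite asPercent() target from two target expressions.

    If `counters` is True, both sides are wrapped in derivative() first,
    as one would do for ever-increasing counters such as CPU jiffies.
    """
    if counters:
        part = "derivative(%s)" % part
        total = "derivative(%s)" % total
    return "asPercent(%s,%s)" % (part, total)


target = as_percent_target(
    "sumSeries(collectd.server.cpu-*.cpu-idle)",
    "sumSeries(collectd.server.cpu-*.cpu-*)",
    counters=True,
)
```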

Dieterbe commented 10 years ago

graph_explorer/test/test_query.py is some unit tests for the query parsing. the query parsing is in graph_explorer/query.py, you can extend that. graphs.build_from_targets is indeed where all the magic happens. the function looks daunting, maybe, but it's just a series of steps and every step is explained, but you'll probably need to put prints in there to follow along with what's happening. not sure why you need the derivatives though. just asPercent(<fraction>,<total>) should work.

this thing is going to be trickier than straight up sum by / avg by though. with those we just fetch all matching metrics, and then bundle them together in sum() and avg() calls (and as we add them to a sum or avg, we remove the entry from the list of to-be-independently-rendered-metrics, because the list of to-be-independently-rendered-metrics will contain the sum/avg that contains it already)
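The sum by bookkeeping described above can be sketched roughly as follows. The metric representation (a dict with an `id` and a `tags` dict) and the function name `sum_by` are assumptions for illustration; the real logic lives in build_from_targets and may look different. The point is that every metric ends up in exactly one rendered target, either on its own or inside one aggregate.

```python
from collections import defaultdict


def sum_by(metrics, tag):
    """Bundle metrics that differ only in `tag` into one sumSeries() target."""
    groups = defaultdict(list)
    for metric in metrics:
        # Metrics sharing every tag except `tag` land in the same group.
        key = tuple(sorted((k, v) for k, v in metric["tags"].items() if k != tag))
        groups[key].append(metric["id"])
    targets = []
    for ids in groups.values():
        if len(ids) == 1:
            targets.append(ids[0])  # nothing to aggregate, render as-is
        else:
            targets.append("sumSeries(%s)" % ",".join(ids))
    return targets


# Hypothetical metrics for one server with two cores.
metrics = [
    {"id": "foo.cpu-0.idle", "tags": {"server": "foo", "core": "0", "type": "idle"}},
    {"id": "foo.cpu-1.idle", "tags": {"server": "foo", "core": "1", "type": "idle"}},
    {"id": "foo.cpu-0.sys", "tags": {"server": "foo", "core": "0", "type": "sys"}},
]
targets = sum_by(metrics, "core")
```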

in this case there's 2 options:

for a given foo cpu percent by type query, let's say we get 3 resulting metrics:

server=foo collectd_plugin=cpu type=idle
server=foo collectd_plugin=cpu type=usr
server=foo collectd_plugin=cpu type=sys

A) we add to the list of to-be-independently-rendered-metrics 3 things: aspercent(idle,idle+usr+sys), aspercent(usr,idle+usr+sys) and aspercent(sys,idle+usr+sys)

B) we add to the list of to-be-independently-rendered-metrics only the things specifically asked for. maybe we only want idle, i.e. aspercent(idle,idle+usr+sys). but in this case we can't just add 'idle' to the query, because that would filter out usr and sys from the retrieved metrics. so we would need this in the query syntax, like maybe percent by type:idle

B may sound like an overcomplication, but ultimately we should have a way of expressing this, because there will be use cases where you don't want to see the percentage for every possible value of a tag.
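One possible way to recognise a `percent by type:idle` clause is sketched below. The syntax is only the one floated in this comment, and `parse_percent_by` is a hypothetical standalone helper; it does not reflect how graph_explorer/query.py actually parses queries.

```python
import re

# "percent by <tag>" with an optional ":<value>" suffix (proposed syntax only).
PERCENT_BY = re.compile(r"\bpercent by (\w+)(?::(\S+))?")


def parse_percent_by(query):
    """Return (remaining query, tag, value); tag and value are None if absent."""
    match = PERCENT_BY.search(query)
    if match is None:
        return query, None, None
    # Drop the clause and normalise whitespace in what is left.
    rest = (query[:match.start()] + query[match.end():]).split()
    return " ".join(rest), match.group(1), match.group(2)


parsed = parse_percent_by("collectd_plugin=cpu percent by type:idle")
```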

the other thing is that you can't just do the logic i described for avg/sum by, because every metric will appear multiple times (not just either as an independently rendered metric or as part of an aggregate)

but if you're decent at python this shouldn't be an issue, you can just keep track of all the values, so you know what the sum looks like, and then use that in all the aspercent's you want to render.

anyway depending on how ambitious you are, i would start with the simplest case, A. we can add the filtering for B later.
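A minimal sketch of case A, under the simplifying assumption that each tag value maps to exactly one metric in the group: build the total once, then reuse it in every asPercent() target. The metric dicts, the metric ids and the name `percent_by_targets` are hypothetical.

```python
def percent_by_targets(metrics, tag):
    """Return {tag value: asPercent target}, all sharing the same total."""
    # Every metric appears in the total AND as its own numerator.
    total = "sumSeries(%s)" % ",".join(m["id"] for m in metrics)
    return {
        m["tags"][tag]: "asPercent(%s,%s)" % (m["id"], total)
        for m in metrics
    }


# Hypothetical ids for the three metrics listed above.
metrics = [
    {"id": "collectd.foo.cpu.idle", "tags": {"server": "foo", "type": "idle"}},
    {"id": "collectd.foo.cpu.usr", "tags": {"server": "foo", "type": "usr"}},
    {"id": "collectd.foo.cpu.sys", "tags": {"server": "foo", "type": "sys"}},
]
targets = percent_by_targets(metrics, "type")
```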

bnkr commented 10 years ago

That's really helpful information. Thanks.

I'm not sure I understand the use case for percent by type:idle. It would seem more logical to make a query like collectd_plugin=cpu type=idle percent by type, and this would give you graphs for idle on each machine. I'm pretty new to the query language so I may be missing something.

The derivatives are in my example because collectd cpu metrics are counters of total scheduling units, so I put them in by habit only.

In any case I will let you know how I get on by the end of the week.

Dieterbe commented 10 years ago

collectd_plugin=cpu type=idle percent by type would result in only metrics that have type=idle, and then later, in build_from_targets(), we wouldn't have the other metrics with different types, so we couldn't include them in the generated asPercent statements. although we could actually deal with this: the query parsing/executing logic would be aware of the percent by <TAG> and then make sure we fetch all metrics with whatever value they have for <TAG>, even when you also have <TAG>=somevalue in the query.

but anyway, i wouldn't worry too much about that now.
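The idea of fetching every value of <TAG> even when the query also contains <TAG>=somevalue could look something like the sketch below. It assumes the query has already been split into `key=value` patterns; `split_filters` is a hypothetical helper, not an existing graph-explorer function.

```python
def split_filters(patterns, percent_tag):
    """Separate filters used for fetching from the one that picks the numerator.

    A `<percent_tag>=value` pattern is removed from the fetch filters, so that
    all values of the tag are retrieved for the total, and `value` is returned
    separately to decide which series actually get rendered.
    """
    fetch, wanted = [], None
    for pattern in patterns:
        key, sep, value = pattern.partition("=")
        if sep and key == percent_tag:
            wanted = value
        else:
            fetch.append(pattern)
    return fetch, wanted


fetch, wanted = split_filters(["collectd_plugin=cpu", "type=idle"], "type")
```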

bnkr commented 10 years ago

That makes sense. It seems to be very useful to be able to graph things in the context of other metrics which you don't want to show so I'll look at that as well, but I expect I'll just do the original idea until I find some concrete use cases to test against.

X-Trade commented 10 years ago

From a user point of view, I think I would expect that:

collectd_plugin=cpu type=idle percent by type - would give me idle only as a percentage of the sum of all types
collectd_plugin=cpu percent by type:idle - would give me values for all types as a percentage of the idle value. In this case normally you might not want to see the 'idle' value because it is always going to be 100%, but as long as we can still filter that out in the query it's fine.
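The second reading, where every type is shown relative to one reference value rather than to the total, can be sketched as below. The function name `percent_of_reference` and the sample numbers are hypothetical; as noted, the reference series itself always comes out at 100%.

```python
def percent_of_reference(series, reference):
    """Express every series as a percentage of the `reference` series."""
    ref = series[reference]
    return {
        name: [100.0 * v / r if r else 0.0 for v, r in zip(values, ref)]
        for name, values in series.items()
    }


# Hypothetical per-interval values.
cpu = {"idle": [80.0, 50.0], "sys": [20.0, 25.0]}
relative = percent_of_reference(cpu, "idle")
```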

I understand that may technically be wrong though and maybe I don't fully grasp the syntax of the query language.

Dieterbe commented 10 years ago

your first query, agreed. it just means we have to support the gathering of extra metrics as i described in my previous comment.

your second query, i like that. it seems quite useful and also consistent with the buckets feature of sum by / avg by (implementing buckets for percent by could also be useful). i suggested this syntax to do what your first query does, but this way is better I think.