Open srfraser opened 9 years ago
As others have mentioned, this is a highly requested and important feature. I use influx for recording metrics from my house and home server lab, of of the things I track is whole-house energy consumption. I capture wattage and Voltage coming off the meter, which does fluctuate. Currently in grafana, I'm simply dividing currant by 124 (average voltage rating I get) to get the currant in amps, but since voltage fluctuates between 120 and 125 or so, I'd like my current rating to be more accurate by being able to divide the two series. Graphite has this, can we please get this added?
IANAIE (Influx Employee) but IIUC this feature is especially tricky for Influx compared to most TSDBs because there's no defined time resolution. So if you're trying to join a series at 13-second resolution with one at 19-second resolution, You're Gonna Have a Bad Time (prime numbers chosen to accentuate pain).
Now, if it happens that you know for an ironclad fact that two of your series have exactly the same time granularity and cadence-- in this limited case it's safe and relatively easy to do a join. But Influx doesn't carry this kind of metadata, or constrain the points that you write to any particular resolution.
So, in the general case, in order to make progress you have to invent a whole DSL for deciding stuff like how to dedupe fields and combine values when two points snap to the same rounded timestamp.
I think a good way to start cutting into this would be to support joins between two members of the set { subqueries, measurements derived from a CQ } that have the same "group by" criteria (tags and, critically, time).
It would be nice if simple expressions could be used to adapt the tag values and paper over syntactic mismatches.
It would also be nice if you could collapse any aggregates that appear on both sides, e.g. "(foo.A + foo.B) as APlusB", "avg(foo.A, foo.B)", etc.
A series is defined as a single field organized by a collection of tags contained within a measurement. When you write multiple fields using the line protocol, this writes multiple series to the measurement with the same set of tags.
When selecting from a single measurement and selecting from multiple fields, an implicit join is done on the time of each point in the series for the fields that are requested.
At the moment, when multiple measurements are used, they are not joined together and are instead listed one by one. Ideally, measurements would automatically be merged and look like this.
Ideally, measurements would automatically be joined on time for each of these measurements. For example, with this, you might get:
cpu value=1 0
cpu value=2 10
cpu value=3 20
mem value=4 0
mem value=5 10
mem value=6 20
> SELECT * FROM cpu, mem
name: cpu
time cpu.value mem.value
---- --------- ---------
0 1 4
10 2 5
20 3 6
This is how it should look and how it should return. While the name is included, it mostly doesn't mean anything when the query returns like this.
In the existing system, series can be grouped into multiple series
instead of just a single series by using GROUP BY <tag>
. We will still
continue to support this and this will continue to be an important way
of organizing series.
So if I have the following:
cpu,host=server01,region=useast value=1 0
cpu,host=server02,region=useast value=2 0
mem,host=server01,region=useast value=3 0
mem,host=server02,region=uswest value=4 0
> SELECT * FROM cpu, mem GROUP BY host
name: cpu
tags: host=server01
time cpu.value mem.value
---- --------- ---------
0 1 3
name: cpu
tags: host=server02
time cpu.value mem.value
---- --------- ---------
0 2 4
The series that have the same host are grouped together. If you want to match a differently named tag, then you may have issues since I don't think we'll support that for the initial iteration of this feature. We would need to provide some way to tell the query that you want to interpret one of the tags as a different tag and it is likely not worth trying to get that syntax correct if people don't need that functionality.
As can be seen from this example, the measurements from 'server02' do not match the same server. One is from region 'uswest' and one is from 'useast'. Because we are not grouping by region, these can be matched together. If you wanted the query to match the region too, you would do this and get the following output:
> SELECT * FROM cpu, mem GROUP BY host, region
name: cpu
tags: host=server01,region=useast
time cpu.value mem.value
---- --------- ---------
0 1 3
name: cpu
tags: host=server02,region=useast
time cpu.value mem.value
---- --------- ---------
0 2
name: cpu
tags: host=server02,region=uswest
time cpu.value mem.value
---- --------- ---------
0 4
Notice how the measurement name is still cpu
. When using a join, the
measurement name is frankly irrelevant and should be discarded, but we
aren't going to because a measurement name is required and is used in
some locations.
Be aware that if you GROUP BY *
it will include all possible tags.
That means if you have one measurement with 3 tags and one with 2 tags,
they may never match because the measurements with 2 tags will have an
empty value for the third tag and will never match.
Instead of these being joined, they will be returned as separate series and the measurement name is considered important. Using this paradigm, it's not really possible to join the measurements in a way that makes sense.
cpu value=1 0
cpu value=2 10
cpu value=3 20
mem value=4 0
mem value=5 10
mem value=6 20
> SELECT * FROM cpu, mem
name: cpu
time value
---- -----
0 1
10 2
20 3
name: mem
time value
---- -----
0 4
10 5
20 6
The new syntax will force the ideal behavior to occur, but we cannot break backwards compatibility. If this proves to be a good output format, we may consider using this format in the future.
For the initial implementation, we need to implement OUTER and INNER
joins, although it's unclear if we need to use those words. Kapacitor
implements them implicitly by using INNER if fill(none)
is specified
and using OUTER for any other fill option. Since we default to
fill(null)
, I think we should default to an OUTER join and do the same
thing as Kapacitor.
SELECT * FROM cpu, mem JOIN MEASUREMENTS
JOIN
would be a new token, but MEASUREMENTS
is a currently existing
token. Rather than try to do something like this:
SELECT * FROM cpu JOIN mem
I am proposing we use something more friendly to the old syntax. This way, it is more natural to include support for regexes and multiple measurements. For three measurements, I think this is awkward:
SELECT * FROM cpu JOIN mem JOIN gpu
With a regex, it also doesn't match since the implication is that each entry refers to a single measurement.
# Does this join cpu and gpu or not?
SELECT * FROM /[cg]pu]/ JOIN mem
The proposed syntax takes away that ambiguity by telling the query to join all of the measurements within the measurements list.
This will fix math for even within the same measurement. Right now, there are no particular rules for how these measurements are joined together to perform intra-measurement mathematics. That means if you try to do math on multiple fields within the same measurement, they can get out of sync very easily and there is no way to control how these behave.
This will also implement the feature requested in #7216 since intra-measurement joins will begin to start using the JOIN functionality for their behavior. This is an additive change and is not a breaking change because the default fill option is null which is what it currently does. This will just allow a person to customize this.
The existing return value for multiple measurements will not be changed since that would be a breaking change. Users will have to use the new syntax for telling InfluxDB to perform the join for measurements rather than just within a measurement.
Kapacitor implements tolerance margins so that if two points are close enough in timestamps, they will be joined in the case that their timestamps are within the tolerance rather than requiring the timestamps to be exact. Additional points will just follow the same rules that were set by fill.
When a field name conflicts between multiple measurements, which should be used or should both be used? With standard SQL, the query will return an error saying that it is unable to resolve the ambiguity between the two. For conditions, this may be necessary since we don't want to guess which value they intended in a situation like this:
SELECT * FROM cpu, mem WHERE value > 2
We could transform this to one of the following though:
SELECT * FROM cpu, mem WHERE cpu.value > 2 AND mem.value > 2
SELECT * FROM cpu, mem WHERE cpu.value > 2 OR mem.value > 2
I'm not sure either of these would be proper though and there wouldn't really be a way for us to determine which of these the user intends. But when we use a regex, we expand it similar to how we expand wildcards. We also expand wildcards. If we have a conflict, it would be possible to transform the following:
SELECT value FROM cpu, mem
Into this statement:
SELECT cpu.value, mem.value FROM cpu, mem
This would be the same as if we had done this with the original.
SELECT /^value$/ FROM cpu, mem
But is this a behavior that the user would expect? That's less clear. And since this behavior would be different from SQL, it's not clear that it's something a user would expect.
I would be tempted to make this an error for the initial implementation and then we can expand it later, but some feedback on what others think would be very helpful here.
Testing with SQL, SQL does not seem to care if you use an implicit JOIN and don't use one of the values inside of the JOIN. Because of that, we're just not going to care either.
Example:
sqlite> .headers on
sqlite> CREATE TABLE a(id integer, host varchar(256));
sqlite> CREATE TABLE b(id integer, region varchar(256));
sqlite> INSERT INTO a VALUES(1, 'server01');
sqlite> INSERT INTO b VALUES(2, 'useast');
sqlite> SELECT a.id FROM a, b;
id
1
sqlite> SELECT * FROM a, b;
id|host|id|host
1|server01|2|useast
This seems like an easy enough pattern and, since it has precedent, seems like something we should do.
Please leave any feedback you have about the proposed syntax and behavior. If you like the proposal and think it works for your use case, please leave a reaction. No :+1: comments since that will make discussion harder.
I updated the above comment to include information about how grouping works. This is important because it determines how to get similar series to match with each other which would be very important for this feature.
Notice how the measurement name is still cpu. When using a join, the measurement name is frankly irrelevant and should be discarded, but we aren't going to because a measurement name is required and is used in some locations.
Perhaps we could create a new measurement name by concatenating the measurements from the SELECT
clause? E.g.
> SELECT * FROM cpu, mem GROUP BY host
name: cpu|mem
tags: host=server01
time cpu.value mem.value
---- --------- ---------
0 1 3
I'd advocate for using a rare character as the join, rather than -
or _
. That will make it somewhat less ambiguous
The proposed syntax takes away that ambiguity by telling the query to join all of the measurements within the measurements list.
Works for me. Also removes any worry about OUTER
vs INNER
since the user never specifies those.
We could transform this to one of the following though:
SELECT * FROM cpu, mem WHERE cpu.value > 2 AND mem.value > 2 SELECT * FROM cpu, mem WHERE cpu.value > 2 OR mem.value > 2
In standard SQL users are quite used to using table.column to return values in queries with more than one table. I think for InfluxQL, a similar syntax works well. If a field or tag that is NOT part of the join is used, then the user must specify the parent measurement.
Tolerance Margins
I do think we should include these, but I would be fine with waiting for a future release. The initial implementation can join only on explicit timestamp matches. Let's debug the rest of the syntax and behavior before letting users create 1h
tolerance joins on data.
Question: What does it mean to do something like:
SELECT * FROM cpu, cpu GROUP BY host
I personally don't like the idea of concatenating the measurement names. I'm not convinced measurement names should even exist anymore. I noticed Grafana will concatenate the measurement name and the column name and I think we could probably get that working so they ignored the measurement name if it was empty and just used the column. When we start returning the measurement name in the column output, I think that's probably preferable. Although there is a bit of a question. If I do this:
SELECT mean(cpu.value) AS mean_value FROM cpu
Should the column be cpu.mean_value
or mean_value
?
I think for something like:
SELECT * FROM cpu, cpu GROUP BY host
That would be rewritten to:
SELECT cpu.value, cpu.value FROM cpu GROUP BY host
Although we still don't have the best idea of what happens when you combine the same measurement from multiple databases/retention policies. It seems to me like the measurement should be tied to the field rather than being placed into the FROM
clause. I'm not entirely sure how to do that syntactically, but having the FROM
come at the end is syntactically ambiguous in these cases. I've been wondering if it should be rewritten like this:
SELECT db0.autogen.cpu.value FROM db0.autogen.cpu
I think it might be cleaner and a bit more flexible to use a slightly different syntax. Why not make join
a function call? So you can do:
SELECT value
FROM join(0, mem, cpu), foo
-- or
SELECT value
FROM join(/foo.*/)
Then you could have multiple measurements returned. You'd also be able to specify the tolerance as the first argument. The function join
could be an outer join and then later if we want to add inner we'd just add that as function inner_join
.
I agree that if the user has something in the WHERE
clause that matches against a field, it should return an error like SQL does if it's ambiguous.
I think the issue with doing it like that is you would then be mixing two different types of output for two queries that have to work completely differently. What would the output of this example return?
SELECT value FROM join(0, mem, cpu), foo
We have the tolerance in there, which should be an optional parameter since many people probably won't care too much about it, and then we have two measurements. Makes sense, but then we have another measurement attached to the end. We also have only one variable and it is ambiguous who it refers to for the join, but makes complete sense for foo
.
So the mechanics of join isn't really that of a function. It's not a simple transformation of data within the query language, but it acts more like changing the functionality completely.
As an example syntax for tolerance, I was thinking something like this personally:
SELECT cpu.value, mem.value FROM cpu, mem JOIN MEASUREMENTS WITH tolerance(1s)
Most of these tokens are already reserved keywords. We could decide if tolerance()
should be a function or not, but I think it reads decently well. I think it also sets us up to make JOIN MEASUREMENTS
the default whenever we decide to make a breaking change.
The measurement on the end would be another result set just like it is now if you select FROM foo, bar
. The variable for tolerance is between all series being joined. If you wanted to have different tolerances, then we'd need to support nesting like: join(0, foo, join(2, mem, cpu))
No breaking changes until 2.0 (which we're not even thinking about yet).
I guess it's not that big of a deal to be able to do join(0, mem, cpu), foo
since you can get the same thing from doing a multi-statement query:
select value from mem, cpu JOIN MEASUREMENTS;
select value from foo;
I'm not that married to it since your last suggestion accomplishes much the same thing and it's already too late for me to shoehorn my functional dream into the query language. Thoughts from others?
While attempting to implement this, I ran into another issue which I'll describe here and I'll describe the solution that we determined would be most appropriate. For those following this issue closely, my apologies for not writing up the problem before we started talking internally so anybody who wanted could contribute. I'm writing it up here for the sake of transparency before we attempt to implement it.
The first thing to understand is how series are joined together currently. When you have a single measurement with multiple fields like the following, we need to join the values together in some way.
cpu,host=server01 idle=20,user=70,system=10 0
cpu,host=server02 idle=30,user=50,system=20 0
When we execute a query to retrieve all of the fields, we are querying 6
different series. There are 3 fields and 2 series keys for a total of 6
series, but we are only returning 2 rows at time zero. How do we
determine which series are joined together? According to the other
proposal, the series would just be joined together in a fairly random
order so you could intermix values from cpu,host=server01
with values
from cpu,host=server02
. There are a few problems with this though.
But, if we don't use this method of joining together series, joining
series between measurements becomes very complicated. Imagine that you
have the following in the mem
measurement and join it with the above.
mem,host=server01,region=uswest free=80 0
Since this series does not have a series key that matches anything
above, the value would not be joined with any row in cpu
above. While
we could force series keys to be the same across measurements for this
feature to work, we think that's too limiting.
We are proposing a new syntax and new functionality to join. When series
are being joined (either between multiple measurements or a single
measurement), the user will be able to specify which parts of the series
key should be considered while joining. Only tags that are mentioned in
the join can be used in the resulting query (including GROUP BY
). The
join will happen before anything else so series will be joined into a
single row before they are grouped into different series. I think this
contrasts with Kapacitor which does the grouping first and then joins
the series. /cc @nathanielc
The syntax will be something like this:
SELECT cpu.idle, cpu.user, mem.free FROM cpu, mem JOIN SERIES ON host
JOIN MEASUREMENTS
has been changed to JOIN SERIES
to put the focus
on how we are joining series together. Series from different
measurements are not treated in a special way.
Series will be joined together with the cartesian product for any values
at the same time. So if you tell a query to join on host
and have 2
series that match with 2 series in the other field, you will end up with
4 rows total. In particular, that means queries like this will work in a
way we think is most fitting.
cpu,host=server01,num=0 value=20 0
cpu,host=server01,num=1 value=30 0
mem,host=server01 value=10 0
> SELECT cpu.value, mem.value FROM cpu, mem JOIN SERIES ON host
time cpu.value mem.value
---- --------- ---------
0 20 10
0 30 10
GROUP BY
is used, is it reasonable to infer that it should be
included in the JOIN
so the user doesn't have to repeat it?For now, we're going to keep things explicit and try to make this more user friendly with feedback in the future. If you can think of any problems or you think that this is the wrong way to deal with joining measurements, please post a comment so we can address any concerns. This is a difficult problem and we appreciate and welcome any community involvement.
When there is a tag selected in the query, is it reasonable to infer that series should be joined on that tag?
As long as "Only tags that are mentioned in the join can be used in the resulting query" is true then I think this is reasonable and in fact necessary.
When a GROUP BY is used, is it reasonable to infer that it should be included in the JOIN so the user doesn't have to repeat it?
I think this is reasonable. I don't think a query would make sense if the GROUP BY tag weren't also part of the join.
Given the cross-product output, what would I get from the following query?
cpu,host=server01,num=0 value=20 0
cpu,host=server01,num=1 value=30 0
mem,host=server01 value=10 0
> SELECT COUNT(cpu.value), COUNT(mem.value) FROM cpu, mem JOIN SERIES ON host
???
I was originally thinking the answer would be 2 for both of those columns, but we had another discussion today and @pauldix is doing a writeup of how he thinks the mechanics for this should work. We've been throwing around the idea of making this only work for aggregates rather than allowing raw query math with join, but we're unsure if that would solve the main use case people have.
For all of those following this issue, we are having difficulty at the moment with scoping the issue and determining what exact use cases people need. We understand one use case to be doing something like correlating the number of requests with the number of errors using aggregates like this:
SELECT count(errors.value) / count(requests.value)
But I would also like to hear other use cases. Specifically, the above one is more about performing aggregates on two different measurements rather than doing math between those measurements. I would like to hear more use cases involving math between measurements and why you are trying to do it that way so we can determine if we are working on the correct feature and designing it correctly.
Thanks.
We're using InfluxDB to store measurements gathered from the LUT Green Campus. We mainly gather measurements related to energy, which typically means turning voltage and current measurements to power. The electrical system has three phases (L1
, L2
, L3
), which are used as fields in the schema. An example query to calculate active_power
from measured voltage
, current
and phi
for a single phase L1
would be something similar to
SELECT mean(voltage.L1) * mean(current.L1) * mean(cos(phi.L1)) GROUP BY time(5m), panel_group
Furthermore, the values for apparent_power
and reactive_power
can be derived from these three measurements.
As another data point on use cases we have exactly the same use case as @kipe at Flywheel Building Intelligence.
Any chance for those who just "+1"ed to remove their comments and +1 the OP instead?
It would make it easier to follow the discussion. Thanks.
Hey Guys,
Another use case which I got was calculating a percentage, where the denominator value is in one measurement and the numerator is in another, and I needed to do something like
SELECT (sum(a.value) / sum(b.value)) * 100
where a and b are separate measurements, which is similar to what you guys mentioned. Is there no roundabout way to do this for now?
@thealphaking01 for your example, is that supposed to be sum(a.value) / sum(b.value)
or is the second value supposed to be the raw values?
@jsternberg a.value is a subset of b.value. Didn't get what you meant by raw values? For a particular duration, sum(a.value) gives me the number satisfying a criteria, while sum(b.value) gives me all possible values. Hence the percentage calculation.
Raw values are the ones directly written into the system. So SELECT value FROM cpu
returns the raw values for the value field in the cpu measurement. The reason why I ask is because I didn't knonw if your above example had a typo or not since a.value
is inside of an aggregate and b.value
is not.
Sorry, that was a typo. Fixed it now.
I'm not sure if this is the correct case to implement the issue of case #3525 for inequalities? I would like to do something along the lines of:
SELECT mean(value) > 40
where the result is returned as a full time series of boolean values. My specific use case for this is determining if a pump is operating based on the current draw it is using.
Opening a feature request kicks off a discussion. Requests may be closed if we're not actively planning to work on them.
Proposal: Implement SQL Joins in the Influx Query Language Current behavior: Not supported. Desired behavior: Supported.
Use case:
When requesting data from InfluxDB it would be very useful to combine metadata, configuration data, etc., from one "table" with the time-series data. For example:
assets (measurement) time (actually not used in this table) id (tag) = cryptic internal identifier friendly (field) = character string port (field) = number (TCP/IP port number used by the device)
i.e., insert assets,id=y268938rjnau3 friendly='Asset1',port=64200
rundata (measurement) time id (tag) = cryptic internal identifier temperature (field) pressure (field)
This works great: select last(temperature) from rundata group by mac
But this is what I really want:
Timestamp Machine Temperature
10/05/2016 8:04:06pm C2475 1675.4 10/02/2016 9:02:11am C7524 850.5
That might be done like this:
select c.time as "Timestamp", a.friendly as "Machine",last(c.temperature) as "Temperature" from rundata c,assets a where c.id = a.id group by c.mac
I dont know if this should be a separate issue, but we would really like to be able to do simple stuff like this:
SELECT non_negative_derivative(last(PortXmitData * 4), $interval) from ibstats WHERE "interface" =~ /ib.*/ AND "host" =~ /$hostname/ and $timeFilter group by time($granularity),interface,host fill(none)
Note the PortXmitData * 4; this blows up today. The reason for this specific reason is that the metrics from IB equipment are returned as "octets divided by 4” (¼ of the actual number of Bytes)" which Grafana does not know how to deal with (because its a stupid unit). However, there are other cases where doing this sort of simple arithmetic is very useful, and hopefully this is a fairly simple feature to implement.
One other feature that is commonly requested is the ability to plot a metric against the mean/95%ile for each item in a GROUP BY. Any method to achieve that is fine by us.
I understand your issue… makes sense… the same issue in my opinion. But I don’t see tx_bytes*4 in your query…
I dont know if this should be a separate issue, but we would really like to be able to do simple stuff like this:
SELECT non_negativederivative(last(PortXmitData), $interval) from ibstats WHERE ("interface" =~ /ib./ OR "interface" =~ /p._/) AND "host" =~ /$hostname/ and $timeFilter group by time($granularity),interface,host fill(none)
Note the tx_bytes * 4; this blows up today. The reason for this specific reason is that the metrics from IB equipment are returned as "octets divided by 4” (¼ of the actual number of Bytes)" which Grafana does not know how to deal with (because its a stupid unit). However, there are other cases where doing this sort of simple arithmetic is very useful, and hopefully this is a fairly simple feature to implement.
One other feature that is commonly requested is the ability to plot a metric against the mean/95%ile for each item in a GROUP BY. Any method to achieve that is fine by us.
— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/influxdata/influxdb/issues/3552#issuecomment-257987465, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AVlReR3OTMcZ_DcCy4OJy3ea4qU58p5Pks5q6PFqgaJpZM4FljNk.
@inselbuch my apologies, I fixed the example query.
The problem is that this works: non_negative_derivative(last(PortXmitData), $interval) And this does not: non_negative_derivative(last(PortXmitData * 4), $interval)
+1
Edit: +1ed the OP
+1
I am currently in the process of trying to switch from graphite to influx with grafana as our front end.. I currently have queries like this for graphite but I havent found a way to convert them to influx yet..
asPercent(reporting1_dev.memory.used,reporting1_dev.memory.total)
asPercent(reporting1_dev.memory.buffers,reporting1_dev.memory.total)
Also like this
asPercent(nonNegativeDerivative(reporting1_dev.cpu.total.iowait),nonNegativeDerivative(sumSeries(reporting1_dev.cpu.total.*)))
asPercent(sumSeries(nonNegativeDerivative(reporting1_dev.cpu.total.{guest,iowait,nice,steal,irq,softirq})),nonNegativeDerivative(sumSeries(reporting1_dev.cpu.total.*)))
Any help would be great!!! Thanks! Josh
So I found that I if I am doing math all of the info has to be in the same table.
+1 we need to do some maths between measurements
This is a huge feature missing from InfluxDB, making it much less powerful than Prometheus.
+1
+1
+1
I think the point has been stated with all of the +1s... This thread should be locked until there is progress on it.
+1
I'm locking this again. While that means you won't be able to add your 👍 reaction, I think there are enough of them that we're well aware people want this feature. We do want to hear any resources that may be useful in terms of implementing this. You can look in the commit log to find my email. If you want to wait for comments regarding this issue, please use the "Subscribe" button on the issue instead of responding to the issue.
Thank you.
Apologies if this is a duplicate, I had a look and couldn't see a relevant issue.
I can see from the documentation how to select from multiple measurements (although it calls them series, still, at https://influxdb.com/docs/v0.9/query_language/data_exploration.html )
For example, with data inserted by telegraf, you can do:
select * from disk_used,disk_total where host = 'myhostname' and path = '/'
How would you express that as a percentage? I've tried variations of the following, and none seem to work:
select disk_used.value/disk_total.value from disk_used, disk_total where host = 'myhostname' and path='/'
The "mydb"."retentionpolicy"."measurement" syntax doesn't work there, either.
Is it a good idea to add aggregation functions for cases like
diff(value1, value2) from m1, m2
anddivide(value, value) from m1, m2
, or should the arithmetic operators be working?Also, I noticed when experimenting that it's also not possible to divide one derivative by another. For example, if I have two counters, bytes transferred and api calls made - both of which are constantly going up - how would you calculate the mean bytes per api call?