influxdata / influxdb

Scalable datastore for metrics, events, and real-time analytics
https://influxdata.com
Apache License 2.0

Mathematics across measurements #3552

Open srfraser opened 9 years ago

srfraser commented 9 years ago

Apologies if this is a duplicate, I had a look and couldn't see a relevant issue.

I can see from the documentation how to select from multiple measurements (although the documentation still calls them series: https://influxdb.com/docs/v0.9/query_language/data_exploration.html).

For example, with data inserted by telegraf, you can do: select * from disk_used,disk_total where host = 'myhostname' and path = '/'

How would you express that as a percentage? I've tried variations of the following, and none seem to work:

select disk_used.value/disk_total.value from disk_used, disk_total where host = 'myhostname' and path='/'

The "mydb"."retentionpolicy"."measurement" syntax doesn't work there, either.

Is it a good idea to add aggregation functions for cases like diff(value1, value2) from m1, m2 and divide(value, value) from m1, m2, or should the arithmetic operators be working?

Also, I noticed when experimenting that it's also not possible to divide one derivative by another. For example, if I have two counters, bytes transferred and api calls made - both of which are constantly going up - how would you calculate the mean bytes per api call?
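Until the query language can do this, both calculations can be done client-side after fetching each measurement separately. The following is a minimal sketch, not an InfluxDB API: it assumes each query result has already been reduced to a `{timestamp: value}` dict, and the helper names (`join_on_time`, `percent_used`, `ratio_of_increase`) are made up for illustration.

```python
def join_on_time(left, right):
    """Inner-join two {timestamp: value} series on identical timestamps."""
    return {t: (left[t], right[t]) for t in sorted(left.keys() & right.keys())}


def percent_used(used, total):
    """Percentage of `total` taken by `used` at each shared timestamp."""
    return {
        t: 100.0 * u / tot
        for t, (u, tot) in join_on_time(used, total).items()
        if tot  # skip points where the total is zero
    }


def ratio_of_increase(numerator, denominator):
    """Increase of one counter divided by the increase of another,
    measured between the first and last shared timestamps."""
    joined = join_on_time(numerator, denominator)
    times = list(joined)
    if len(times) < 2:
        return None
    (n0, d0), (n1, d1) = joined[times[0]], joined[times[-1]]
    return (n1 - n0) / (d1 - d0) if d1 != d0 else None


# Hypothetical sample data standing in for the disk_used / disk_total results.
disk_used = {0: 50, 10: 60, 20: 75}
disk_total = {0: 200, 10: 200, 30: 200}
print(percent_used(disk_used, disk_total))  # {0: 25.0, 10: 30.0}
```

For the two-counter case, `ratio_of_increase(bytes_transferred, api_calls)` gives the mean bytes per API call over the window, provided both counters only ever increase and were sampled at the same timestamps.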

hexluthor commented 9 years ago

:+1: I work with sensor networks and find this limitation frustrating. For example, I wish to compute weighted averages like this:

SELECT sum(oxygen_percentage.value * flow_rate.value) / sum(flow_rate.value) FROM oxygen_percentage, flow_rate WHERE site_id = '3'

But InfluxDB returns nothing. Even SELECT oxygen_percentage.value FROM oxygen_percentage doesn't work. Using 0.9.3-rc1 master (0163945).
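The weighted average itself is simple to compute outside the database once both series are in hand. A minimal sketch, assuming each measurement has been fetched separately into a `{timestamp: value}` dict and that only timestamps present in both series should count (the function name is hypothetical):

```python
def weighted_average(values, weights):
    """sum(value * weight) / sum(weight) over timestamps present in both series."""
    common = values.keys() & weights.keys()
    total_weight = sum(weights[t] for t in common)
    if total_weight == 0:
        return None  # no overlapping points, or all weights are zero
    return sum(values[t] * weights[t] for t in common) / total_weight


# Hypothetical readings for one site.
oxygen_percentage = {1: 20.0, 2: 22.0, 3: 21.0}
flow_rate = {1: 1.0, 2: 3.0, 4: 5.0}
print(weighted_average(oxygen_percentage, flow_rate))  # 21.5
```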

ghost commented 9 years ago

Same here. I'd also like to calculate values across different series like:

select * from mysql_value where type='mysql_commands' and type_instance='show_tables' + select * from mysql_value where type='mysql_commands' and type_instance='show_databases'

Cheers, Szop

bbinet commented 9 years ago

Same as @hexluthor, I feel this is very limiting: if we need to correlate data coming from various sensors, we currently have to write all data as fields in the same measurement... But would it be a good idea in terms of data structure to have a single measurement with more than 50 fields? Will it impact query performance? Also, this sensor data is not always logged at the same sampling frequency, so it is not always possible to combine the data in the same measurement if we want to keep the data that has a high sampling frequency.

I'm not comfortable with distorting the data structure (dropping natural data organization) because of technical limitations. In the sysadmin world, it would be like putting all the cpu, ram, disk, and apache response time metrics in the same measurement for the sole purpose of being able to correlate apache response time with cpu, ram, or disk metrics.
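The sampling-frequency mismatch can be handled client-side by averaging each series into common time buckets before combining them, which is roughly what a GROUP BY time(...) does per measurement. A minimal sketch under that assumption, with integer timestamps and hypothetical helper names:

```python
from collections import defaultdict


def bucket_mean(points, width):
    """Average (timestamp, value) points into fixed-width buckets keyed by bucket start."""
    sums = defaultdict(lambda: [0.0, 0])
    for t, v in points:
        start = t - t % width
        sums[start][0] += v
        sums[start][1] += 1
    return {start: s / n for start, (s, n) in sorted(sums.items())}


def multiply_aligned(a, b, width):
    """Multiply two series after aligning both onto the same buckets."""
    a_buckets, b_buckets = bucket_mean(a, width), bucket_mean(b, width)
    shared = sorted(a_buckets.keys() & b_buckets.keys())
    return {t: a_buckets[t] * b_buckets[t] for t in shared}


# A fast sensor (every time unit) and a slow one (every two time units).
fast = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0)]
slow = [(0, 10.0), (2, 20.0)]
print(multiply_aligned(fast, slow, 2))  # {0: 20.0, 2: 120.0}
```

Each series stays in its own measurement at its native rate; only the derived result is downsampled.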

bbinet commented 9 years ago

Also, what are the actual technical issues that prevent InfluxDB from supporting queries with simple math operations across measurements?

corylanou commented 9 years ago

This was recently changed to a "feature request", which means it will be evaluated for future releases to decide whether we are going to add it. There is a workaround right now: save a calculated field when you write data, such as storing another field for oxygen_percentage.value * flow_rate.value. I understand this isn't ideal, but it might get you moving forward.

Otherwise, I think these requests are sane, but they will take some work. I believe sum() / sum() is supposed to work already, but I recall seeing a bug about math still not behaving properly.
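The write-time workaround could look roughly like the sketch below, which builds one line-protocol point carrying both raw readings plus their product. The measurement name `gas_sample`, the field name `oxygen_flow`, and the function itself are assumptions for illustration; the sketch also does no escaping of special characters in names or tag values, and it requires both readings to be available to the writer at the same moment.

```python
def point_with_product(measurement, tags, oxygen_percentage, flow_rate, timestamp_ns):
    """Build a line-protocol string that stores the precomputed product as a field."""
    fields = {
        "oxygen_percentage": oxygen_percentage,
        "flow_rate": flow_rate,
        # Derived field, so queries can later use sum(oxygen_flow) directly.
        "oxygen_flow": oxygen_percentage * flow_rate,
    }
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


print(point_with_product("gas_sample", {"site_id": "3"}, 20.5, 2.0, 1442837856000000000))
# gas_sample,site_id=3 oxygen_percentage=20.5,flow_rate=2.0,oxygen_flow=41.0 1442837856000000000
```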

bbinet commented 9 years ago

@corylanou about the workaround you're talking about: should the oxygen_percentage.value * flow_rate.value field be created when new points are written, or is there a way to compute it afterwards in a continuous query?

corylanou commented 9 years ago

Yes, I believe you should be able to do that in a CQ and then you can select from that retention policy.

srfraser commented 9 years ago

How can we do it in a continuous query? I thought the syntax of normal queries and continuous ones was the same, so if it's possible in one, it should be possible in the other.

corylanou commented 9 years ago

Instead of sum(value * value), you create a CQ with select val * val as newval, and then you can select sum(newval) from the new data that the CQ calculated.

srfraser commented 9 years ago

And that works across measurements? Using @bbinet's example, this would work?

select oxygen_percentage.value * flow_rate.value as newmeasurement from oxygen_percentage, flow_rate

corylanou commented 9 years ago

Hmm, it should, but I just tried this basic test and it crashed the server :cry:

> create database math
> use math
Using database math
> insert mul a=1,b=2
> select * from mul
name: mul
---------
time                            a       b
2015-09-21T12:17:36.377625368Z  1       2

> select a*b as c from mul
ERR: Get http://localhost:8086/query?db=math&q=select+a%2Ab+as+c+from+mul: EOF

I logged another issue here: https://github.com/influxdb/influxdb/issues/4183

srfraser commented 9 years ago

and that was only from one measurement :)

corylanou commented 9 years ago

Hopefully this is a central bug in our post-processing that when fixed will fix all of it. I'll see if I can fix it today. It appears to be just a bad reference while putting the math together, so it might be a quick fix.

bbinet commented 9 years ago

Thanks @corylanou, but as @srfraser said in his previous comment, your example uses a single measurement: is it supposed to work with multiple measurements? I thought that continuous queries were the same as normal queries, so if math does not work across multiple measurements in a normal query, it would not work in a continuous query either. Is that wrong?

corylanou commented 9 years ago

Ah, yes, I keep forgetting we don't calculate across values, although in a simple query we should support this. The biggest problem is type checking and overflow: when you take an unsigned int and multiply it by a float, etc., we need to be able to convert both operands to a common type for the math without overflowing.
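To make the type-checking and overflow concern concrete, here is a minimal sketch of one possible promotion rule. It is not InfluxDB's implementation; the rules are assumptions chosen for illustration: any float operand promotes the result to float64, two unsigned operands stay unsigned, and every other integer combination must fit in a signed 64-bit value.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
UINT64_MAX = 2**64 - 1


def multiply(a, a_type, b, b_type):
    """Multiply two typed values and return (result, result_type).

    Raises OverflowError when an integer result does not fit the common type.
    """
    if "float64" in (a_type, b_type):
        # Float promotion: very large integers may lose precision here.
        return float(a) * float(b), "float64"
    result = a * b  # Python ints are arbitrary precision, so check the range explicitly
    if a_type == b_type == "uint64":
        if not 0 <= result <= UINT64_MAX:
            raise OverflowError("result does not fit in uint64")
        return result, "uint64"
    if not INT64_MIN <= result <= INT64_MAX:
        raise OverflowError("result does not fit in int64")
    return result, "int64"


print(multiply(3, "uint64", 0.5, "float64"))  # (1.5, 'float64')
```

Series that share the same type skip the promotion branch entirely, but the overflow check is still needed for integers.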

bbinet commented 9 years ago

OK, I see. It would be great if cross-measurement calculations were possible at least for series that share the same type (since no type conversion would be needed).

drmclean commented 8 years ago

+1 We REALLY want this for our use-case!

thunderstumpges commented 8 years ago

+1 here too! https://groups.google.com/forum/#!topic/influxdb/B1q-x5uUqTg

malnor commented 8 years ago

+1, really missing this feature.

Millnert commented 8 years ago

+1

alintuhut commented 8 years ago

+1

clongbottom commented 8 years ago

+1

xaniasd commented 8 years ago

:+1:

deepujain commented 8 years ago

:+1:

plieningerweb commented 8 years ago

:+1:

cxreg commented 8 years ago

+1

migibert commented 8 years ago

+1

graphex commented 8 years ago

+1 thought I was going crazy, but this is a pretty substantial omission that might mean I've got to use another project instead of influx. Many times there is just no way to get correlated information into the same measurement. Even after an arduous journey with CQs, I only found that tags aren't included in CQ writes, so there is no way to even fan-in with multiple CQs. Why were the MERGE and JOIN features from 0.8 dropped without there being a replacement? With the 0.9 documentation recommending that the optimal way to structure things is to have many series and a single field named “value” (or some other key of your choice) used consistently across all series, and there apparently being no way to migrate from that kind of structure to the sort recommended at https://docs.influxdata.com/influxdb/v0.10/concepts/schema_and_data_layout/, I'm worried we're left hanging.

A viable CQ approach would be OK, but it is a lot more work than simply joining time-grouped measurements at query time, the way that influx used to work.

catchagain commented 8 years ago

@graphex not that it solves the main problem, but tags are in fact included in CQ writes if the CQs have something like group by time(30m), * in them.

adrianlzt commented 8 years ago

+1

HarasimowiczKamil commented 8 years ago

:+1:

cyberflow commented 8 years ago

+1

jsternberg commented 8 years ago

The proposed syntax above likely won't work since it conflicts with another potential query.

> insert cpu value.host=2
> select * from cpu
name: cpu
---------
time                    value.host
1460469115659777269     2

This seems to currently be a valid query. @pauldix any ideas what syntax we should use for this kind of feature?
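The conflict can be illustrated with a small sketch that lists every way a dotted reference could be read. The schema representation and the function name are hypothetical, not part of InfluxDB; the point is only that `value.host` can name either a field that literally contains a dot or a field `host` qualified by a measurement `value`.

```python
def interpretations(reference, schema, from_measurements):
    """Return every possible reading of `reference`.

    schema maps measurement name -> set of field names.
    """
    found = []
    # Reading 1: the whole string is a field name in one of the FROM measurements.
    for measurement in from_measurements:
        if reference in schema.get(measurement, ()):
            found.append(("field", measurement, reference))
    # Reading 2: "measurement.field" as in the syntax proposed earlier in this thread.
    if "." in reference:
        measurement, field = reference.split(".", 1)
        if field in schema.get(measurement, ()):
            found.append(("qualified", measurement, field))
    return found


schema = {"cpu": {"value.host"}, "value": {"host"}}
print(interpretations("value.host", schema, ["cpu", "value"]))
# [('field', 'cpu', 'value.host'), ('qualified', 'value', 'host')]
```

Two results for one reference means the parser would need some other marker, such as quoting or a different separator, to tell the readings apart.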

andremiller commented 8 years ago

I'm also looking for a way to do math between two measurements.

I've got one measurement for Volts, another for Amps. The data is being provided by two different pieces of equipment. I would like to multiply the Volts value with the Amps value (time correlated) to get a calculated Watts value.
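Because the two devices are unlikely to report at identical timestamps, a client-side calculation has to pick a correlation rule. The sketch below assumes one such rule: each Volts sample is paired with the most recent Amps sample at or before it. The function names are hypothetical, and both inputs are assumed to be lists of `(timestamp, value)` sorted by timestamp.

```python
from bisect import bisect_right


def latest_at(times, values, t):
    """Value of the most recent sample at or before time t, or None if there is none."""
    i = bisect_right(times, t)
    return values[i - 1] if i else None


def watts(volts, amps):
    """Multiply each volts sample by the latest amps sample seen so far."""
    amp_times = [t for t, _ in amps]
    amp_values = [v for _, v in amps]
    result = []
    for t, v in volts:
        a = latest_at(amp_times, amp_values, t)
        if a is not None:  # no amps reading yet, so no power value
            result.append((t, v * a))
    return result


volts = [(0, 230.0), (10, 231.0), (20, 229.0)]
amps = [(5, 2.0), (18, 3.0)]
print(watts(volts, amps))  # [(10, 462.0), (20, 687.0)]
```

Other rules, such as averaging both series into shared time buckets, are equally valid; which one is right depends on how quickly the readings change.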

ukclivecox commented 8 years ago

+1

kepi commented 8 years ago

:+1:

martinb69 commented 8 years ago

+1 Really need this; not having it is a great miss.

phil-fu commented 8 years ago

+1

ArturasRa commented 8 years ago

+1

mjad-org commented 8 years ago

+1

nmilford commented 8 years ago

+1

mrecht commented 8 years ago

+1

skburgart commented 8 years ago

+1

sbengo commented 8 years ago

+1

habnabit commented 8 years ago

Can I request that this issue be locked? I'd like to receive notifications about actual updates, and not just random "+1"s.

jsternberg commented 8 years ago

Done. If you are a person interested in this issue, please add a 👍 reaction to the top of the message instead of a +1 comment and then click "Subscribe".

jsternberg commented 8 years ago

I am unlocking this to continue meaningful discussion on the issue. Please refrain from adding meaningless +1's. This is a high traffic issue with a lot of subscribers. If you want to express your approval for the feature, please use a reaction at the top of the issue.

Unfortunately, locking an issue also locks GitHub reactions. I did not know that when I locked it.

meenakshi-panda commented 8 years ago

Hi,

I am currently using Grafana and InfluxDB for monitoring purposes. I have two measurements:

Measurement 1: Domain, Available Capacity, Threshold
Measurement 2: Domain, Peak TPS

My use case is to plot a graph when the Peak TPS exceeds the Threshold, which involves both measurements. Can you please suggest how I can use data from two measurements to plot the graph when the condition (Peak TPS > Threshold) is satisfied?
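One way to get this today is to run the comparison outside InfluxDB and write the result back as its own series for Grafana to plot. A minimal sketch, assuming each measurement has been fetched into a dict keyed by `(domain, timestamp)`; the function name and sample data are hypothetical:

```python
def breaches(peak_tps, thresholds):
    """Return the (domain, timestamp) keys where Peak TPS exceeds the Threshold."""
    shared = peak_tps.keys() & thresholds.keys()
    return sorted(key for key in shared if peak_tps[key] > thresholds[key])


peak_tps = {("a.example", 0): 90, ("a.example", 60): 130, ("b.example", 0): 40}
thresholds = {("a.example", 0): 100, ("a.example", 60): 100, ("b.example", 0): 50}
print(breaches(peak_tps, thresholds))  # [('a.example', 60)]
```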

bal2ag commented 8 years ago

I have a web service which calls multiple downstream services on each request (the services called may change based on the request). I have various timing measurements in my application's components: cache put/get, call time for each downstream service, etc.

The ability to perform mathematics across these measurements would enable rich and sophisticated graphs to expose data such as what % of the request is spent in each timing component while allowing the measurements to still be independent of each other (I don't want to include all timing metrics as fields in a "request_timings" measurement because some timings are independent of the web request - for example, Redis/cache timing metrics are not just used per request but by other application components).

More importantly this enables alarming on arbitrary "calculated" or "derived" measurements which is extremely useful for creating precise, unambiguous alarms.
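As an illustration of the derived value described here, the percentage breakdown for a single request could be computed as in the sketch below. The component names are made up, and the timings are assumed to have already been collected from their separate measurements for the same request.

```python
def time_shares(component_ms):
    """Percentage of the summed time spent in each timing component."""
    total = sum(component_ms.values())
    if total == 0:
        return {}
    return {name: 100.0 * ms / total for name, ms in component_ms.items()}


timings = {"cache_get": 5.0, "service_a": 30.0, "service_b": 15.0}
print(time_shares(timings))  # {'cache_get': 10.0, 'service_a': 60.0, 'service_b': 30.0}
```

An alarm on such a value, for example "service_a takes more than half of the request", needs the same cross-measurement arithmetic inside the database or in whatever evaluates the alert.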

trequartista10 commented 8 years ago

This feature is extremely important for time series market data for stocks and in other financial systems. Consider this simple example: you have market data ticks with, say, price and size, and you want to derive the notional value across all ticks, something like (price * size). This seems infeasible in the current setup. Also, joining on timestamps and tags could be error-prone.

The current schema seems to work well only for independent measurements like sensors or CPU.