[[feature collection]] requested Functions and query operators

beckettsean commented 8 years ago

This issue contains a list of related feature requests that are not on the near-term roadmap. The feature requests in this issue are all new functions that have been requested. If you want to request a function not already listed please make a comment on this issue, and we will add it to the checklist.

Aggregations

[x] moving averages function: https://github.com/influxdata/influxdb/issues/77
[ ] histogram https://github.com/influxdata/influxdb/issues/3674
[x] cumulative counts https://github.com/influxdata/influxdb/issues/813
[ ] HDR histogram https://github.com/influxdata/influxdb/issues/739
[ ] histogram with log-scaled buckets https://github.com/influxdata/influxdb/issues/984
[ ] time weighted aggregations https://github.com/influxdata/influxdb/issues/7445
[ ] logarithmic mean https://github.com/influxdata/influxdb/issues/2482
[x] integrals https://github.com/influxdata/influxdb/issues/1400
[ ] cumulative distribution function: https://github.com/influxdata/influxdb/issues/7261
Selectors
[ ] faster, slightly less accurate percentages calculation https://github.com/influxdata/influxdb/issues/513
[ ] lag variables https://github.com/influxdata/influxdb/issues/142
[ ] lead variables https://github.com/influxdata/influxdb/issues/143
[ ] upper bound, lower bound https://github.com/influxdata/influxdb/issues/659
[ ] average "gauges" with awareness of time spent in state https://github.com/influxdata/influxdb/issues/1115
[x] mode https://github.com/influxdata/influxdb/issues/1823
Transformations
[x] trigonometric functions (sine, cosine, etc.) https://github.com/influxdata/influxdb/issues/659
[ ] hyperloglog https://github.com/influxdata/influxdb/issues/717
[x] dot-product https://github.com/influxdata/influxdb/issues/5095
Operators
[x] difference https://github.com/influxdata/influxdb/issues/1825
[x] elapsed time between events https://github.com/influxdata/influxdb/issues/1856
[x] exponents and logarithms https://github.com/influxdata/influxdb/issues/659
[x] ceiling and floor functions https://github.com/influxdata/influxdb/issues/3691
[ ] trend operators: linear and least square https://github.com/influxdata/influxdb/issues/5358
[x] DIFFERENCE of two fields https://github.com/influxdata/influxdb/issues/761
InfluxQL enhancements
[x] general support for nested functions https://github.com/influxdata/influxdb/issues/834
- [x] percentile + derivative https://github.com/influxdata/influxdb/issues/5150
- [ ] percentage of values above/below given value (inverse PERCENTILE)
- [x] Top accepts nested functions https://github.com/influxdata/influxdb/issues/2467, https://github.com/influxdata/influxdb/issues/5345
[x] CAST into different data types https://github.com/influxdata/influxdb/issues/2911
[ ] fill(previous) without GROUP BY https://github.com/influxdata/influxdb/issues/3633
[ ] User defined functions https://github.com/influxdata/influxdb/issues/6891
[ ] Boolean type cast https://github.com/influxdata/influxdb/issues/7562

tbowmo commented 8 years ago

Is there any timeline for these functions? (I'm very interested in the aggregate integral functions, to calculate kWh from watts). Seems that a feature request was opened a year ago.

beckettsean commented 8 years ago

There is no timeline for each specific function/feature. All work on new functions was on hold while the query engine was refactored, and that refactoring was merged into InfluxDB 0.11. We plan to introduce a few functions with each release from now on.

timgriffiths commented 8 years ago

Would also be great to get #3633 Added to your list this feature would be very useful on the client side when graphing data.

beckettsean commented 8 years ago

Thanks, @timgriffiths, added!

dengliu commented 8 years ago

Filed #6208 would like to see holt-winter and some build-in time series anomaly detection functions in influx QL. E.g. https://blog.twitter.com/2015/introducing-practical-and-robust-anomaly-detection-in-a-time-series http://robjhyndman.com/hyndsight/yahoo-data/

beckettsean commented 8 years ago

@dengliu can you open a new issue so I can reference it here?

dengliu commented 8 years ago

@beckettsean I have updated the issue id in my previous comment.

beckettsean commented 8 years ago

@dengliu as referenced in the other issue, anomaly detection functions should go into Kapacitor. The results can be fed back into InfluxDB for visualization, but that doesn't require any new functionality in InfluxDB.

nelg commented 8 years ago

Awesome work. I'm +1 for #813 - I've got data in Influxdb that I'd like to run it on.

pavelpep commented 8 years ago

+1 for exponents and logarithms #659

brandoncazander commented 8 years ago

The lack of a mode function is really hurting us in switching from 0.8.9. In particular, we need to be able to select the most common occurrence of a specific string over a group by, which we have to do in post-processing now.

beckettsean commented 8 years ago

@brandoncazander COUNT(DISTINCT()) will give you the frequencies for each period in the GROUP BY, and you can pull the largest number from that client-side. Not a solution but perhaps a workaround.

fluffy commented 8 years ago

The IOT stuff we are doing really needs integrals. Any idea on timeline of when we might get this?

sferrett commented 8 years ago

+1 for lag (or some kind of per-series time-shift) so that we can show current and -1week data on the same panel in grafana.

something like: select mean(value), mean(lag(value, 1w)) from metric where time > now() - 1d group by time(1m)

would return two fields, being the per minute averages for the last 1d and for the same period but one week prior.

If there is already a way to do this, I would appreciate being hit with the clue-stick.

beckettsean commented 8 years ago

@sferrett That's an excellent suggestion. Would you mind creating a new issue for it so we can track that separately? That's a bit beyond the level of adding a simple function.

@gunnaraasen any slick Grafana tricks to accomplish the same goal with current InfluxQL?

gunnaraasen commented 8 years ago

Grafana has a per-graph setting to add a time shift (see the screenshot below). I don't think it's possible to shift a single series within the same graph.

mrecht commented 8 years ago

+1 one for adding more math functions

From the top of my head the function would be:

sin, cos, tan
asin, acos, atan
correlation
variance / covariance
sqrt
exp
abs
log, ln
div, mod

mitar commented 8 years ago

Would it be possible to create some way to extend InfluxDB with custom functions? Maybe to load some dynamic library at load time? This could allow community to experiment with different functions?

beckettsean commented 8 years ago

@mitar Kapacitor supports arbitrary User Defined Functions in almost any language, and it can be used to batch-process InfluxDB data.

mitar commented 8 years ago

Hm, but to my understanding, Kapacitor would have to read the whole span of a time-series to be able to compute the custom downsampled version of data, every time I would try to read it at a lower resolution?

beckettsean commented 8 years ago

@mitar Kapacitor can store its results in InfluxDB, just like a CQ. Instead of running a CQ inside InfluxDB, you would just have Kapacitor do a batch query, process the UDF, and then write the results back into the downsampled measurement in InfluxDB.

mitar commented 8 years ago

But I do not want to do this for every query users come up with, at different timespans with different downsampling? I like InfluxDB because it can compute downsampled values efficiently without moving data around. Reading InfluxDB into another process and then storing it back seems inefficient to me? Or are you saying that there is not much performance difference between InfluxDB doing downsampling on the fly, and Kapacitor doing it? Using Kapacitor to downsample the same amount of data takes similar resources than using InfluxDB to do such a query?

mrecht commented 8 years ago

@mitar my thoughts exactly. The goal should be (at least that's mine) to do as much as possible on the db without sending data over the network. I would like to store the data in the finest granularity available and the derive data from it. And then only send over the derived data to the client. In this (my) case drived data are not only aggregated time buckets (e.g. seconds built from ticks), but computations build on those buckets.

beckettsean commented 8 years ago

@mitar I cannot think of a way to allow for ad hoc queries using user defined functions. We do accept PRs for new functions, or you can perform the function calculations client-side for now.

beckettsean commented 8 years ago

@mrecht sending as little data around the network as possible is our goal, too, but opening up core InfluxDB for arbitrarily defined user functions introduces too many challenges around maintaining performance and stability. I don't expect UDFs to be part of InfluxDB at any time in the 1.x version line.

@pauldix are UDFs for InfluxDB anywhere on the timeline yet?

mitar commented 8 years ago

So what about allowing InfluxDB to load some dynamic code (.so, .dll) to do that? Many other databases allow modules which extend the behavior.

So maybe there should just be a way to load a custom module, and then you could say "module:foo" which would simply call that module's function as a function in InfluxDB?

pauldix commented 8 years ago

UDFs aren't on the timeline yet. Best bet is to use Kapacitor for that for now.

beckettsean commented 8 years ago

@mitar @mrecht I encourage you to open a new issue for user-defined functions in InfluxDB. It's not that we think it's a bad idea and are saying no, it's just horribly complex and not a current priority.

mitar commented 8 years ago

Done: #6891

MatMeredith commented 8 years ago

Really need SUM to support nested functions such as MEAN. See issue https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/influxdb/uXdYy9JA6_E/cZRzQo7FBAAJ. This is likely to prevent us from deploying InfluxDB. As noted in that issue though this is just one respect in which InfluxDB's query language seems sorely limitted!

nordewal commented 8 years ago

In regards to @MatMeredith comment: this is the one biggest missing feature for us as well!

ghost commented 8 years ago

As to the issue #5150, you can get around that with a continuous query that does a

NON_NEGATIVE_DERIVATIVE(MEAN(value)), 1s) INTO

a new measurement and then pull a 95th or whatever percentile out of it. This is perfect for a 95th percentile of interface traffic calculation, but alas, Grafana cannot yet draw that as a line sice the percentile is really a point, not a series of values.

mvadu commented 8 years ago

@beckettsean Can you please add #5345 to this? Its related to Top aggregate.

mvadu commented 8 years ago

@beckettsean also #6723 to make it easier to query on specific patterns of time windows (e.g. Every day 9AM traffic pattern).

ddimtirov commented 8 years ago

A good addition would be exponential moving average (EMA), useful to minimize the MA jumping when a large value enters and exits the window.

https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average

I was also a bit uncertain about the overlap between top/bottom and min/max - would be nice if the manual mentions if there is any performance difference between these two, besides the nicer syntax when your use case fits the min/max

bal2ag commented 8 years ago

Histograms are for continuous numeric data. What about discrete non-numeric data?

Specifically, an example use case:

I have a measurement which contains some data of unbounded dimension (say, userID). I store it in a field to avoid exploding my series cardinality, and because I don't really need to query on this data for a specific userID. What I really want is to be able to see how many points occur in this series for each userID. This could be displayed nicely in a bar graph which shows, say, the 5 userIDs with the highest number of points for this series.

I apologize if this is already possible with current functionality, but I haven't found a way to do it without selecting the field (and thus all the points) directly from the measurement, and transforming it into aggregate counts on the client side.

mitar commented 8 years ago

@bal2ag: Yes, I need that as well. I opened something similar for RRDTool as well: https://github.com/oetiker/rrdtool-1.x/issues/261

kostasb commented 8 years ago

We have received user requests describing the need for a latest selector function: #7089

ngsankha commented 8 years ago

Can geometric and harmonic mean functions be also added to the list? They are useful for estimating non-normal distributions.

Will-Beninger commented 8 years ago

Casting was implemented however I think this fairly important use case was skipped out on: select mean(value)::integer from firstmeasurement

Currently it's only possible to do the following: (2 steps)

select mean(value) into secondmeasurement from firstmeasurement   
select mean_value::integer into thirdmeasurement from secondmeasurement

Is it possible to add support for this feature because it otherwise necessitates a CQ or the use of Kapacitor to get this feature.

When trying to do so now you receive:

select mean(value)::integer from firstmeasurement
ERR: error parsing query: found ::, expected FROM at line 1, char 15

yoyomikeyc commented 8 years ago

Modulus division would be greatly appreciated.

ie, the % operator.

natemurthy commented 8 years ago

@beckettsean I would also like to see #6723 added to your list as well. This would allow for better retrieval of periodic or seasonal data. What's the processing for vetting this feature for your near-term roadmap?

andyfeller commented 8 years ago

@beckettsean Are there any plans of supporting MySQL pattern of SUM(IF(EXPR, 1, 0)) commonly used for conditional aggregating?

MariaDB [test]> select * from http_requests;
+----+-------------------+---------------+
| id | url               | response_code |
+----+-------------------+---------------+
|  1 | http://google.com |           200 |
|  2 | http://google.com |           200 |
|  3 | http://google.com |           200 |
|  4 | http://google.com |           200 |
|  5 | http://google.com |           200 |
|  6 | http://google.com |           200 |
|  7 | http://google.com |           404 |
|  8 | http://google.com |           503 |
|  9 | http://google.com |           500 |
+----+-------------------+---------------+
9 rows in set (0.00 sec)

MariaDB [test]> select url, sum(if(response_code = 200, 1, 0)) as success, sum(if(response_code between 400 and 499, 1, 0)) as failures_client, sum(if(response_code >= 500, 1, 0)) as failures_server from http_requests group by url;
+-------------------+---------+-----------------+-----------------+
| url               | success | failures_client | failures_server |
+-------------------+---------+-----------------+-----------------+
| http://google.com |       6 |               1 |               2 |
+-------------------+---------+-----------------+-----------------+
1 row in set (0.00 sec)

My original plan was to create a CQ where I could aggregate a # of requests made in a time frame to calculate aggregate successes and failures in a single measurement.

I know that Kapacitor lambda supports IF operator, however was hoping InfluxQL might natively support this. This has the flavor of #4619 except it involves supporting expressions.

Thanks for consideration!

AndreCAndersen commented 7 years ago

I feel the point "dot-product #5095" is too narrow in this feature collection. That feature request originally talked about using arithmetic operations inside of aggregates in general, not only the dot-product in particular. The dot-product was just a specific example of the problem, i.e., SUM(a*b), but there are plenty of other similar operations like SUM(a/b+c) or, in my case, the so called "micro_price" in an order book: MAX((bid*bid_vol+ask*ask_vol)/(bid_vol+ask_vol)) (or the low MIN): http://quant.stackexchange.com/a/24510/2792

pgeiger-xx commented 7 years ago

I've started work on integral function. Are there any quants or devs watching that have an opinion about default # of slices or a good rule of thumb relating data points to slices?

wwentland commented 7 years ago

Is anybody working on histogram() ? It is a crucial function for us unfortunately and doesn't seem to be the focus of any development work at the moment.