timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

timeseries functions #30

Closed tr8dr closed 7 years ago

tr8dr commented 7 years ago

I've been testing timescaledb on a few hundred million rows of data and am happy with the performance so far. One area where the feature set is lacking is some basic functions often present in time-series and/or analytic databases:

  1. median
  2. histogram
  3. percentile
  4. first / last
  5. cumsum

While there are (more complex) analogs for cumsum and first/last in Postgres, it would be useful to have these available as efficient operations on a group, where applicable.

Aside from Spark and a couple of other "big data" platforms, I was previously using Netezza, which has a fairly large library of analytical functions.

cevian commented 7 years ago

Hi, thanks for your issue. I think this points to some improvements we could make. We would be eager to work with you to improve our available functions/interfaces. Let me go through each point one by one:

For median/percentile, Postgres already provides functions for this, namely percentile_cont (https://www.postgresql.org/docs/current/static/functions-aggregate.html). An example query for the median cpu_usage would be:

SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY cpu_usage)
FROM cpu;
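The same function also works per group. A minimal sketch, assuming a hypothetical cpu table with host and cpu_usage columns:

```sql
-- Median cpu_usage for each host (hypothetical schema)
SELECT host,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY cpu_usage) AS median_cpu
FROM cpu
GROUP BY host;
```

Other percentiles work the same way, e.g. percentile_cont(0.95) for a p95.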

For first/last, if you want the entire first/last row, you can add row_number() and then filter on it as appropriate. But I agree it would also be useful to have first/last as aggregates. Then you could do things like:

SELECT host, last(cpu_usage ORDER BY time)  
FROM cpu
GROUP BY host

Would that syntax be what you would like?
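The row_number() workaround mentioned above can be sketched like this in plain PostgreSQL, again assuming a hypothetical cpu table with host, time, and cpu_usage columns:

```sql
-- Latest full row per host, without any extension support
SELECT host, time, cpu_usage
FROM (
  SELECT *,
         row_number() OVER (PARTITION BY host ORDER BY time DESC) AS rn
  FROM cpu
) ranked
WHERE rn = 1;
```

This returns the whole row, whereas a first/last aggregate would only return a single column value per group.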


For cumsum, in case you didn't already know, the syntax is sum(sum(column)) OVER (ORDER BY group). For example:

SELECT host, sum(sum(cpu_usage)) OVER(ORDER BY host) 
FROM cpu 
GROUP BY host;
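For a more typical time-series cumsum (a running total over time within each series), the same nested-aggregate pattern can be partitioned by host and ordered by a time bucket. A sketch, assuming the same hypothetical cpu table:

```sql
-- Running total of per-minute cpu_usage, computed separately for each host
SELECT host,
       date_trunc('minute', time) AS minute,
       sum(sum(cpu_usage)) OVER (PARTITION BY host
                                 ORDER BY date_trunc('minute', time)) AS running_total
FROM cpu
GROUP BY host, date_trunc('minute', time);
```

The inner sum is the per-group aggregate; the outer sum is the window function that accumulates those group totals in order.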

We will document this case better in our docs.


For histograms, would you want an output of one-row-per-histogram-bucket (i.e. http://tapoueh.org/blog/2014/02/21-PostgreSQL-histogram) or some kind of array-typed return that would represent the histogram as one column (i.e. https://wiki.postgresql.org/wiki/Aggregate_Histogram)?

Do you have any example interfaces/syntax for histograms in other systems that you really like?
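For the one-row-per-bucket style (the first link), plain PostgreSQL can already get close with the built-in width_bucket function. A sketch over the hypothetical cpu table, bucketing cpu_usage into 5 buckets over [0, 100):

```sql
-- Frequency count per histogram bucket using width_bucket
SELECT width_bucket(cpu_usage, 0, 100, 5) AS bucket,
       count(*) AS freq
FROM cpu
GROUP BY bucket
ORDER BY bucket;
```

The array-typed style (the second link) instead returns the whole histogram as a single value per group, which composes more easily with GROUP BY.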

tr8dr commented 7 years ago

Ok, thanks. Netezza, which is based on Postgres, uses percentile_cont as well. Looking on the web, I saw that people had written functions to sort and then select to implement this, but perhaps percentile_cont was not available in an older version of Postgres. Also, the window functions (at least on Netezza) could not be used in certain contexts when combined with grouping.

So it sounds like the functionality that is either missing or inconvenient is: histogram, first, last.

cevian commented 7 years ago

PR #43 adds first and last aggregate functions. Sample usage:

SELECT device_id, last(cpu, time) 
FROM metrics 
GROUP BY device_id;
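Since the PR adds both aggregates, they can be combined in one pass over the data. A sketch against the same metrics table:

```sql
-- Earliest and latest cpu value per device in a single scan
SELECT device_id,
       first(cpu, time) AS first_cpu,
       last(cpu, time)  AS last_cpu
FROM metrics
GROUP BY device_id;
```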

It wasn't clear to me what the best interface for a histogram function would be. Do you have any examples you particularly like?

mfreed commented 7 years ago

Also, I just tested that the aggregate histogram function that @cevian linked to does work successfully with TimescaleDB. E.g.,

SELECT device_id, COUNT(*), histogram(cpu_avg_1min, 0.0, 100.0, 5) 
   FROM readings 
   GROUP BY device_id ORDER BY device_id LIMIT 5;

 device_id  | count |        histogram        
------------+-------+-------------------------
 demo000000 | 10000 | [0:4]={9010,930,0,60,0}
 demo000001 | 10000 | [0:4]={9103,831,0,66,0}
 demo000002 | 10000 | [0:4]={9009,931,0,60,0}
 demo000003 | 10000 | [0:4]={9097,827,0,76,0}
 demo000004 | 10000 | [0:4]={8972,949,0,79,0}

That said, we'll still plan to add a native histogram() in the future.

mfreed commented 7 years ago

@tr8dr Our new docs (released earlier this week) highlight the use of these functions:

http://docs.timescale.com/api#advanced-analytics

mfreed commented 7 years ago

Going to close this issue unless there are additional questions.