timescale / timescaledb-toolkit

Extension for more hyperfunctions, fully compatible with TimescaleDB and PostgreSQL 📈
https://www.timescale.com
Other
376 stars 47 forks source link

Allow aggregation of histograms #352

Open RoryMMMM opened 2 years ago

RoryMMMM commented 2 years ago

Problem I am working with a database of billions of timeseries GPS points. I would like to determine the spatial/temporal/class distribution of certain variables with the end goal to be able to retrieve statistical priors for certain classes in order to find anomalies or fill in blank classes. Example: Show me the distribution of speed for various classes at point X,Y. Return: a histogram per class of the speed in that area.

The way I'm aggregating the points isn't a simple one since there are data issues to worry about (gaps, errors in variables, transmission errors etc) so I'm doing some spatial gap filling and aggregation. In short, it's pretty expensive to generate a histogram for variable A at point X,Y.

Here is a dummy example for a set of weather stations that record temperature over time. Imagine that calculating the histogram for the temperature was very expensive, and was precalculated somewhere else, and that a user would like to calculate the average and histogram over a longer time period:

WITH daily_temp as (
SELECT 
    station_id,
    date(timestamp) as date,
    avg(temp) as avg_temp,
    histogram(temp, -50, 50, 100) as daily_temp_histogram
FROM weather_data
GROUP BY station_id, date)

SELECT 
    station,
    avg(avg_temp) as avg_temp,
    some_aggregate(temp_histogram) as temp_histogram
FROM daily_temp
WHERE date BETWEEN <start_time> AND <end_time> 
GROUP BY station

Solution: I would like to precalculate all the histograms each day and then do an aggregate of histograms to get weekly/monthly/yearly aggregates for users. Having a function that took equal width arrays (of width X) and returned an array, (of width X) that contained the sum or avg of the individual elements would be great. For example:

A = {1,2,3,4}
B = {1,0,0,2}
SELECT histogram_sum(A,B) as X, histogram_avg(A,B) as Y
X = {2,2,3,6}
Y = {1,1,1.5,3}

Alternatives I've looked at materializing the steps I use to create the histograms but the disk size became too large. I'm currently having a look at whether percentile approx functions could work but can't make a call on that yet.

jerryxwu commented 2 years ago

@RoryMMMM, thanks for the feature request. For reference, I'm adding a link to the original conversation thread on the community slack. https://timescaledb.slack.com/archives/C4GT3N90X/p1645698676734519

RoryMMMM commented 2 years ago

I've been looking at the percentile_approx function combined with percentile_agg and it's almost what I need. My issue would be solved if instead of getting the "percentile_approx" from the agg it was possible to retrieve the "probability_approx".

I've taken a look at the UddSketch (I've got no idea how this works....) and I wonder whether it would be possible to write a function that looks at the bucket the input variable falls into, vs the total count of variables. I can't seem to figure out the bin sizes from the json return...

For example:

RoryMMMM commented 2 years ago

Related to https://github.com/timescale/timescaledb-toolkit/issues/205

tucnak commented 1 year ago

You should consider the machine learning use case. We're using PostgresML for both xgboost train and predict-type workloads from SQL directly which accepts a feature vector. It would be nice to use some view over multiple continuous aggregates as training set relation, and histogram— I feel like— has a huge potential as a way to distil certain pre-cooked series data into features that would find its way to a feature vector. We're looking to explore survival analysis which is time-to-event type prediction for our time series forecasting, and more powerful set of tools for time series feature engineering would make our lives so much easier.

Turns out properties of a distribution / frequency domain data is fairly instructive for predictive power!