[Feature]: Sliding window cagg aggregation

timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.

https://www.timescale.com/

Other

17.46k stars 875 forks source link

[Feature]: Sliding window cagg aggregation #7205

Open leppaott opened 4 weeks ago

leppaott commented 4 weeks ago

What problem does the new feature solve?

https://github.com/timescale/timescaledb/issues/1639#issuecomment-590857755

A common use-case for this feature would be what PipelineDB called 'sliding windows': e.g. we want aggregation for last 30 days (so we want to continuously remove aggregated data which belongs to dates before that). It seems that currently TDB doesn't support it. Using the feature requested in this bug, it can be implemented manually but continuously dropping old data. Having a way to specify the desired time period on the cagg itself would be even better (so that it happens automatically)

What does the feature do?

Drop old data from chunks and re-aggregate (all?) the results according to it so that we could have i.e. "last 10 days max/min scores/prices` in a cagg just by having a sliding window aggregation between now and last 10 days.

Implementation challenges

No response

jonatas commented 3 weeks ago

I'm curious to understand if the rolling is not already doing it.


SELECT 
    bucket, 
    average(rolling(stats_agg) OVER fifteen_min), 
    stddev(rolling(stats_agg) OVER fifteen_min, 'pop'),
    kurtosis(rolling(stats_agg) OVER fifteen_min, 'pop')
FROM (SELECT 
        time_bucket('1 min'::interval, ts) AS bucket, 
        stats_agg(val)
     FROM measurements
     GROUP BY 1) AS stats
WINDOW fifteen_min as (ORDER BY bucket ASC RANGE '15 minutes' PRECEDING);```

leppaott commented 3 weeks ago

@jonatas ERROR: Window functions are not supported by continuous aggregates.invalid continuous aggregate query Yeah I've seen https://github.com/timescale/timescaledb-toolkit/blob/main/docs/rolling_average_api_working.md but we can't do this at query time so should be on a cagg as much as possible. Other than that, it's about what we'd like perhaps without stats_agg. Like for sum() only would be much lighter to just keep one accumulator. Or count(distinct) without toolkit methods.

jonatas commented 3 weeks ago

Like for sum() only would be much lighter to just keep one accumulator. Or count(distinct) without toolkit methods.

Have you seem the other _agg functions? I guess counter_agg can make it as it just stores the accumulated value.

leppaott commented 3 weeks ago

Like for sum() only would be much lighter to just keep one accumulator. Or count(distinct) without toolkit methods.

Have you seem the other _agg functions? I guess counter_agg can make it as it just stores the accumulated value.

Sure, that just counts, doesn't have a sum? But this wasn't about that. All of those are useless if you cannot have the rolling cagg with it. Sure they can be used to rollup() to larger time windows but just need the last 15 mins to stay last 15 mins.

jonatas commented 2 weeks ago

Sure, that just counts, doesn't have a sum?

Yes, it's specialized on count. But I think then stats_agg will be doing the sum.