timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/
Other
17.91k stars 882 forks source link

[Feature]: chain refresh policies #6010

Open d-sooter opened 1 year ago

d-sooter commented 1 year ago

What problem does the new feature solve?

Continuous aggregates are awesome, and stacking them on top of each other in a hierarchical way is awesome too. There can be funny situations that happen when refreshing the aggregates when one cagg refreshes before the other ones.

This is my setup

cagg (2 min) -> cagg(10 min) -> cagg(1 hour)

Since they are all dependant on the two min cagg it makes sense to refresh them in a way dependant on the lowest cagg. If not in the 10 min refresh runs before the 2 min refresh has run it will no longer pick up the values.

What does the feature do?

The ideas would be to chain the refresh policies so that the dependant caggs are refreshed after the base cagg has been refreshed.

Implementation challenges

No response

jvanns commented 1 year ago

This would definitely make sense. Right now, I account for this by fudging job timer offsets in the configuration but describing an explicit dependency chain would definitely be better.

d-sooter commented 7 months ago

Are there any updates here? I am thinking of creating a function that works its way down to the base table and the update it bottom up.

Open to any ideas/tips

leppaott commented 2 months ago

Would need one refresh policy for that whole chain so that it indeed refresh from bottom to up without needing to guess how long each refresh takes. Secondarily also needs to support refresh_continuous_aggregate(refresh_chain => true) for refreshing manually the whole chain with given top-agg.

Although I guess you can implement this yourself with a job that runs refresh_continuous_aggregates.

nimbit-software commented 1 month ago

Maybe a possibility would be just to chain the jobs and have one job call the next was soon as its done.

Current its not possible to call refresh_continuous_aggregates in a loop from within a job

mkindahl commented 1 month ago

Although it is quite clear what the intentions are here, it is not very clear how this should be designed clearly. In this case, your intention is to make sure that the 10 minute job (for example) starts after the 2 minute job, but after which execution instance?

When the 10 minute job is scheduled, there are already a bunch of 2 minute instances that executed before it, so does that mean that it is OK to execute the 10 minute job? If not, why not? It is perfectly reasonable to have a requirement that the 10 minute job first compute the value on the preceding 2 minute jobs and then compute the 10 minute job.

Also, if we generalize the example and pick something that is not even multiples of each others, what does that mean? If you, e.g., have 3 minute jobs and 10 minute jobs, what does not mean to chain the events?

I think what you want is something that allows jobs that are scheduled at the same time to have a priority. This would require a fixed schedule for the job. It would also require you to be able to give an arbitrary priority to the job, since you might actually want the jobs to execute as outlined above: the 10 minute job should first execute on the preceding 2 minute jobs and then the 2 minute job should execute.