timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/
Other
17.96k stars 883 forks source link

Support time_bucket_gapfill in continuous aggregates #1324

Open andness opened 5 years ago

andness commented 5 years ago

I tried to create a continuous aggregate using time_bucket_gapfill but was disappointed to find this was not possible. Are there plans to support this?

davidkohn88 commented 5 years ago

This isn't possible, but you can create the continuous aggregate using the normal time bucket and then do time_bucket_gapfill when you query the view, this should pretty closely approximate what you're trying to do.

andness commented 5 years ago

@davidkohn88, I don't think that would solve my case because I need to join on the time column, and to make that efficient I need it indexed.

To give a little background on what I'm doing: I have a property of my data which is stored as intervals in the source data, i.e. "The device was in state X between time T1 and T2". So the way I solve this is brute-force: For each hour of the day I materialize the state that the device is in for that hour. So this means that the rows:

["Device1", "StateA", 12:20, 17:30] ["Device1", "StateB", 17:30, 21:10]

Expand into

["Device1", "StateA", 12:00] ["Device1", "StateA", 13:00] ["Device1", "StateA", 14:00] ["Device1", "StateA", 15:00] ["Device1", "StateA", 16:00] ["Device1", "StateA", 17:00] ["Device1", "StateB", 18:00] ["Device1", "StateB", 19:00] ["Device1", "StateB", 20:00] ["Device1", "StateB", 21:00] ["Device1", "StateB", 22:00]

Using this approach I can get the state of the device for any given hour of the day by a simple join against this table.

Not the most intelligent approach, but it's simple and straightforward at the expense of using some disk space. I tried using a join query that would essentially look up the last device state value for each hour of the day and as you can imagine that query did not perform very well.

anton-brass commented 3 years ago

that only works if you do not have State changes during an full hour, otherwise it will be inaccurate

davidkohn88 commented 3 years ago

I didn't see the previous comment until now, but I will also say that the join, if structured correctly (as a lateral with order by time desc limit 1, and the correct indexes) should be reasonably performant...You could also build a GIST index on a range expression as well as the device_id (with btree_gist extension) and then do a containment query that should also be reasonably performant.

I think there are some other issues around this that I remember hearing about, namely, wanting to calculate the time spent in a given state over a period of time...that one is a little harder to accomplish nicely right now, and I've opened an issue to do that sort of thing in Timescale Analytics: https://github.com/timescale/timescale-analytics/issues/117

danthegoodman1 commented 3 years ago

This is definitely a pain point for us as well. We have to include it outside in the query on the cagg, which complicates the SQL statements we have since we have to double up on some aggregate functions (which don't end up doing anything but makes the query more complicated).

Mykyta-Chernenko commented 1 year ago

Hi, I need it for my use-case because selecting it in runtime is not productive enough, are there any plans to release the functionality?

chrishop commented 1 year ago

I'm also having this same issue, and same requirement as @Mykyta-Chernenko . We need to gapfill within the continuous aggregate for performance reasons.

melicheradam commented 1 year ago

+1 would be appreciated really

leppaott commented 1 year ago

+1, we'd like to use state_agg() and just the LOCF the previous state into next bucket where is no state change otherwise. https://github.com/timescale/timescaledb-toolkit/issues/479#issuecomment-1197493044 things people do seem complicated.

jflambert commented 10 months ago

+1 bummed to discover this has been an open issue for almost 5 years! The workaround isn't doing it for me.

jrodertqc commented 10 months ago

+1 would be very useful to have this feature.

goncalvesnelson commented 8 months ago

+1. This feature would be great and simplify some of our queries and improve overall performance.

leppaott commented 7 months ago

Would be very handy to configure refresh policy to run at every start of bucket and then it could always create the first sample already there.

xerubia commented 6 months ago

Would love this feature, I have the exact same problem that @andness mentioned at the start. Would love to know if there's a viable solution.

johanatan commented 3 months ago

+1 on this feature.

ljt7560868 commented 2 weeks ago

This problem has been going on for 5 years, why hasn't it been resolved yet!!!

landeholt commented 2 weeks ago

gapfill is an extremely important component of timeseries to get linear space. Too bad it can't be used in caggs, why is that? Is it deemed too computional expensive?