timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

Continuous Aggregates: support for secondary dimension #6148

Open muntdan opened 1 year ago

muntdan commented 1 year ago

What type of enhancement is this?

API improvement

What subsystems and features will be improved?

Continuous aggregate

What does the enhancement do?

Today, continuous aggregates support only a single time dimension, while hypertables support a secondary partitioning dimension. It would be great if continuous aggregates built on such tables could inherit support for the secondary dimension.
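For illustration, a hedged sketch of what this could look like (table and column names are made up; the cagg definition itself is valid today, the missing piece is having its materialized hypertable partitioned on `device_id` as well):

```sql
-- A hypertable with a secondary (hash) partitioning dimension: works today.
CREATE TABLE metrics (
  ts        timestamptz NOT NULL,
  device_id int         NOT NULL,
  value     double precision
);
SELECT create_hypertable('metrics', 'ts');
SELECT add_dimension('metrics', 'device_id', number_partitions => 4);

-- Desired: a continuous aggregate whose materialized hypertable inherits
-- the secondary dimension. Today it is partitioned on the time bucket only.
CREATE MATERIALIZED VIEW metrics_daily
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) AS bucket,
       device_id,
       avg(value) AS avg_value
FROM metrics
GROUP BY bucket, device_id;
```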

Implementation challenges

No response

roessland commented 12 months ago

In the meantime, is there a workaround for this?

Would it make sense to make a bunch of continuous aggregates, one for each partition key value? cagg_0, ..., cagg_f (for example, partition key = first letter of some uuid).

My use case is that I want to add compression to the materialized hypertable. To avoid slow compression jobs causing timeouts, I would like a reasonable size per chunk. But at the same time, I do not want too short a chunk duration, since querying a single entity across all chunks would require a massive UNION.
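A hedged sketch of the per-partition-value workaround described above, assuming a hypothetical `timeseries` hypertable with a uuid `id` column (plain WHERE clauses are allowed in continuous aggregate definitions):

```sql
-- One continuous aggregate per leading hex digit of the uuid.
-- Repeat for '1' .. 'f' to get cagg_0 .. cagg_f.
CREATE MATERIALIZED VIEW cagg_0
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', ts) AS bucket,
       id,
       avg(value) AS avg_value
FROM timeseries
WHERE substring(id::text, 1, 1) = '0'
GROUP BY bucket, id;
```

The obvious downside is that any query spanning partitions needs a UNION ALL over all sixteen views, and each view carries its own refresh policy.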

nckswt commented 3 months ago

I have a strong use case for this and have found it really hard to work around. I think this is closely related, but happy to open a separate issue.

I have a fleet of devices that execute "runs" that might last one or more days. I'd love to have those runs aggregated as timeseries data, with one single row per run.

```sql
SELECT
  time_bucket('3 days', timestamp) as bucket,
  max(data_signal),
  device_id,
  run_id
FROM timeseries_table
GROUP BY bucket, device_id, run_id
```

When I GROUP BY time_bucket, device_id, run_id I'll get multiple rows with the same device_id and run_id if the run has crossed the edge of the time bucket. Setting time_bucket significantly wider (like 30 days) leads to its own issues with materialization and bucket invalidation.
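One partial workaround is to keep the 3-day cagg and merge the boundary-crossing rows at query time by re-aggregating over the aggregate. A hedged sketch, assuming the cagg above is named `runs_cagg` and its max column is aliased `max_signal` (both names are made up); this keeps the query cheap because the cagg is already small, but it is not true per-run incremental materialization:

```sql
SELECT device_id,
       run_id,
       min(bucket)     AS start_bucket,
       max(max_signal) AS max_data_signal
FROM runs_cagg
GROUP BY device_id, run_id;
```

max-of-max and min-of-min compose correctly across buckets, which is why this particular query can be rolled up; an avg, for example, would need sum and count materialized separately.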

I wish I could do something like

```sql
SELECT
  min(timestamp) as start_time,
  max(data_signal),
  device_id,
  run_id
FROM timeseries_table
GROUP BY device_id, run_id
ORDER BY start_time DESC
```

for a continuous aggregate. That's not possible yet, right? As far as I'm aware, pg_ivm is the primary way to incrementally update aggregates like the one above; I would just much rather have all the docs/tools/functionality/support around continuous aggregates than that approach.
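For reference, a hedged sketch of the pg_ivm approach mentioned above, using pg_ivm's `create_immv` (view and column names are illustrative; pg_ivm maintains min/max incrementally, though deletes on min/max-tracked columns carry extra maintenance cost):

```sql
CREATE EXTENSION IF NOT EXISTS pg_ivm;

-- Incrementally maintained materialized view: one row per (device_id, run_id).
SELECT create_immv('runs_summary', $$
  SELECT device_id,
         run_id,
         min(timestamp)   AS start_time,
         max(data_signal) AS max_signal
  FROM timeseries_table
  GROUP BY device_id, run_id
$$);
```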

Screw buckets-only Caggs - min timeseries data is still timeseries data! :joy: