timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug]: Unable to create continuous aggregate on a secondary dimension #5744

Open JamieD9 opened 1 year ago

JamieD9 commented 1 year ago

What type of bug is this?

Unexpected error

What subsystems and features are affected?

Continuous aggregate

What happened?

With the following SQL, I am able to create a CAGG for a time bucket on the dttm column, but not for the dt column (also declared as a timestamp, though ideally it would be a date).

The error is:

ERROR: time bucket function must reference a hypertable dimension column

I expect the CAGG to be created in both cases.

Furthermore, even when using dttm, I am unable to create a CAGG using a time bucket with an offset or origin, e.g. replacing the time_bucket in the first CAGG definition with: time_bucket('2 minutes'::interval, dttm, origin:='2022-01-01T12:01:00') as start_dttm

TimescaleDB version affected

2.11.0

PostgreSQL version used

14

What operating system did you use?

WSL2 Ubuntu 20.04

What installation method did you use?

Deb/Apt

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

drop table if exists test cascade;
create table test
(
    dttm timestamp,
    dt timestamp,
    value int
);
select create_hypertable('test', 'dttm');
SELECT add_dimension('test', 'dt', chunk_time_interval => INTERVAL '1 day');

insert into test values ('2022-01-01T12:00:00', '2022-01-01', 1), ('2022-01-01T12:01:00', '2022-01-01', 2), ('2022-01-01T12:02:00', '2022-01-01', 3),
                        ('2022-01-01T12:03:00', '2022-01-01', 4),('2022-01-01T12:04:00', '2022-01-01', 5);

select time_bucket('2 minutes'::interval, dttm), time_bucket('2 minutes'::interval, dttm, origin:='2022-01-01T12:01:00'), dttm,
       time_bucket('5 d'::interval, dt), time_bucket('5 d'::interval, dt, origin := '2000-01-01'), dt from test;
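(Editor's aside: the origin arithmetic that time_bucket performs in the select above can be mirrored outside the database. The sketch below is an illustration, not TimescaleDB code; it snaps a timestamp down to the start of a width-sized bucket aligned to a given origin.)

```python
from datetime import datetime, timedelta

def bucket(ts, width, origin):
    """Mirror of time_bucket(width, ts, origin): snap ts down to the start
    of the width-sized bucket aligned to origin."""
    # timedelta // timedelta floors toward negative infinity, so timestamps
    # earlier than the origin correctly land in earlier buckets
    n = (ts - origin) // width
    return origin + n * width

two_min = timedelta(minutes=2)
origin = datetime(2022, 1, 1, 12, 1)
print(bucket(datetime(2022, 1, 1, 12, 0), two_min, origin))  # 2022-01-01 11:59:00
print(bucket(datetime(2022, 1, 1, 12, 2), two_min, origin))  # 2022-01-01 12:01:00
```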

create materialized view testcadttm with (timescaledb.continuous) as
select
    time_bucket('2 minutes'::interval, dttm) as start_dttm,
    count(value) as value
from test
group by start_dttm with no data;

select * from testcadttm;

create materialized view testcadt with (timescaledb.continuous) as
select
    time_bucket('1 d'::interval, dt) as start_dt,
    count(value) as value
from test
group by start_dt with no data;
jnidzwetzki commented 1 year ago

Hello @JamieD9,

Thanks for reaching out. Hypertables are partitioned using time partitioning (the primary partitioning column) and optional space partitioning (the further partitioning columns). By using add_dimension() you add a space partition to the hypertable. According to the documentation, "continuous aggregates require a time_bucket on the time partitioning column of the hypertable".

So, the dt column cannot be used to create a continuous aggregate. However, the error message "time bucket function must reference a hypertable dimension column" should be improved to specifically mention that a time/primary partitioning column is required.

> Furthermore even when using the dttm I am unable to create a CAGG using a time bucket with an offset or origin

If you would like to use an offset or origin in the time bucket function, please use the time_bucket_ng function instead (see https://github.com/timescale/timescaledb/issues/2265 and https://www.timescale.com/forum/t/error-while-using-origin-parameter-in-time-bucket/585 for more information).

In your example, the following works:

create materialized view testcadttm with (timescaledb.continuous) as
select
    timescaledb_experimental.time_bucket_ng('2 minutes'::interval, dttm, origin => TIMESTAMP '2022-01-01T12:01:00') AS start_dttm,
    count(value) as value
from test
group by start_dttm with no data;

test2=# select * from testcadttm;
     start_dttm      | value 
---------------------+-------
 2022-01-01 11:59:00 |     1
 2022-01-01 12:01:00 |     2
 2022-01-01 12:03:00 |     2
(3 rows)

Best regards Jan

JamieD9 commented 1 year ago

Thank you @jnidzwetzki for the explanation; that makes sense, and perhaps the experimental function will get us a bit further forward. The actual use case here is that we often have measurements with an observation timestamp that includes some lag. Normally this is not a problem, but around the end of the day we can see a measurement recorded against the next date, with the correct measurement date held in another field; in extreme cases this delay can be hours to days. We require both the observation timestamp and the metric date for different calculations, so some way of creating a CAGG over the metric date rather than the observation timestamp would be very handy.
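(Editor's aside: the mismatch described above can be sketched with hypothetical data. A reading observed just after midnight still belongs to the previous day's metric date, so aggregating by observation day would miscount it.)

```python
from collections import Counter
from datetime import datetime, date

# Hypothetical late-arriving rows: (observation timestamp, metric date, value).
# The last reading was observed after midnight but belongs to the previous day.
rows = [
    (datetime(2022, 1, 1, 23, 58), date(2022, 1, 1), 10),
    (datetime(2022, 1, 1, 23, 59), date(2022, 1, 1), 11),
    (datetime(2022, 1, 2, 0, 5),   date(2022, 1, 1), 12),  # lagged observation
]

by_metric_date = Counter(metric_date for _, metric_date, _ in rows)
by_obs_date = Counter(obs.date() for obs, _, _ in rows)

# Grouping by observation day splits the lagged row into 2022-01-02;
# grouping by the metric date keeps all three readings together.
print(by_metric_date[date(2022, 1, 1)])  # 3
print(by_obs_date[date(2022, 1, 1)])     # 2
```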

JamieD9 commented 1 year ago

One final note on the space vs. time partition: I was perhaps a bit thrown by the output of

select * from timescaledb_information.dimensions where hypertable_name = 'test';

+-----------------+---------------+----------------+-----------+---------------------------+--------------+---------------------------------------------+----------------+----------------+
|hypertable_schema|hypertable_name|dimension_number|column_name|column_type                |dimension_type|time_interval                                |integer_interval|integer_now_func|
+-----------------+---------------+----------------+-----------+---------------------------+--------------+---------------------------------------------+----------------+----------------+
|public           |test           |1               |dttm       |timestamp without time zone|Time          |0 years 0 mons 7 days 0 hours 0 mins 0.0 secs|null            |null            |
|public           |test           |2               |dt         |timestamp without time zone|Time          |0 years 0 mons 1 days 0 hours 0 mins 0.0 secs|null            |null            |
+-----------------+---------------+----------------+-----------+---------------------------+--------------+---------------------------------------------+----------------+----------------+

Is the time dimension_type here reflective of the data type rather than the actual dimension type?

Thanks

jnidzwetzki commented 1 year ago

Hello @JamieD9,

Thanks for getting back to us. Your assumption is correct. At the moment, the dimension_type attribute in the view timescaledb_information.dimensions is based on the column's data type rather than the actual dimension type. Improving the naming is on our roadmap and should land in a future TimescaleDB version.

test2# \d+ timescaledb_information.dimensions
[...]
View definition:
 SELECT ht.schema_name AS hypertable_schema,
    ht.table_name AS hypertable_name,
    rank() OVER (PARTITION BY dim.hypertable_id ORDER BY dim.id) AS dimension_number,
    dim.column_name,
    dim.column_type,
        CASE
            WHEN dim.interval_length IS NULL THEN 'Space'::text
            ELSE 'Time'::text
        END AS dimension_type,
        CASE