timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/
Other
17.88k stars 882 forks source link

[Bug]: Continuous aggregates recalculation of historical data #6548

Open TRUSTMEIMJEDI opened 9 months ago

TRUSTMEIMJEDI commented 9 months ago

What type of bug is this?

Other

What subsystems and features are affected?

Continuous aggregate

What happened?

I'm not a native English speaker, but as you read about continuous aggregates in the TimescaleDB documentation, you can see that the documentation contains conflicting information in various places.

My problem:

I have the following hypertable:

create table conditions
(
    timestamp   timestamp with time zone not null,
    datapoint   varchar                  not null,
    country_id    uuid                     not null,
    value       interval                 not null
);

create unique index conditions_timestamp_datapoint_country_id_uidx
    on conditions (timestamp, datapoint, country_id);

select create_hypertable('conditions', 'timestamp');

Following materialized view:

create materialized view country_datapoints_hourly
    with (timescaledb.continuous) as
select datapoint_type_id,
       country_id,
       time_bucket('1 hour', timestamp)       as bucket,
       extract(epoch from sum(value))::bigint as duration
from conditions
group by 1, 2, 3
with no data;

Following continuous aggregate policy:

select add_continuous_aggregate_policy('country_datapoints_hourly',
                                       start_offset => interval '4 hours',
                                       end_offset => interval '1 minute',
                                       schedule_interval => interval '1 hour');

And when new historical data is inserted to the table conditions and timestamp is behind the materialized threshold/watermark, and also behind offset time for the incoming continuous aggregate policy related to these documentations (provided below) I expect data to be reevaluated/recalculated for materialized view.

  1. https://www.timescale.com/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/ Chapter: Invalidation Engine and Materialization Engine

  2. https://github.com/timescale/timescaledb/blob/main/tsl/src/continuous_aggs/README.md Chapter: Bookkeeping and Internal State and The Invalidation Log and Threshold

While different parts of the documentation are stating opposite and reevaluation of historical data can not be done out of the box and it must be done manually by calling CALL refresh_continuous_aggregate

  1. https://docs.timescale.com/use-timescale/latest/continuous-aggregates/troubleshooting/#updates-to-previously-materialized-regions-arent-shown-in-real-time-aggregates

  2. https://docs.timescale.com/use-timescale/latest/continuous-aggregates/troubleshooting/#continuous-aggregate-doesnt-refresh-with-newly-inserted-historical-data

TimescaleDB version affected

2.10.2

PostgreSQL version used

14.1

What operating system did you use?

Ubuntu 22.04 x64

What installation method did you use?

Docker

What platform did you run on?

Google Cloud Platform (GCP)

Relevant log output and stack trace

No response

How can we reproduce the bug?

1

create table conditions
(
    timestamp   timestamp with time zone not null,
    datapoint   varchar                  not null,
    country_id    uuid                     not null,
    value       interval                 not null
);

create unique index conditions_timestamp_datapoint_country_id_uidx
    on conditions (timestamp, datapoint, country_id);

select create_hypertable('conditions', 'timestamp');

2

create materialized view country_datapoints_hourly
    with (timescaledb.continuous) as
select datapoint_type_id,
       country_id,
       time_bucket('1 hour', timestamp)       as bucket,
       extract(epoch from sum(value))::bigint as duration
from conditions
group by 1, 2, 3
with no data;

3

select add_continuous_aggregate_policy('country_datapoints_hourly',
                                       start_offset => interval '4 hours',
                                       end_offset => interval '1 minute',
                                       schedule_interval => interval '1 hour');

4 Add recent data, and call manually CALL refresh_continuous_aggregate 5 Add data before threshold/watermark

konskov commented 9 months ago

hello @TRUSTMEIMJEDI , thank you for reaching out. The expected behavior of a continuous aggregate refresh is to process invalidations within its specified range. So in your example, if there is data inserted that’s older than 4 hours, the refresh will not materialize these.

TRUSTMEIMJEDI commented 9 months ago

@konskov Could TimescaleDB Team update these documents, because currently two different behaviors are described for continuous aggregates?

https://www.timescale.com/blog/continuous-aggregates-faster-queries-with-automatically-maintained-materialized-views/ Chapter: Invalidation Engine and Materialization Engine

https://github.com/timescale/timescaledb/blob/main/tsl/src/continuous_aggs/README.md Chapter: Bookkeeping and Internal State and The Invalidation Log and Threshold

VS

https://docs.timescale.com/use-timescale/latest/continuous-aggregates/troubleshooting/#updates-to-previously-materialized-regions-arent-shown-in-real-time-aggregates

https://docs.timescale.com/use-timescale/latest/continuous-aggregates/troubleshooting/#continuous-aggregate-doesnt-refresh-with-newly-inserted-historical-data