timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug]: TimescaleDB always performs a sequential scan in chunks instead of index #7071

Open Bykiev opened 6 days ago

Bykiev commented 6 days ago

What type of bug is this?

Performance issue

What subsystems and features are affected?

Query planner

What happened?

TimescaleDB always performs a sequential scan in chunks instead of index

TimescaleDB version affected

2.15.2

PostgreSQL version used

14.3

What operating system did you use?

Windows 10

What installation method did you use?

Other

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

Create a table:

CREATE TABLE IF NOT EXISTS public."Data"
(
    "ParameterId" numeric NOT NULL,
    "Value" double precision NOT NULL,
    "Quality" bit(1) NOT NULL,
    "Dt" timestamp with time zone NOT NULL
);

Generate random data in CSV:

 COPY (
 select t.ParameterId, t.Value, t.Quality, t.Dt
 from (SELECT 
    generate_series(1,180000000) AS Id, 
    round(random() * 15000) as ParameterId,
    random() * 1000 AS Value,
    ((round(random()))::integer::bit) as Quality,
    timestamp with time zone '2019-05-31 00:00:00' +
        random() * (timestamp with time zone '2024-05-31 23:59:59' -
                    timestamp with time zone '2019-05-31 00:00:00') as Dt) as t)    
    TO 'D:\data.csv' CSV HEADER;

Insert data into table:
COPY public."Data" FROM 'D:\data.csv' CSV HEADER;

Create extension:

CREATE EXTENSION IF NOT EXISTS timescaledb;

Create hypertable with chunk interval 6 months:

SELECT create_hypertable('"Data"', by_range('Dt', INTERVAL '6 month'),
    if_not_exists => TRUE,
    create_default_indexes => FALSE,
    migrate_data => TRUE);
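
As an optional sanity check (not part of the original report), the chunks created by the call above can be listed with TimescaleDB's show_chunks function:

-- Lists the chunks backing the hypertable, roughly one per 6-month slice of the data.
SELECT show_chunks('public."Data"');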

Create Index:

CREATE INDEX IF NOT EXISTS "IX_Dt_ParameterId"
    ON public."Data" USING btree
    ("Dt" DESC NULLS FIRST, "ParameterId" ASC NULLS LAST)
    WITH (deduplicate_items=True)
    TABLESPACE data;

Analyze query plan of such query:

SELECT "ParameterId", max("Dt")
 FROM "Data"
 where "Dt" <= '2024-05-23 04:24:04'
 group by "ParameterId"

For some reason only a sequential scan is performed on the chunks instead of an index scan. Why does this happen, and how can I get an index scan on each chunk?
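
For reference, the per-chunk plan can be inspected with a plain EXPLAIN; the invocation below is a generic PostgreSQL command added for illustration, not output captured from the reporter's system:

-- Show the actual per-chunk plan (and buffer usage) for the problematic query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "ParameterId", max("Dt")
FROM "Data"
WHERE "Dt" <= '2024-05-23 04:24:04'
GROUP BY "ParameterId";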
nikkhils commented 5 days ago

@Bykiev

your query basically includes ALL the chunks and almost all of the rows from all the chunks because of your WHERE clause. Since all of that data is being pulled out, an index scan is not very attractive to the planner.

where "Dt" <= '2024-05-23 04:24:04'

That is the reason [parallel] sequential scans get used in the query. If you tweak the query as below, you will see the index scan being used as you expect.

EXPLAIN SELECT "ParameterId", max("Dt")
FROM "Data"
where "Dt" > '2024-05-23 04:24:04'
group by "ParameterId";
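
One way to confirm that this is a cost-based choice rather than a missing-index problem is to discourage sequential scans for a single transaction and compare the estimated costs. This is a generic PostgreSQL diagnostic added for illustration, not a TimescaleDB-specific switch:

BEGIN;
-- Transaction-local planner setting: make sequential scans look prohibitively expensive.
SET LOCAL enable_seqscan = off;
EXPLAIN
SELECT "ParameterId", max("Dt")
FROM "Data"
WHERE "Dt" <= '2024-05-23 04:24:04'
GROUP BY "ParameterId";
ROLLBACK;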
Bykiev commented 5 days ago

Thanks, this makes sense. But because of this, the only solution I can see is to create multiple tables instead of partitioning and search by index in each table. It seems TimescaleDB is not a good fit for cases where we need to get the latest value for each group at a specific point in time.
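
For the "latest value per group at a given cutoff" pattern, a commonly used alternative is a per-parameter LATERAL lookup. The sketch below is illustrative only; it assumes an additional index with "ParameterId" as the leading column, which is not part of the setup above:

-- Assumed helper index (NOT part of the original reproduction steps).
CREATE INDEX IF NOT EXISTS "IX_ParameterId_Dt"
    ON public."Data" ("ParameterId", "Dt" DESC);

-- One short index descent per parameter: newest "Dt" at or before the cutoff.
SELECT p."ParameterId", d."Dt" AS last_dt
FROM (SELECT DISTINCT "ParameterId" FROM public."Data") AS p
CROSS JOIN LATERAL (
    SELECT "Dt"
    FROM public."Data"
    WHERE "ParameterId" = p."ParameterId"
      AND "Dt" <= '2024-05-23 04:24:04'
    ORDER BY "Dt" DESC
    LIMIT 1
) AS d;

If the set of parameter IDs is available from a separate dimension table, using it in place of the DISTINCT subquery avoids one full pass over the hypertable.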

jedwards1211 commented 5 days ago

To get the max over a large arbitrary time range efficiently with timescale, wouldn't the best way be to create hierarchical continuous max aggregates, and select a max from among the aggregate buckets covering your time range?

For example, if you had 1 second, 1 minute, 1 hour, and 1 day aggregates, and you want to get the max where "Dt" <= '2024-05-23 01:01:01', you could combine the maxes of the coarsest buckets that fully cover the range: whole-day buckets up to the start of 2024-05-23, then the 1-hour bucket for 00:00-01:00, the 1-minute bucket for 01:00-01:01, and the 1-second buckets up to 01:01:01.

I've been prototyping something similar to this to compute integrals over arbitrarily large time ranges efficiently.
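
As a sketch of what one level of such a hierarchy could look like (the view names, bucket widths, and column choices below are illustrative assumptions; stacking one continuous aggregate on another requires TimescaleDB 2.9 or later):

-- Hourly max per parameter (names and bucket widths are assumptions, not from this thread).
CREATE MATERIALIZED VIEW data_max_1h
WITH (timescaledb.continuous) AS
SELECT "ParameterId",
       time_bucket(INTERVAL '1 hour', "Dt") AS bucket,
       max("Dt") AS max_dt
FROM public."Data"
GROUP BY "ParameterId", bucket;

-- A coarser daily level stacked on top of the hourly aggregate.
CREATE MATERIALIZED VIEW data_max_1d
WITH (timescaledb.continuous) AS
SELECT "ParameterId",
       time_bucket(INTERVAL '1 day', bucket) AS bucket,
       max(max_dt) AS max_dt
FROM data_max_1h
GROUP BY "ParameterId", time_bucket(INTERVAL '1 day', bucket);

A query for an arbitrary cutoff could then take the max over whole daily buckets, fall back to hourly buckets for the partial day, and read only the small remainder from the raw hypertable.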

Bykiev commented 5 days ago

Thanks, this looks promising, though additional disk space is needed for the aggregates.

Bykiev commented 5 days ago

I found some other strange things: I've created a new hypertable where the min date is "2022-01-01 00:00:00.387116+03" and the max date is "2023-12-31 23:59:58.739976+03". Why did it create 3 chunks for a hypertable partitioned with by_range('Dt', INTERVAL '1 year') (I think there should be 2 chunks)? And when I execute a query with the condition where "Dt" >= '2022-01-01 00:00:00' and "Dt" <= '2022-11-30 04:24:04', EXPLAIN shows that it will scan 2 chunks sequentially. Why does this happen with TimescaleDB?

What does by_range('Dt', INTERVAL '1 year') mean? If I understand correctly, there should be 2 chunks, one for 2022 and one for 2023, shouldn't there?

nikkhils commented 5 days ago

@Bykiev we don't support calendar-based partitioning yet. So when the first value is added to a chunk, that chunk will have that value "X" as its starting value and "X + 1 year" as its end value.
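
A concrete way to see this is to list the actual chunk boundaries of the 1-year hypertable; the name used in the filter below is a placeholder, since that table is not named in this thread:

-- 'Data2' is a placeholder; substitute the 1-year hypertable's actual name.
SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'Data2'
ORDER BY range_start;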

Btw, it seems things are working here as intended. Please feel free to close this issue in that case. Thanks.