timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug] Query optimization: time-series table with +10M rows #6129

Open Pierre-Alexandre35 opened 11 months ago

Pierre-Alexandre35 commented 11 months ago

What type of bug is this?

Performance issue

What subsystems and features are affected?

Query executor

What happened?

Hello,

I have a table conditions with 10.5 million rows. The rows come from an IoT device that produces a data point every 3 seconds, so one year is around 10.5M rows.

I am running a simple SELECT * and the performance seems decent (around 1 second). However, when it comes to selecting only a definite sample (1 row per day, so 365 rows), it takes more than 30 seconds. I am not sure whether this is the expected behaviour performance-wise. This is the script used to create my schema:

CREATE TABLE
  public.conditions(
    "time" timestamp with time zone NOT NULL,
    i1 real NOT NULL,
    i2 real NOT NULL,
    i3 real NOT NULL,
    i4 real NOT NULL,
    i5 real NOT NULL,
    i6 real NOT NULL,
    i7 real NOT NULL,
    i8 real NOT NULL,
    i10 real NOT NULL,
    i11 real NOT NULL
  );

SELECT create_hypertable(
  'public.conditions',
  'time',
  chunk_time_interval => INTERVAL '1 day'
);
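
Side note: with chunk_time_interval => INTERVAL '1 day', one year of data should map to roughly 365 chunks. If I am reading the TimescaleDB 2.x information views correctly, this can be confirmed with:

-- Count the chunks backing the hypertable
-- (assumes the timescaledb_information.chunks view available in 2.x):
SELECT count(*)
FROM timescaledb_information.chunks
WHERE hypertable_name = 'conditions';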

I have performance issues retrieving my data with more "complex" queries:

simple SELECT:

query:

EXPLAIN ANALYZE
SELECT
  *
FROM
  conditions;

result:

-----------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.00..256229.97 rows=10511998 width=48) (actual time=0.038..940.653 rows=10511998 loops=1)
   ->  Seq Scan on _hyper_1_1_chunk  (cost=0.00..557.98 rows=28798 width=48) (actual time=0.035..3.801 rows=28798 loops=1)
   ->  Seq Scan on _hyper_1_2_chunk  (cost=0.00..558.00 rows=28800 width=48) (actual time=0.018..2.568 rows=28800 loops=1)
   ->  Seq Scan on _hyper_1_3_chunk  (cost=0.00..558.00 rows=28800 width=48) (actual time=0.005..2.791 rows=28800 loops=1)
   ->  Seq Scan on _hyper_1_4_chunk  (cost=0.00..558.00 rows=28800 width=48) (actual time=0.007..2.472 rows=28800 loops=1)
   ->  Seq Scan on _hyper_1_5_chunk  (cost=0.00..558.00 rows=28800 width=48) (actual time=0.007..2.629 rows=28800 loops=1)
   ->  Seq Scan on _hyper_1_6_chunk  (cost=0.00..558.00 rows=28800 width=48) (actual time=0.006..2.249 rows=28800 loops=1)
   ->  Seq Scan on _hyper_1_7_chunk  (cost=0.00..558.00 rows=28800 width=48) (actual time=0.005..1.950 rows=28800 loops=1)
   ->  Seq Scan on _hyper_1_8_chunk  (cost=0.00..558.00 rows=28800 width=48) (actual time=0.006..1.873 rows=28800 loops=1)
   ->  ... (remaining per-chunk Seq Scan lines truncated)

More complex SELECT:

query (only displays 1 row per day across 2022, i.e. 365 rows):

EXPLAIN ANALYZE
SELECT
  DISTINCT ON (DATE_TRUNC('day', time)) *
FROM
  conditions
WHERE
  DATE_TRUNC('day', time) BETWEEN '2022-01-01'
  AND '2022-12-31'
ORDER BY
  DATE_TRUNC('day', time),
  time;

result:

...
                           Filter: ((date_trunc('day'::text, "time") >= '2022-01-01 00:00:00+00'::timestamp with time zone) AND (date_trunc('day'::text, "time") <= '2022-12-31 00:00:00+00'::timestamp with time zone))
                     ->  Parallel Seq Scan on _hyper_1_364_chunk  (cost=0.00..608.82 rows=85 width=48) (actual time=0.024..5.996 rows=28800 loops=1)
                           Filter: ((date_trunc('day'::text, "time") >= '2022-01-01 00:00:00+00'::timestamp with time zone) AND (date_trunc('day'::text, "time") <= '2022-12-31 00:00:00+00'::timestamp with time zone))
                     ->  Parallel Seq Scan on _hyper_1_365_chunk  (cost=0.00..608.82 rows=85 width=48) (actual time=0.013..3.150 rows=14400 loops=2)
                           Filter: ((date_trunc('day'::text, "time") >= '2022-01-01 00:00:00+00'::timestamp with time zone) AND (date_trunc('day'::text, "time") <= '2022-12-31 00:00:00+00'::timestamp with time zone))
                     ->  Parallel Seq Scan on _hyper_1_1_chunk  (cost=0.00..608.80 rows=85 width=48) (actual time=0.020..1.652 rows=7200 loops=4)
                           Filter: ((date_trunc('day'::text, "time") >= '2022-01-01 00:00:00+00'::timestamp with time zone) AND (date_trunc('day'::text, "time") <= '2022-12-31 00:00:00+00'::timestamp with time zone))
 Planning Time: 40.823 ms
 JIT:
   Functions: 2929
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 167.781 ms, Inlining 0.000 ms, Optimization 31.081 ms, Emission 735.023 ms, Total 933.885 ms
 Execution Time: 33151.422 ms
(745 rows)

I have also tried the following, which is a slight improvement but still takes around 25 seconds:

SELECT DISTINCT ON (time_bucket('1 day', "time")) 
  time_bucket('1 day', "time") as day,
  i1,
  i2,
  i3,
  i4,
  i5,
  i6,
  i7,
  i8,
  i10,
  i11
FROM
  public.conditions
ORDER BY
  time_bucket('1 day', "time"), "time";
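
For what it's worth, a variant I have not benchmarked: keeping the filter as a plain range on the raw time column (rather than wrapping time in DATE_TRUNC), which as far as I understand lets the planner exclude chunks outside the range:

SELECT DISTINCT ON (time_bucket('1 day', "time"))
  time_bucket('1 day', "time") AS day,
  *
FROM public.conditions
-- plain range predicate on the partitioning column, so chunk exclusion can apply
WHERE "time" >= '2022-01-01' AND "time" < '2023-01-01'
ORDER BY time_bucket('1 day', "time"), "time";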

TimescaleDB version affected

2.5

PostgreSQL version used

14

What operating system did you use?

Apple M2 Pro

What installation method did you use?

Docker

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

No response

How can we reproduce the bug?

.
antekresic commented 11 months ago

Hi,

Looks like you might be hitting the JIT performance regression here, which is a more general PostgreSQL problem with partitioned tables. We have an outstanding issue about investigating whether we can do anything about it.

In the meantime, disabling JIT for these types of queries should help. You can do that for the role you use to connect, like this:

alter role mytestrole set jit=off;

Note: you will have to reconnect for this to take effect.
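
If you want to try it first without touching the role, a session-level toggle should work too (these are standard PostgreSQL settings, nothing Timescale-specific):

-- Disable JIT for the current session only (no reconnect needed):
SET jit = off;

-- Check what the current session is using:
SHOW jit;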

Let me know if this helps with the performance.

alex-tate commented 11 months ago

I haven't got anything to add about the JIT comment, but the slow-running query presented could be tackled with an AS OF join. This is currently the most-voted TimescaleDB feature request (see #271), but for now you could use a generated time series and a lateral join, which speeds up the specific query example given. Something similar to...

SELECT cond.*
FROM generate_series('2022-01-01 00:00:00'::timestamptz,
                     '2022-12-31 00:00:00'::timestamptz,
                     '1 day'::interval) AS gs(x)
CROSS JOIN LATERAL (
  -- first reading at or after each day boundary; served by the time index
  SELECT *
  FROM conditions
  WHERE conditions.time >= gs.x
  ORDER BY conditions.time ASC
  LIMIT 1
) AS cond;

I tested on a one-second sensor table (30M rows per year) that I have and got a 70 s run time on the original query versus a 100 ms run time for the lateral join method.
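
One caveat with the sketch above: if a day has no rows at all, the LIMIT 1 will pick up the first row of the following day instead. An untested variant that bounds each lookup to a single day:

SELECT cond.*
FROM generate_series('2022-01-01 00:00:00'::timestamptz,
                     '2022-12-31 00:00:00'::timestamptz,
                     '1 day'::interval) AS gs(x)
CROSS JOIN LATERAL (
  SELECT *
  FROM conditions
  WHERE conditions.time >= gs.x
    -- upper bound keeps a gap in the data from pulling in the next day's first row
    AND conditions.time < gs.x + INTERVAL '1 day'
  ORDER BY conditions.time ASC
  LIMIT 1
) AS cond;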