trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.47k stars 3.01k forks source link

BETWEEN results in full table scan for TIMESTAMP columns of differing lengths #23428

Open huw0 opened 2 months ago

huw0 commented 2 months ago

Summary - Queries using BETWEEN result in entire table scans when the timestamp filter is a greater length than the column in the underlying table.


This example shows how to reproduce using PostgreSQL, however I believe this impacts all connectors.

Example database:

CREATE TABLE pgsql.public.test(
  id INTEGER,
  stamp timestamp(0)
);

Example data:

INSERT INTO pgsql.public.test VALUES(1, timestamp '2020-01-01');
INSERT INTO pgsql.public.test VALUES(2, timestamp '2020-01-02');
INSERT INTO pgsql.public.test VALUES(3, timestamp '2030-01-02');

The following queries all work successfully and filter logic is correctly pushed down to PostgreSQL. Only two rows are returned to Trino.

SELECT * FROM pgsql.public.test WHERE stamp < timestamp '2025-01-01';
SELECT * FROM pgsql.public.test WHERE stamp < timestamp '2025-01-01 00:00:00.000';

SELECT * FROM pgsql.public.test WHERE stamp BETWEEN timestamp '2019-01-01' and timestamp '2025-01-01';
SELECT * FROM pgsql.public.test WHERE stamp BETWEEN timestamp '2019-01-01 00:00:00' and timestamp '2025-01-01 00:00:00';

SELECT * FROM pgsql.public.test WHERE stamp > timestamp '2019-01-01 00:00:00' and stamp < timestamp '2025-01-01 00:00:00';
SELECT * FROM pgsql.public.test WHERE stamp >= timestamp '2019-01-01 00:00:00' and stamp <= timestamp '2025-01-01 00:00:00';

SELECT * FROM pgsql.public.test WHERE stamp > timestamp '2019-01-01 00:00:00.000' and stamp < timestamp '2025-01-01 00:00:00.000';
SELECT * FROM pgsql.public.test WHERE stamp >= timestamp '2019-01-01 00:00:00.000' and stamp <= timestamp '2025-01-01 00:00:00.000';

However taking the BETWEEN equivalent for the final query:

SELECT * FROM pgsql.public.test WHERE stamp between timestamp '2019-01-01 00:00:00.000' and timestamp '2025-01-01 00:00:00.000';

This does a full scan of the table which on large tables are substantially slower.

My expectation is that these final two queries should result in identical query plans.

wendigo commented 2 months ago

@raunaqmorarka PTAL

martint commented 1 month ago

This may be related to the inability to push down certain expression into the connector due to intervening implicit casts.

@huw0, can you provide the result of EXPLAIN ANALYZE for one of the queries that works as you expect and the one that doesn't?

huw0 commented 1 month ago

I don't have the explain output handy right now, but it is definitely cast related.

I think the problem is that the casts are removed by the UnwrapCastInComparison rule.

However this rule is never hit because DesugarBetween won't convert the query to >= and <= when the query contains a cast.

Adding an additional trivial switch case to DesugarBetween when the input is a cast solves the problem. However a correct implementation probably needs to be more elegant to confirm that the types are valid for conversion to >= and <=.

As pushdown works when the query is manually converted to WHERE >= x and <= y I think this is definitely a missing condition in the optimiser.