trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Planning with long IN lists can be very slow #5950

Open · sopel39 opened this issue 4 years ago

sopel39 commented 4 years ago

For example:

    // assumes the usual static imports for this test style:
    // java.util.stream.IntStream.range and java.util.stream.Collectors.joining
    @Test
    public void testExtremelyLargeIn()
    {
        // query should not fail
        String longValues = range(0, 100000)
                .mapToObj(Integer::toString)
                .collect(joining(", "));
        computeActual("SELECT orderkey FROM orders WHERE orderkey IN (" + longValues + ")");
    }

Planning takes 6 minutes. This is mostly because we run typeAnalyzer.getTypes(session, symbolAllocator.getTypes(), expression) in various places in the optimizer (mostly from io.prestosql.sql.planner.TypeAnalyzer#getTypes). Currently, during planning we need to re-analyze expression types because type information is not part of the plan itself.
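
For illustration only, a minimal sketch of the kind of per-expression memoization that would make the repeated type lookups cheap; MemoizingTypeAnalyzer and its shape are hypothetical, not Trino's actual API:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Hypothetical sketch, not Trino's API: the planning cost comes from
    // re-running type analysis on the same expression tree many times.
    // Caching the result per expression makes every call after the first cheap.
    final class MemoizingTypeAnalyzer<E, T>
    {
        private final Map<E, T> cache = new ConcurrentHashMap<>();
        private final Function<E, T> analyze; // the expensive type analysis

        MemoizingTypeAnalyzer(Function<E, T> analyze)
        {
            this.analyze = analyze;
        }

        T getTypes(E expression)
        {
            // analyze each distinct expression at most once;
            // assumes expression nodes have value-based equals/hashCode
            return cache.computeIfAbsent(expression, analyze);
        }
    }

A cache like this only works if expression nodes compare by value, which is part of why carrying type information in the plan itself, as the comment above suggests, is the more robust fix.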

ByPatrick commented 3 years ago

Hi all, I see the same issue with the PostgreSQL connector: if the IN list has more than 32 items, Trino takes a very long time to schedule the query, but with 32 or fewer items the query is fast.

findepi commented 3 years ago

@ByPatrick what is the planning time with an IN list of 32, and what is it with an IN list of 33? Also, what is your query?

ByPatrick commented 3 years ago

@findepi

My SQL is like this:

    SELECT pkey, column1, column3
    FROM schema.table
    WHERE column2 <> 'somevalue1'
      AND pkey IN ('XXXXXXX', 'XXXXXXX', ... , 'XXXXXXX')

Each 'XXXXXXX' is a placeholder for a string like 'AB1234567890'.

The EXPLAIN ANALYZE of this SQL is below.

    // IN clause item count = 32
    Fragment 1 [SOURCE]
        CPU: 3.56ms, Scheduled: 23.92ms, Input: 0 rows (0B); per task: avg.: 0.00 std.dev.: 0.00, Output: 0 rows (0B)
        Output layout: [pkey, column1, column3]
        Output partitioning: SINGLE []
        Stage Execution Strategy: UNGROUPED_EXECUTION
        ScanFilterProject[table = postgresql:schema.table schema.table constraint on [pkey] columns=[pkey:varchar:varchar, column1:varchar:varchar, column2:varchar:varchar, column3:varchar:varchar], grouped = false, filterPredicate = ("column2" <> CAST('somevalue1' AS varchar))]
            Layout: [pkey:varchar, column1:varchar, column3:varchar]
            Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
            CPU: 3.00ms (100.00%), Scheduled: 23.00ms (100.00%), Output: 0 rows (0B)
            Input avg.: 0.00 rows, Input std.dev.: ?%
            column2 := column2:varchar:varchar
            column3 := column3:varchar:varchar
            column1 := column1:varchar:varchar
            pkey := pkey:varchar:varchar
            Input: 0 rows (0B), Filtered: ?%

    // IN clause item count = 33
    Fragment 1 [SOURCE]
        CPU: 10.44s, Scheduled: 48.98s, Input: 8038730 rows (0B); per task: avg.: 8038730.00 std.dev.: 0.00, Output: 0 rows (0B)
        Output layout: [pkey, column1, column3]
        Output partitioning: SINGLE []
        Stage Execution Strategy: UNGROUPED_EXECUTION
        ScanFilterProject[table = postgresql:schema.table schema.table constraint on [pkey] columns=[pkey:varchar:varchar, column1:varchar:varchar, column2:varchar:varchar, column3:varchar:varchar], grouped = false, filterPredicate = (("column2" <> CAST('somevalue1' AS varchar)) AND ("pkey" IN (CAST('XXXXXXX' AS varchar), CAST('XXXXXXX' AS varchar), ... 33 placeholder values in total ..., CAST('XXXXXXX' AS varchar))))]
            Layout: [pkey:varchar, column1:varchar, column3:varchar]
            Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}/{rows: ? (?), cpu: ?, memory: 0B, network: 0B}
            CPU: 10.44s (100.00%), Scheduled: 48.98s (100.00%), Output: 0 rows (0B)
            Input avg.: 8038730.00 rows, Input std.dev.: 0.00%
            column2 := column2:varchar:varchar
            column3 := column3:varchar:varchar
            column1 := column1:varchar:varchar
            pkey := pkey:varchar:varchar
            Input: 8038730 rows (0B), Filtered: 100.00%

BTW, I'm using Trino 358.

findepi commented 3 years ago

CPU: 10.44s is reasonable for evaluating the filter on 8M rows. The second query needs more execution time (49s?) because, with more than 32 values, the domain filter is not eliminated and is not fully pushed down into PostgreSQL -- we read many rows from the remote system.
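
As an illustration of what that means (hypothetical values, not taken from the plans above): above the threshold the discrete values are compacted into a covering range before pushdown, so the remote query matches many more rows and Trino re-applies the exact filter itself.

    -- Illustrative sketch only; values are hypothetical.
    -- With <= 32 values, the exact predicate reaches PostgreSQL:
    SELECT pkey FROM schema.table
    WHERE pkey IN ('AB0000000001', 'AB0000000002');

    -- With more values than the threshold, the pushed-down predicate is
    -- compacted to something like a covering range, so PostgreSQL returns
    -- extra rows (8M here) that Trino must filter with the original IN list:
    SELECT pkey FROM schema.table
    WHERE pkey >= 'AB0000000001' AND pkey <= 'AB0000000033';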

Can you please rerun your second query with domain-compaction-threshold=50 in postgresql.properties?
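
For reference, a sketch of where that setting lives; the catalog file is typically etc/catalog/postgresql.properties, and the connection details below are placeholders:

    connector.name=postgresql
    connection-url=jdbc:postgresql://example.net:5432/mydb
    connection-user=...
    connection-password=...
    domain-compaction-threshold=50

If I recall correctly, JDBC-based connectors also expose this as a catalog session property (SET SESSION postgresql.domain_compaction_threshold = 50), which avoids a config change and restart.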

Also, initially you complained about planning time. What is the planning time for both queries?

ByPatrick commented 3 years ago

@findepi Sorry, my bad, I misread the original post's point; now I'm clear. My query is slow in scheduling, not planning. Both queries' planning time is 200~300ms, which is fast.

And setting domain-compaction-threshold=50 makes the 33-item query perform like the 32-item one. It's perfect, thank you very much!