trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.3k stars 2.97k forks source link

Incorrect aggregation pushdown for case insensitive columns in JDBC some connectors #7320

Closed kokosing closed 3 years ago

kokosing commented 3 years ago

In MySql:

trino:tpch> select c from x;
 c
---
 a
 A
 b
 B
(4 rows)

trino:tpch> select distinct c from x;
 c
---
 a
 b
(2 rows)

Possibly affects SQL Server and MemSql too.

kokosing commented 3 years ago

Another example

trino:tpch> select count(*) from x group by c;
 _col0
-------
     2
     2
(2 rows)

trino:tpch> select count(*) from x where rand() != 52 group by c;
 _col0
-------
     1
     1
     1
     1
(4 rows)
hashhar commented 3 years ago

@kokosing @wendigo This is a bit challenging to do correctly.

The easiest case is if the grouping sets contain a text type then the connector can decide to prevent pushdown. This takes care of things like group by or distinct.

For cases with a global aggregation (e.g. max(c) or count(c)) we might need to add something to AggregateFunction to identify if the function is case-sensitive or not. e.g. max is case-sensitive but count is not.

I've implemented a fix for the grouping sets case but for global aggregation I'm not sure about how we want to tackle this. It feels a bit "dirty" to add the concept of case-sensitivity vs not to engine representation of functions.

Note that I can blanket disallow aggregation pushdown if the input to an aggregation function is a text type but that looks more heavy handed than necessary? Or should we tackle it as a follow-up?

Another alternative is to add this logic to the aggregation function rewrite rules individually (e.g. ImplementMinMax)

kokosing commented 3 years ago

if the function is case-sensitive or not.

I was under impression it is about argument type. Can make a condition on information is type is case sensitive or not? Is io.trino.plugin.jdbc.JdbcTypeHandle#caseSensitivity hold an information here?

hashhar commented 3 years ago

Argument type alone doesn't mean pushdown will result in incorrect results.

e.g. SELECT count(varchar) is safe to pushdown while SELECT max(varchar) is not.