Open vburenin opened 11 months ago
Predicate pushdown generally requires that the partition column be used directly in the comparison, not inside a function. Try changing your query to something like this:
WHERE day_partition >= CAST(date_add('day', -30, CURRENT_DATE) AS varchar)
Predicate pushdown generally requires that the partition column be used directly in the comparison, not inside a function.
For this case, there's an optimization that unwraps the column reference from inside the date(...)
expression, which is really just syntactic sugar for a cast from varchar to date. See https://trino.io/blog/2019/05/21/optimizing-the-casts-away.html.
This issue appears like a regression as it worked in Trino 419. While changing the SQL query is not a big of an issue, changing a few thousands of them scattered around userbase might pose a serious challenge.
@vburenin, are you able to pinpoint exactly which version introduced the problem?
@martint Unfortunately no. I jumped from 419 to 426 and spotted the problem only yesterdays when I added my own custom metadata handling into hudi connector.
For this case, there's an optimization that unwraps the column reference from inside the
date(...)
expression, which is really just syntactic sugar for a cast from varchar to date. See https://trino.io/blog/2019/05/21/optimizing-the-casts-away.html.
The blog post is awesome. Note that it predates the optimization that you're observing at work. It was implemented in https://github.com/trinodb/trino/pull/13567, so Trino 394. I am not aware of any changes in this area between 419 and 426.
when I added my own custom metadata handling into hudi connector.
@vburenin i think we can consider the issue actionable, if there is a problem with vanilla Trino, without any custom modifications.
Are you seeing a difference on vanilla Trino Iceberg connector between version 419 and 426?
If so, can you please share output of EXPLAIN SELECT .. FROM your_table DATE (day_partition) >= Date_add('day', - 30, CURRENT_DATE)
from both these versions?
I am planning to poke around to pinpoint a version during the day today CT. I wish I was familiar with Trino code base to debug it deeper.
On Wed, Sep 20, 2023 at 3:35 AM Piotr Findeisen @.***> wrote:
For this case, there's an optimization that unwraps the column reference from inside the date(...) expression, which is really just syntactic sugar for a cast from varchar to date. See https://trino.io/blog/2019/05/21/optimizing-the-casts-away.html.
The blog post is awesome. Note that it predates the optimization that you're observing at work. It was implemented in #13567 https://github.com/trinodb/trino/pull/13567, so Trino 394. I am not aware of any changes in this area between 419 and 426.
when I added my own custom metadata handling into hudi connector.
@vburenin https://github.com/vburenin i think we can consider the issue actionable, if there is a problem with vanilla Trino, without any custom modifications. Are you seeing a difference on vanilla Trino Iceberg connector between version 419 and 426? If so, can you please share output of EXPLAIN SELECT .. FROM your_table DATE (day_partition) >= Date_add('day', - 30, CURRENT_DATE) from both these versions?
— Reply to this email directly, view it on GitHub https://github.com/trinodb/trino/issues/19097#issuecomment-1727238888, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABBGHKZ7V3GDWMSUBWBK263X3KTGJANCNFSM6AAAAAA462VSBM . You are receiving this because you were mentioned.Message ID: @.***>
I am certainly mistaken on the version it is introduced, I have dug out a symptom in a combination of hive(to read hudi, i had to mode hive connector)+google sheets connector that lead me to the reproduction of the issue in iceberg, but that is apparently not the root cause. I will update the bug description once I narrow my search down more.
Here is more info from the SQL explain that lead me to believe it is a problem of the predicated:
explain SELECT object_id FROM hudi_data.vburenin_dev.test_hudi WHERE date(day_partition) >= date_add('day', - 30, CURRENT_DATE) group by 1;
On 419 with hive connector and tiny mod to read hudi table:
│ day_partition := day_partition:string:PARTITION_KEY >
│ :: [[2023-08-21], [2023-08-22], [2023-08-23], [2023-08-24], [2023-08-25], [2023-08-26], [2023-08-27], [2023-08-28], [2023-08-29], >
│ created_at := created_at:bigint:REGULAR
On 426:
│ day_partition := day_partition:string:PARTITION_KEY
@vburenin yes this looks like lack of partition pruning for day_partition
in 426.
unfortunately, i don't know much about Hudi. Are you able to repro the problem with Iceberg, Hive or Delta? Then perhaps I might be able to help
if not, let me cc someone more knowledgable about Hudi -- @codope
@findepi on the exactly the same partitioned table, basically a hudi table copy into Iceberg EXPLAIN doesn't look promising:
day_partition := 325:day_partition:varchar >
:: [(<min>, 1), (2022, <max>)] >
Tested on 419 and 426. It appears like Hive connector is better equipped with partition pruning.
UPDATED: I can't use Hive connector anymore to query hudi tables on Trino 426
Updated problem statement.
It appears that there is a big difference in how hive, hudi and iceberg optimizers work. Running this query on the same table represented as hudi table and iceberg table gives very different optimizer output. Tables are partitioned by day_partition field that has a format 2023-09-15.
On 419 with hive connector and tiny mod to read hudi tables:
On 426 with Hudi connector:
On Iceberg for 419 and 426:
This particular behavior greatly affects the cost at the large scale due to more of LIST operations for Hudi and more of GetObject calls.
----------------------------------------------------
Original problem statement: I have a bunch of hudi and iceberg tables partitioned by varchar day_partition field in the following format: 2023-09-19 When I supply this WHERE condition: DATE (day_partition) >= Date_add('day', - 30, CURRENT_DATE) It appears that the filtering predicate is calculated incorrectly as it looks like this:
day_partition := 21:day_partition:varchar :: [(, 1), (2022, )]
The issue happened after Trino 419, I have run my tests on Trino 426.
Initially I spotted it when I noticed that Hudi connector is looking up too many partitions even though they are way out of scope.