trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

Support data and delete file thresholds for Iceberg OPTIMIZE #16574

Open jackye1995 opened 1 year ago

jackye1995 commented 1 year ago

Configurations in Spark's RewriteDataFiles procedure, such as min-input-files and delete-file-threshold, could be supported in Trino.

https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java#L58

This could be done by checking the Iceberg snapshot summary to determine whether compaction is needed, and by adding logic to determine which file scan tasks to include based on size boundaries.
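For reference, here is a sketch of how these options are passed to Iceberg's rewrite procedure in Spark today (the catalog and table names are placeholders):

```sql
-- Spark SQL: Iceberg's rewrite_data_files procedure accepts these
-- options for the binpack strategy; catalog/table names below are
-- placeholders.
CALL my_catalog.system.rewrite_data_files(
  table => 'db.sample',
  strategy => 'binpack',
  options => map(
    'min-input-files', '5',        -- skip the rewrite unless at least 5 files qualify
    'delete-file-threshold', '2'   -- rewrite a data file once >= 2 delete files reference it
  )
);
```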

jackye1995 commented 1 year ago

cc @bitsondatadev @alexjo2144 @findepi @ebyhr @JunhyungSong @pettyjamesm @electrum

findepi commented 1 year ago

This issue is currently expressed in terms of similarity to the Spark implementation. Note that Trino and Spark do not share a common execution engine and therefore may have different optimization capabilities or goals. "Similarity to Spark" is simply not a metric we optimize our design for.

Can you reformulate the issue in a way that emphasizes how Trino users would benefit from the proposed changes?

bitsondatadev commented 1 year ago

You're right that the design should consider Trino implementation details, but the user experience can be compared in terms of feature parity with Spark. Since Trino now supports a large share of the batch workloads that Spark can handle, providing as much feature parity as possible will make for a better migration experience.

That said, it's clear that not everything can be translated, as they are fundamentally different engines. I do agree we need a more Trino-specific coloring of the plan, but starting from "this is what Spark does" isn't a bad way to begin the conversation.

findepi commented 1 year ago

@bitsondatadev "feature parity with Spark" has never been a Trino goal. For example, we concluded that we're not going to support EXPLODE or INSERT OVERWRITE syntaxes even though Spark has them.

"Spark has it" is not enough rationale to add something to Trino (I guess vice-versa is also true). We want to add stuff that matters for our users. I think "user experience" is the key -- so let's redefine the issue in "user experience terms".

bitsondatadev commented 1 year ago

I agree that "Spark has it" is not enough rationale. But that doesn't mean we can't model off of their UX if it meets Trino users' needs. It can be a basis for discussing whether a feature is worth adding, and then we'll dive into the how.

findepi commented 1 year ago

> But that doesn't mean we can't model off of their UX if it meets Trino users' needs.

Agreed. But I still don't know what problem we're trying to solve, so I cannot evaluate whether Spark's approach fits Trino users' needs. I don't see a need to continue the discussion about whether we need a clear issue description; I will leave it up to @jackye1995 to provide one.

jackye1995 commented 1 year ago

I updated the title to target specific features. I think, at a minimum, exposing configs similar to min-input-files and delete-file-threshold would benefit users.

This supports a usage pattern where users specify the compaction requirements for a table up front, and can then keep running OPTIMIZE on a schedule without having to check the preconditions themselves.
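To make the proposed UX concrete, here is a sketch of what this could look like in Trino. `file_size_threshold` already exists in the Iceberg connector's optimize procedure; `min_input_files` and `delete_file_threshold` are hypothetical parameter names for the configs proposed here, not anything implemented:

```sql
-- Existing Trino syntax: compact files smaller than the threshold.
ALTER TABLE iceberg.db.sample EXECUTE optimize(file_size_threshold => '128MB');

-- Hypothetical extension (parameter names are illustrative only):
-- skip the rewrite unless at least 5 files qualify, and always rewrite
-- data files referenced by 2 or more delete files.
ALTER TABLE iceberg.db.sample EXECUTE optimize(
    file_size_threshold => '128MB',
    min_input_files => 5,
    delete_file_threshold => 2);
```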

alexjo2144 commented 1 year ago

As far as delete-file-threshold goes (I found the details of what it does in the code comment here), if I remember correctly we went with a slightly different approach to compaction in Trino. The optimize procedure should be compacting any data files that have any deletes associated with them; effectively, that delete-file-threshold value is hard-coded to 1.
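In other words, if the behavior described above is right, the plain invocation below already rewrites any data file with at least one associated delete file (the table name is a placeholder):

```sql
-- Current Trino behavior as described above: besides compacting
-- undersized files, OPTIMIZE rewrites any data file that has delete
-- files associated with it, i.e. delete-file-threshold acts as if
-- hard-coded to 1.
ALTER TABLE iceberg.db.sample EXECUTE optimize;
```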