jackye1995 opened 1 year ago
cc @bitsondatadev @alexjo2144 @findepi @ebyhr @JunhyungSong @pettyjamesm @electrum
This issue is currently expressed in terms of similarity to the Spark implementation. Note that Trino and Spark do not share a common execution engine and therefore may have different optimization capabilities or goals. "Similarity to Spark" is simply not a metric we optimize our design for.
Can you reformulate the issue in a way emphasizing how Trino users would benefit from the proposed changes?
You're right that the design should consider Trino implementation details, but the user experience can still be compared in terms of feature parity with Spark. Since Trino now supports a large share of the batch workloads that Spark can handle, achieving as much feature parity as possible will provide a better migration experience.
That said, it's clear that not everything can be translated, as they are fundamentally different engines. I do agree we need a more Trino-specific coloring of the plan, but starting from "this is what Spark does" isn't a bad way to begin the conversation.
@bitsondatadev "feature parity with Spark" has never been a Trino goal. For example, we concluded that we're not going to support EXPLODE
or INSERT OVERWRITE
syntaxes even though Spark has them.
"Spark has it" is not enough rationale to add something to Trino (I guess vice-versa is also true). We want to add stuff that matters for our users. I think "user experience" is the key -- so let's redefine the issue in "user experience terms".
I agree that "Spark has it" is not enough rationale. But that doesn't mean we can't model off of their UX if it meets Trino users' needs. It can be a basis to discuss whether a feature is worth adding, and then we'll dive into the how.
> But that doesn't mean we can't model off of their UX if it meets Trino users' needs.
Agreed. But I still don't know what problem we're trying to solve, so I cannot evaluate whether Spark's approach fits Trino users' needs. I don't see a need to continue discussing whether we need a clear issue description. I will leave it up to @jackye1995 to provide one.
I updated the title to target specific features. I think at least exposing configs similar to `min-input-files` and `delete-file-threshold` would benefit users.
It supports a usage pattern where users specify the compaction requirements for a table once, and then just keep running OPTIMIZE without needing to check the preconditions beforehand.
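For illustration only, a rough sketch of what this could look like at the SQL level; the `min_input_files` and `delete_file_threshold` parameters below are hypothetical and do not exist in Trino today (only `file_size_threshold` does), and the table name is just a placeholder:

```sql
-- Hypothetical parameters mirroring Spark's min-input-files / delete-file-threshold;
-- only file_size_threshold exists in Trino's optimize procedure today.
ALTER TABLE iceberg.analytics.events EXECUTE optimize(
    file_size_threshold => '128MB',   -- existing parameter
    min_input_files => 5,             -- hypothetical: skip compaction unless there are >= 5 small files
    delete_file_threshold => 2        -- hypothetical: rewrite files with >= 2 associated delete files
);
```

With something like this, a scheduled job could run OPTIMIZE on a fixed cadence and let the connector decide whether any work is actually needed.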
As far as `delete-file-threshold` goes (found the details of what that does in the code comment here), if I remember correctly we went with a slightly different approach to compaction in Trino. The `optimize` procedure should be compacting any data files that have any deletes associated with them, so effectively that `delete-file-threshold` value is hard-coded to 1.
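For context, the current invocation exposes only the file size cutoff; per the behavior described above, data files with associated deletes are rewritten regardless of size:

```sql
-- Current Trino syntax: compacts data files smaller than the threshold and
-- (per the behavior described above) any data files with associated delete files.
ALTER TABLE iceberg.analytics.events EXECUTE optimize(file_size_threshold => '128MB');
```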
Configurations in the Spark `RewriteDataFiles` procedure could be supported in Trino, like `min-input-files`, `delete-file-threshold`, etc. https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/actions/BinPackStrategy.java#L58
This could be done by checking the Iceberg snapshot summary to determine whether compaction is needed, and adding more logic to decide which file scan tasks to include based on size boundaries.
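If the check were based on snapshot summary metadata, the relevant counters are already visible to users through the `$snapshots` metadata table. A sketch of the kind of precondition involved, assuming the standard `total-data-files` / `total-delete-files` summary keys (table name is a placeholder):

```sql
-- Inspect the latest snapshot's summary to gauge whether compaction is worthwhile.
-- element_at returns NULL if a key is absent from the summary map.
SELECT
    snapshot_id,
    element_at(summary, 'total-data-files')   AS total_data_files,
    element_at(summary, 'total-delete-files') AS total_delete_files
FROM iceberg.analytics."events$snapshots"
ORDER BY committed_at DESC
LIMIT 1;
```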