trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.48k stars 3.02k forks source link

Add the functionality of the Iceberg `rewrite_manifests` procedure (e.g. in OPTIMIZE) #14821

Open alexjo2144 opened 2 years ago

alexjo2144 commented 2 years ago

Relates to: https://github.com/trinodb/trino/issues/9340 The Spark implementation is documented here.

When using the append operation rewriting manifests is done automatically at a set size defined by commit.manifest.min-count-to-merge, defaulting to 100. However, if write latency is important, a user may want to skip the automatic compaction and run it async to the writers.

This may be done as a separate procedure, or as a part of the OPTIMIZE command

findepi commented 2 years ago

The optimize should do this, I'm not yet convince we need a separate procedure

alexjo2144 commented 2 years ago

The optimize should do this, I'm not yet convince we need a separate procedure

I'm not sure yet either. Updated the description.

findinpath commented 1 year ago

When a table’s write pattern doesn’t align with the query pattern, metadata can be rewritten to re-group data files into manifests

Taken from Iceberg Spark Procedures Docs

Here is a relative lengthy article about Iceberg which includes the reasoning behind using rewrite_manifests

https://blog.developer.adobe.com/taking-query-optimizations-to-the-next-level-with-iceberg-6c968b83cd6f

A key metric is to keep track of the count of manifests per partition.

The health of the dataset would be tracked based on how many partitions cross a pre-configured threshold of acceptable value of these metrics. The trigger for manifest rewrite can express the severity of the unhealthiness based on these metrics.

We rewrote the manifests by shuffling them across manifests based on a target manifest size. Here is a plot of one such rewrite with the same target manifest size of 8MB. Notice that any day partition spans a maximum of 4 manifests.

  • Before a partition used to span on up to 300 manifests.

I'm not yet convince we need a separate procedure

On the light of the above arguments, I'm inclined to say that this metadata related functionality would need an own procedure, instead of squeezing it under OPTIMIZE.