alexjo2144 opened 2 years ago
The optimize should do this, I'm not yet convinced we need a separate procedure
I'm not sure yet either. Updated the description.
When a table’s write pattern doesn’t align with the query pattern, metadata can be rewritten to re-group data files into manifests.
Taken from Iceberg Spark Procedures Docs
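For reference, this is roughly how the procedure is invoked on the Spark side, per the Iceberg Spark procedures docs (the catalog and table names below are placeholders):

```sql
-- Rewrite the manifests of an Iceberg table to re-group data file entries
CALL spark_catalog.system.rewrite_manifests('db.sample');
```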
Here is a relatively lengthy article about Iceberg which includes the reasoning behind using rewrite_manifests:
A key metric to keep track of is the count of manifests per partition.
The health of the dataset would be tracked by how many partitions cross a pre-configured threshold of acceptable values for these metrics. The trigger for a manifest rewrite can express the severity of the unhealthiness based on these metrics.
We rewrote the manifests by shuffling data file entries across manifests based on a target manifest size. Here is a plot of one such rewrite with a target manifest size of 8 MB. Notice that any day partition spans a maximum of 4 manifests.
- Before, a partition used to span up to 300 manifests.
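As a rough way to watch the manifest count today, Trino's Iceberg connector exposes a `$manifests` metadata table; a minimal sketch (the table name is a placeholder, and true per-partition counts would additionally need the partition summaries unnested):

```sql
-- Count the manifests in the current snapshot and their average size
SELECT count(*) AS manifest_count,
       avg(length) AS avg_manifest_size_bytes
FROM "example_table$manifests";
```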
> I'm not yet convinced we need a separate procedure
In light of the above arguments, I'm inclined to say that this metadata-related functionality would need its own procedure, instead of squeezing it under OPTIMIZE.
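Purely as an illustration of what a dedicated procedure could look like if it were modeled on how `OPTIMIZE` is exposed today (the procedure name here is hypothetical, not an existing Trino command):

```sql
-- Hypothetical: a separate table procedure alongside optimize
ALTER TABLE example_table EXECUTE rewrite_manifests;
```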
Relates to: https://github.com/trinodb/trino/issues/9340
The Spark implementation is documented here.
When using the `append` operation, rewriting manifests is done automatically at a set manifest count defined by `commit.manifest.min-count-to-merge`, defaulting to 100. However, if write latency is important, a user may want to skip the automatic compaction and run it asynchronously from the writers. This may be done as a separate procedure, or as a part of the `OPTIMIZE` command.
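For context, the automatic merge behavior is controlled by Iceberg table properties; a minimal sketch of opting out of it on the writer side (Spark SQL, table name is a placeholder):

```sql
-- Turn off automatic manifest merging on commit so writers don't pay that cost;
-- manifest compaction would then be handled by an explicit, asynchronous rewrite
ALTER TABLE db.sample SET TBLPROPERTIES (
    'commit.manifest-merge.enabled' = 'false'
);
```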