redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

iceberg: add merge append action #23497

Closed andrwng closed 18 hours ago

andrwng commented 2 days ago

Implements a new merge_append action that adds a given group of data files to the table, merging the resulting manifest with existing manifests when the table has accumulated too many manifests.

This resembles the implementation in the pyiceberg library[1]: we bin-pack manifest_files into bins with a target size of 8MiB and merge every bin except the latest one. The latest bin is left alone if it contains fewer than 100 manifests; otherwise it is merged as well. Both thresholds are hard-coded defaults in this commit and will be made configurable in the future.
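To make the merge rule above concrete, here is a minimal Python sketch. It is not the code in this PR, nor pyiceberg's packer: `Manifest`, `pack_bins`, and `plan_merges` are hypothetical names, the packing is a simple sequential greedy pass, and it assumes manifests are ordered oldest to newest so that the last bin is the "latest" one. Leaving single-manifest bins untouched is also an assumption, made because merging one manifest would be a no-op.

```python
from dataclasses import dataclass
from typing import List, Tuple

TARGET_BIN_BYTES = 8 * 1024 * 1024  # 8MiB target bin size, per the PR description
MIN_COUNT_TO_MERGE = 100            # latest-bin threshold, per the PR description


@dataclass(frozen=True)
class Manifest:
    # Hypothetical stand-in for an Iceberg manifest_file entry.
    path: str
    size_bytes: int


def pack_bins(manifests: List[Manifest],
              target: int = TARGET_BIN_BYTES) -> List[List[Manifest]]:
    """Greedy sequential packing: open a new bin when the next
    manifest would push the current bin past the target size."""
    bins: List[List[Manifest]] = []
    current: List[Manifest] = []
    current_size = 0
    for m in manifests:
        if current and current_size + m.size_bytes > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(m)
        current_size += m.size_bytes
    if current:
        bins.append(current)
    return bins


def plan_merges(manifests: List[Manifest],
                target: int = TARGET_BIN_BYTES,
                min_count: int = MIN_COUNT_TO_MERGE,
                ) -> Tuple[List[List[Manifest]], List[List[Manifest]]]:
    """Return (bins_to_merge, bins_to_keep).

    Assumes `manifests` is ordered oldest to newest, so the last
    bin is the latest one.
    """
    bins = pack_bins(manifests, target)
    to_merge: List[List[Manifest]] = []
    to_keep: List[List[Manifest]] = []
    for i, b in enumerate(bins):
        is_latest = i == len(bins) - 1
        if len(b) == 1:
            to_keep.append(b)   # assumption: a lone manifest needs no merge
        elif is_latest and len(b) < min_count:
            to_keep.append(b)   # latest bin is too small to be worth merging
        else:
            to_merge.append(b)  # every other bin is merged
    return to_merge, to_keep
```

With four 3MiB manifests, this sketch packs two bins of two manifests each; only the older bin is scheduled for merging, because the latest bin holds fewer than 100 manifests.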

Unlike the Python implementation, which groups manifests by partition spec before merging, this implementation throws if there is more than one partition spec. Supporting multiple partition specs is left as short-term future work.
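The difference between the two behaviours can be sketched as follows. This is an illustration only: `group_by_spec` and `require_single_spec` are hypothetical helpers, and manifests are modelled as plain `(path, partition_spec_id)` tuples rather than real Iceberg structures.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical model: (manifest path, partition spec id).
ManifestRef = Tuple[str, int]


def group_by_spec(manifests: Iterable[ManifestRef]) -> Dict[int, List[str]]:
    """Group manifest paths by partition spec id, so that each
    group could be bin-packed and merged independently."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for path, spec_id in manifests:
        groups[spec_id].append(path)
    return dict(groups)


def require_single_spec(manifests: Iterable[ManifestRef]) -> Dict[int, List[str]]:
    """Stricter variant described in this PR: refuse to proceed when
    the manifests span more than one partition spec."""
    groups = group_by_spec(manifests)
    if len(groups) > 1:
        raise ValueError(
            f"expected one partition spec, found {sorted(groups)}")
    return groups
```

Grouping keeps each merge within a single spec; the stricter check trades that flexibility for a simpler first implementation.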

[1] https://github.com/apache/iceberg-python/blob/e5a58b34dd830c6ffea11649613b693f70f7cbb4/pyiceberg/table/update/snapshot.py#L475

vbotbuildovich commented 1 day ago

new failures in https://buildkite.com/redpanda/redpanda/builds/55286#01923283-53ab-4e3d-acab-0d32016940bf:

"rptest.tests.controller_log_limiting_test.ControllerLogLimitMirrorMakerTests.test_mirror_maker_with_limits"

new failures in https://buildkite.com/redpanda/redpanda/builds/55286#01923400-567f-4473-b6c7-3208e5725f23:

"rptest.tests.partition_force_reconfiguration_test.PartitionForceReconfigurationTest.test_basic_reconfiguration.acks=-1.restart=False.controller_snapshots=False"