This PR allows data files to be deduplicated by adding two pieces of metadata to the system:
- Each data file pending commit in the coordinator state machine now tracks the offset at which it was added to the state machine.
- Each Iceberg snapshot now includes a property recording the highest coordinator partition offset among the data files committed in that snapshot.
With these two pieces of metadata, the Iceberg file committer can skip any pending files that have already been committed to the Iceberg table.
This is very similar to the deduplication described by the Tabular connector[1] (which is itself very similar to the Apache Iceberg connector[2]). The difference is that those connectors store table metadata across multiple partitions of the control topic and therefore store multiple offsets, whereas this PR stores a single offset because each table is managed by a single partition.
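The skip logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `pending_file`, `files_to_commit`, and `highest_offset` are hypothetical names, and the sketch assumes a file counts as already committed when its offset is less than or equal to the offset recorded in the latest snapshot.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a data file pending commit in the
// coordinator state machine.
struct pending_file {
    std::string path;
    // Coordinator partition offset at which the file was added.
    int64_t added_offset;
};

// Returns the pending files that still need to be added to the table.
// `committed_offset` models the snapshot property; std::nullopt means no
// snapshot carries the property yet, so nothing can be skipped.
// Assumption: offsets at or below `committed_offset` are already committed.
std::vector<pending_file> files_to_commit(
  const std::vector<pending_file>& pending,
  std::optional<int64_t> committed_offset) {
    std::vector<pending_file> out;
    for (const auto& f : pending) {
        if (committed_offset && f.added_offset <= *committed_offset) {
            continue; // already in the table, skip to avoid a duplicate
        }
        out.push_back(f);
    }
    return out;
}

// Computes the value to record in the new snapshot's property: the
// highest offset among the files being committed, if there are any.
std::optional<int64_t> highest_offset(const std::vector<pending_file>& files) {
    std::optional<int64_t> highest;
    for (const auto& f : files) {
        highest = highest ? std::max(*highest, f.added_offset)
                          : f.added_offset;
    }
    return highest;
}
```

In this sketch, a committer that retries after a crash would read the property from the latest snapshot, pass it as `committed_offset`, and only append the files that come back, so a file that made it into an earlier snapshot is not added twice.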
[1] https://github.com/databricks/iceberg-kafka-connect/blob/066bb7993de1d2e8edfd6302d16cb50e52df5f19/docs/design.md#delivery-semantics
[2] https://github.com/apache/iceberg/tree/main/kafka-connect
Backports Required
Release Notes
none
the below tests from https://buildkite.com/redpanda/redpanda/builds/58002#01932493-2b81-4f23-9a63-639880e3caf0 have failed and will be retried