redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

datalake/coordinator: deduplicate offset ranges already in Iceberg #24111

Closed. andrwng closed this 2 days ago

andrwng commented 3 days ago

This PR allows for deduplication of data files by adding a couple of pieces of metadata to the system:

With these two pieces of metadata, the Iceberg file committer can skip over adding any files that have already been committed to the Iceberg table.

This is very similar to the deduplication described by the Tabular connector[1], which is itself very similar to the Apache Iceberg connector[2]. The difference is that the connectors store table metadata across multiple partitions of the control topic and store multiple offsets accordingly, while this PR stores only a single offset, since each table is managed by a single partition.
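
To illustrate the skip logic, here is a minimal sketch in Python rather than Redpanda's C++. It is not the code in this PR. It assumes each pending data file carries the offset range it covers and that the table records the single highest offset already committed. `PendingFile`, `files_to_commit`, and `committed_offset` are hypothetical names chosen for the example, and files are assumed not to straddle the committed offset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PendingFile:
    """A data file waiting to be committed (hypothetical shape)."""
    path: str
    start_offset: int  # first offset covered by the file, inclusive
    last_offset: int   # last offset covered by the file, inclusive


def files_to_commit(pending: List[PendingFile],
                    committed_offset: Optional[int]) -> List[PendingFile]:
    """Return only the files whose offsets are not yet in the table.

    committed_offset is the highest offset the table already records,
    or None if nothing has been committed yet (an assumption of this sketch).
    """
    if committed_offset is None:
        return list(pending)
    # Keep a file only if it ends past the committed offset.
    return [f for f in pending if f.last_offset > committed_offset]


pending = [
    PendingFile("a.parquet", 0, 99),
    PendingFile("b.parquet", 100, 199),
    PendingFile("c.parquet", 200, 299),
]

# Suppose an earlier commit reached offset 199 and the same files are retried.
remaining = files_to_commit(pending, committed_offset=199)
print([f.path for f in remaining])  # ['c.parquet']
```

Because only one offset is tracked per table, the check is a single comparison per file. A design that spreads table metadata across several partitions would need one such offset per partition.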

[1] https://github.com/databricks/iceberg-kafka-connect/blob/066bb7993de1d2e8edfd6302d16cb50e52df5f19/docs/design.md#delivery-semantics
[2] https://github.com/apache/iceberg/tree/main/kafka-connect


vbotbuildovich commented 3 days ago

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/58002#019324d3-cf8b-4026-8c4c-a44fe45b0943