tabular-io / iceberg-kafka-connect

Apache License 2.0
203 stars · 46 forks

Iceberg exactly-once with PK #161

Closed: pixie79 closed this issue 10 months ago

pixie79 commented 10 months ago

Hi,

With the exactly-once semantics, can we deduplicate or allow updates to data?

We have a Kafka topic with a top-level field metadata containing the sub-fields updated_at and message_key. Between them, these would let us know the order in which records were created and deduplicate on the write to the Iceberg table. The message_key should in most cases also be the primary key. Is there a way to handle this?

Thanks

danielcweeks commented 10 months ago

We might be mixing up concepts here. "Exactly once semantics" generally relates to the guarantees about the delivery of messages as opposed to internal value/meaning of the data.

I think what you're probably talking about is "upsert" behavior where you can define field ids that denote a unique record: iceberg.tables.default-id-columns combined with iceberg.tables.upsert-mode-enabled might work depending on your use case.
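To illustrate, an upsert-mode connector configuration using those two properties might look roughly like this (connector name, topic, and table names are placeholders, and using the nested field metadata.message_key as an id column is an assumption from the thread, not a confirmed capability):

```properties
# Sketch of a Kafka Connect sink config; names and topics are placeholders.
name=iceberg-sink
connector.class=io.tabular.iceberg.connect.IcebergSinkConnector
topics=my-topic
iceberg.tables=mydb.mytable
# Write equality deletes + inserts instead of append-only
iceberg.tables.upsert-mode-enabled=true
# Column(s) that identify a unique record; nested-field support is an assumption
iceberg.tables.default-id-columns=metadata.message_key
```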

pixie79 commented 10 months ago

Ah, yes, upsert is probably what I am after. Do you know if that will work with an id-column of metadata.message_key? Thanks

pixie79 commented 10 months ago

Can I also ask what happens if the records from upsert are processed out of order? Are you using a specific field to define the processing order, or can we set that to a given nested field like metadata.updated_at?

bryanck commented 10 months ago

All Kafka records for a given primary key should be placed in the same topic partition to ensure strong ordering guarantees. If the data in Kafka is out of order, then you will need to land the data first as append-only, and then perform a dedupe on that data.
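As a sketch of the append-then-dedupe approach described above, the dedupe step keeps only the latest record per key, ordered by metadata.updated_at. In practice this would be a batch job (e.g. a Spark or Trino MERGE) over the append-only Iceberg table; the plain-Python function below just shows the logic, with field names taken from the thread:

```python
# Sketch: keep the latest record per primary key, ordered by updated_at.
# In production this dedupe would run as a batch job over the append-only
# Iceberg table; plain Python here only illustrates the logic.

def dedupe_latest(records):
    """Return one record per metadata.message_key, keeping the max updated_at."""
    latest = {}
    for rec in records:
        key = rec["metadata"]["message_key"]
        ts = rec["metadata"]["updated_at"]
        # Keep this record if it's the first seen for the key, or newer.
        if key not in latest or ts > latest[key]["metadata"]["updated_at"]:
            latest[key] = rec
    return list(latest.values())

rows = [
    {"metadata": {"message_key": "a", "updated_at": 1}, "value": "old"},
    {"metadata": {"message_key": "a", "updated_at": 2}, "value": "new"},
    {"metadata": {"message_key": "b", "updated_at": 1}, "value": "only"},
]
print(dedupe_latest(rows))  # one row per key, "a" resolves to the newer value
```

Note this only resolves ordering within the deduped batch; late-arriving updates in a later batch still need the same key-based merge against the table.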