tabular-io / iceberg-kafka-connect

Apache License 2.0

Cost Estimation #86

Closed Wuerike closed 11 months ago

Wuerike commented 11 months ago

Yeah, I know, it's impossible and it's not your job to tell me how much I'll spend (this is sort of a joke).

I need to somehow move data from my topic to S3, and using Iceberg would be awesome because of its upsert capability and its ability to partition the data for me.

So, I'm seriously considering using this connector in a very large project, but I'm worried about the costs. Since it streams the data to Iceberg, I'll be doing 50 million inserts/upserts per day through this connector.

Has anyone used it who has some experience to share with me, please?

Btw, I'll be using Iceberg through Glue/Athena on AWS.

bryanck commented 11 months ago

That doesn’t strike me as very extreme throughput; the sink should be able to handle that load with a small cluster, depending on any transforms you’re applying. Updates and deletes shouldn’t add too much overhead on the write side. For comparison, we’re using the sink to append well over 1 billion events/day, fanning out to several different tables, on a cluster with only 6 cores total.

Where you may face some challenges is not on the sink side, but with read performance as delete files accumulate: performance can degrade over time, and this is an area being actively worked on in Iceberg. You will want to compact your data regularly, which merges in the deletes and improves read performance.
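Since the thread mentions Glue/Athena, here is a sketch of what regular compaction could look like as an Athena query on an Iceberg table. The table name `my_db.events` and the partition predicate are placeholders; check the AWS Athena documentation for Iceberg tables for the exact options available to you:

```sql
-- Bin-pack compaction: rewrites small data files and merges in
-- accumulated delete files. Restricting to recent partitions keeps
-- each run cheap; adjust the predicate to your partitioning scheme.
-- Table and column names here are illustrative.
OPTIMIZE my_db.events REWRITE DATA USING BIN_PACK
WHERE event_date >= date_add('day', -1, current_date);

-- Optionally clean up afterwards: expire old snapshots and remove
-- files no longer referenced by the table metadata.
VACUUM my_db.events;
```

Running something like this on a schedule keeps delete files from piling up between compactions, which is what drives the read-side degradation described above.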

Wuerike commented 11 months ago

@bryanck thanks!

That was enough to make me confident in following this approach.