memiiso / debezium-server-iceberg

Replicates any database (CDC events) to Apache Iceberg (To Cloud Storage)
Apache License 2.0

Support (or document) Azure Storage as sink #222

Open karlschriek opened 11 months ago

karlschriek commented 11 months ago

I am trying to set up a very simple process to stream CDC records from (Azure) SQL Server as Iceberg tables to an Azure Storage Account. I've come across various potential solutions for this, most of which involve chaining several tools together (and using Event Hub at some point).

I would like to be able to go [SQL Server] ----cdc message----> [Debezium Server] ----iceberg table----> [Azure Blob Storage] instead. Is this possible as of today? If so, could we document it somewhere? If not, could we support this?

ismailsimsek commented 11 months ago

Hi @karlschriek, yes, this should be possible (supported) with the current release. It should be a matter of configuring Hadoop FileIO to write to Azure Blob Storage and configuring the Iceberg catalog to use an Azure Hive server.

karlschriek commented 11 months ago

Ok, that sounds promising. Are there any docs anywhere on how to do something like that? Right now this is the only example config I am able to reference, which is very S3-specific:

https://github.com/memiiso/debezium-server-iceberg/blob/3f0649ae880e9bedd2bdff9e43ca5601bda3da0d/debezium-server-iceberg-sink/src/main/resources/conf/application.properties.example

karlschriek commented 11 months ago

Hmm, as far as I can see there are currently two unmerged PRs open that would add ADLS as a FileIO, so it doesn't look like it is actually supported right now:

ismailsimsek commented 11 months ago

It is supported with Hadoop FileIO. I believe these PRs are adding a more direct Azure Storage integration (without the Hadoop libraries).

Currently, HadoopFileIO is used to talk to azure blob storage.

ghost commented 10 months ago

@karlschriek Have you been able to get this up and running with Azure Blob Storage?

@ismailsimsek can you point me to some documentation to help me to get this working on Azure Blob?

ismailsimsek commented 10 months ago

Could you try the options listed at https://learn.microsoft.com/en-us/azure/databricks/storage/azure-storage, adding `debezium.sink.iceberg.` as a prefix to each one?

It will also require the hadoop-azure library, if it's not already included: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-azure/3.3.6
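As a rough, untested sketch of what that suggestion could look like in `application.properties`: the fragment below takes the Hadoop ABFS account-key option and adds the `debezium.sink.iceberg.` prefix to it. The catalog lines (`type`, `warehouse`) are assumptions modeled on the S3 example config linked earlier in this thread, not a confirmed Azure setup, and `<storage-account>`, `<container>`, and `<access-key>` are placeholders to fill in.

```properties
# Untested sketch. <storage-account>, <container>, <access-key> are placeholders.
debezium.sink.type=iceberg

# Assumed catalog settings, modeled on the S3 example config (Hadoop catalog).
debezium.sink.iceberg.type=hadoop
debezium.sink.iceberg.warehouse=abfss://<container>@<storage-account>.dfs.core.windows.net/iceberg_warehouse

# Hadoop ABFS account-key option with the debezium.sink.iceberg. prefix added.
debezium.sink.iceberg.fs.azure.account.key.<storage-account>.dfs.core.windows.net=<access-key>
```

This only has a chance of working if the hadoop-azure jar is on the Debezium Server classpath. Other authentication methods (OAuth, SAS) use different `fs.azure.*` options and would presumably be prefixed the same way.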

ismailsimsek commented 10 months ago

related to https://github.com/apache/iceberg/issues/8662

ghost commented 10 months ago

Thanks, will give this a try once I have my Docker setup with SQL Server up and running.

ismailsimsek commented 2 months ago

Leaving an example here: https://github.com/tabular-io/iceberg-kafka-connect?tab=readme-ov-file#azure-adls-configuration-example