memiiso / debezium-server-iceberg

Replicates any database (CDC events) to Apache Iceberg (To Cloud Storage)
Apache License 2.0

How to push Iceberg data to Azure Blob Storage? #325

Closed: GOVINDARAMTEKKAR97 closed this issue 5 months ago

GOVINDARAMTEKKAR97 commented 6 months ago

Hi @ismailsimsek, hope you are in good health. I am able to push data and metadata (Parquet and JSON files) to Amazon S3 storage; it is working well and pushing data to S3 buckets. Now I want to push data to Azure Blob Storage. Can you please guide me on which catalog and parameters I should use? It will help me a lot.

I have attached my application.properties below for your reference.

# postgres
debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
debezium.source.database.hostname=localhost
debezium.source.database.port=5432
debezium.source.database.user=postgres
debezium.source.database.password=root@123
debezium.source.database.dbname=dremio
debezium.source.topic.prefix=tutorial
debezium.source.schema.include.list=public

ENABLE_DEBEZIUM_SCRIPTING=true

debezium.sink.type=iceberg
# icebergevents
# iceberg

# Iceberg sink config
debezium.sink.type=iceberg
debezium.sink.iceberg.table-prefix=debeziumcdc_
debezium.sink.iceberg.upsert=true
debezium.sink.iceberg.upsert-keep-deletes=false
debezium.sink.iceberg.write.format.default=parquet
debezium.sink.iceberg.catalog-name=default

# enable event schemas - mandatory
debezium.format.value.schemas.enable=true
debezium.format.key.schemas.enable=true
debezium.format.value=json
debezium.format.key=json

# SET LOG LEVELS
quarkus.log.level=INFO
quarkus.log.console.json=false

# hadoop, parquet
quarkus.log.category."org.apache.hadoop".level=WARN
quarkus.log.category."org.apache.parquet".level=WARN

# Ignore messages below warning level from Jetty, because it's a bit verbose
quarkus.log.category."org.eclipse.jetty".level=WARN

# see https://debezium.io/documentation/reference/stable/development/engine.html#advanced-consuming
debezium.source.offset.storage=io.debezium.server.iceberg.offset.IcebergOffsetBackingStore
debezium.source.offset.storage.iceberg.table-name=debezium_offset_storage_custom_table

# see https://debezium.io/documentation/reference/stable/development/engine.html#database-history-properties
debezium.source.schema.history.internal=io.debezium.server.iceberg.history.IcebergSchemaHistory
debezium.source.schema.history.internal.iceberg.table-name=debezium_database_history_storage_test

# enable event schemas
debezium.format.value.schemas.enable=true
debezium.format.value=json

# complex nested data types are not supported, do event flattening; unwrap the message!
debezium.transforms=unwrap
debezium.transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
debezium.transforms.unwrap.add.fields=op,table,source.ts_ms,db
debezium.transforms.unwrap.delete.handling.mode=rewrite
debezium.transforms.unwrap.drop.tombstones=true

##################
debezium.sink.batch.batch-size-wait=MaxBatchSizeWait
debezium.sink.batch.batch-size-wait.max-wait-ms=180000
debezium.sink.batch.batch-size-wait.wait-interval-ms=120000
debezium.sink.batch.metrics.snapshot-mbean=debezium.postgres:type=connector-metrics,context=snapshot,server=testc
debezium.sink.batch.metrics.streaming-mbean=debezium.postgres:type=connector-metrics,context=streaming,server=testc

# increase max.batch.size to receive a large number of events per batch
debezium.source.max.batch.size=15
debezium.source.max.queue.size=45

# S3 config without hadoop catalog, using GlueCatalog and S3FileIO
debezium.sink.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO
debezium.sink.iceberg.s3.access-key-id=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
debezium.sink.iceberg.s3.secret-access-key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
debezium.sink.iceberg.warehouse=s3://dremio-califonia/iceberg_warehouse
debezium.sink.iceberg.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog

Please give me a solution or point me in the right direction so I can accomplish pushing data to Azure Blob Storage.

Regards, yours faithfully, Govinda Ramtekkar

ismailsimsek commented 6 months ago

@GOVINDARAMTEKKAR97 if you configure the Azure FileIO (ADLSFileIO) then you can push data to Azure.
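For illustration, a minimal sketch of what that could look like, assuming ADLS Gen2 with placeholder container and account names (untested):

# Azure ADLS Gen2 via Iceberg's ADLSFileIO (a sketch; values are placeholders)
debezium.sink.iceberg.io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO
debezium.sink.iceberg.warehouse=abfss://<container>@<account>.dfs.core.windows.net/warehouse
# With no explicit credentials configured, ADLSFileIO should fall back to Azure's
# DefaultAzureCredential, which reads AZURE_CLIENT_ID, AZURE_TENANT_ID and
# AZURE_CLIENT_SECRET from the environment.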

ismailsimsek commented 6 months ago

See the previous issue #222.

You can find an example here: https://github.com/tabular-io/iceberg-kafka-connect?tab=readme-ov-file#azure-adls-configuration-example

GOVINDARAMTEKKAR97 commented 6 months ago

Hi @ismailsimsek, I am facing many errors: a REST catalog one, a Hive metastore one, and others. One of them I mention below:

at io.debezium.server.Main.main(Main.java:15)
Caused by: java.lang.IllegalArgumentException: Cannot initialize Catalog implementation rest: Cannot find constructor for interface org.apache.iceberg.catalog.Catalog
Missing rest [java.lang.ClassNotFoundException: rest]

Could you give me the application.properties parameters for Azure here, analogous to the AWS ones mentioned below?

debezium.sink.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO
debezium.sink.iceberg.s3.access-key-id=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
debezium.sink.iceberg.s3.secret-access-key=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
debezium.sink.iceberg.warehouse=s3://dremio-califonia/iceberg_warehouse
debezium.sink.iceberg.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog

That would be very helpful for me. Or you could update application.properties.examples; that would be good for other people also. Adding parameters for Azure Blob Storage would help a lot.

ismailsimsek commented 6 months ago

@GOVINDARAMTEKKAR97 I recommend trying it with the JDBC catalog; see the example below.

For Azure you can see an example config here.

For the REST catalog you can see an example config here.
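A note on the "ClassNotFoundException: rest" error above: it appears to happen when catalog-impl, which expects a fully qualified class name, is given the short name rest, which belongs with catalog.type instead. A sketch of the two alternatives (pick one, not both; the URI is a placeholder and the property prefix follows the convention used elsewhere in this thread):

# either the shorthand catalog type...
debezium.sink.iceberg.catalog.type=rest
# ...or the fully qualified implementation class
# debezium.sink.iceberg.catalog-impl=org.apache.iceberg.rest.RESTCatalog
debezium.sink.iceberg.catalog.uri=https://<catalog-host>:8181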

JDBC example: https://github.com/memiiso/debezium-server-iceberg/blob/ca67cf4b0a5f3b72a71dbbebbcfd07b8a2b59155/examples/conf/application.properties#L10-L22
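Adapted from that example, a JDBC catalog setup might look like this sketch (untested; the database URL, credentials, and warehouse path are placeholders, and jdbc.user/jdbc.password are the Iceberg JDBC catalog's property names, passed through the debezium.sink.iceberg.catalog. prefix as used elsewhere in this thread):

debezium.sink.iceberg.catalog.type=jdbc
debezium.sink.iceberg.catalog.uri=jdbc:postgresql://<host>:5432/<catalog-db>
debezium.sink.iceberg.catalog.jdbc.user=<user>
debezium.sink.iceberg.catalog.jdbc.password=<password>
debezium.sink.iceberg.warehouse=abfss://<container>@<account>.dfs.core.windows.net/warehouse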

GOVINDARAMTEKKAR97 commented 6 months ago

Thanks @ismailsimsek. I already tried the Azure example config linked above, but I am still facing issues.

I have attached a screenshot of one of the errors.

ismailsimsek commented 6 months ago

could you share your config?

GOVINDARAMTEKKAR97 commented 6 months ago

debezium.sink.iceberg.catalog.type=jdbc
debezium.sink.iceberg.warehouse=abfss://iceberg@icebergasl.dfs.core.windows.net/warehouse
debezium.sink.iceberg.catalog.uri=jdbc:postgresql://localhost:5432/dremio
debezium.sink.iceberg.io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO
debezium.sink.iceberg.include-credentials=true

I am doing all of this on an Amazon EC2 instance.

"iceberg.catalog.type": "rest",

"iceberg.catalog.uri": "https://catalog:8181",

"iceberg.catalog.warehouse": "abfss://storage-container-name@storageaccount.dfs.core.windows.net/warehouse",

"iceberg.catalog.io-impl": "org.apache.iceberg.azure.adlsv2.ADLSFileIO",

"iceberg.catalog.include-credentials": "true"

Hi @ismailsimsek, please check the above values.

ismailsimsek commented 6 months ago

The JDBC config looks correct, but the error message you shared is about the Hive catalog! What is the issue you are having with the JDBC catalog? Also, do you have the PostgreSQL database running on the EC2 instance (jdbc:postgresql://localhost:5432/dremio)?

GOVINDARAMTEKKAR97 commented 6 months ago

If I use the AWS configuration parameters that I already shared with you, everything works fine and I am able to store Parquet and JSON files to S3. But when I try Azure Blob Storage it gives me the error. My database is running on localhost.



GOVINDARAMTEKKAR97 commented 5 months ago

Hi @ismailsimsek, can you try it once for Azure Blob Storage and let me know which parameters to use, or what values I should change in application.properties, so that I can test again?

ismailsimsek commented 5 months ago

@GOVINDARAMTEKKAR97 please check the example https://github.com/tabular-io/iceberg-kafka-connect?tab=readme-ov-file#azure-adls-configuration-example

  1. Make sure you set the environment variables mentioned in the example.
  2. Set the warehouse, io-impl, and include-credentials properties in application.properties (see the sketch below).
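Putting both steps together, one possible end-to-end sketch (untested; all host, container, and account values are placeholders, and include-credentials is carried over from the linked example):

# 1) environment variables read by Azure's DefaultAzureCredential, e.g.:
#    AZURE_CLIENT_ID=<client-id>
#    AZURE_TENANT_ID=<tenant-id>
#    AZURE_CLIENT_SECRET=<client-secret>
# 2) application.properties:
debezium.sink.type=iceberg
debezium.sink.iceberg.catalog.type=jdbc
debezium.sink.iceberg.catalog.uri=jdbc:postgresql://<host>:5432/<catalog-db>
debezium.sink.iceberg.warehouse=abfss://<container>@<account>.dfs.core.windows.net/warehouse
debezium.sink.iceberg.io-impl=org.apache.iceberg.azure.adlsv2.ADLSFileIO
debezium.sink.iceberg.include-credentials=true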