memiiso / debezium-server-iceberg

Replicates any database (CDC events) to Apache Iceberg (To Cloud Storage)
Apache License 2.0

AWS SDK isn't bundled with the application #303

Open prakash-42 opened 5 months ago

prakash-42 commented 5 months ago

Hi! I wasn't sure about the correct forum for my question; I hope this is the right place.

When I tried to package and run the application (following the steps in the README), I got the following error:

java.lang.NoClassDefFoundError: com/amazonaws/AmazonClientException

I think the AWS SDK isn't bundled with the application by default. Do I need to add this dependency myself (by modifying the project's pom.xml), or is there a different recommended way to get the AWS SDK libraries at runtime?

I did notice that the PR for issue #201 explicitly removes the AWS SDK, but I couldn't understand the motivation behind that. Please guide me on this, thank you!

ismailsimsek commented 5 months ago

@prakash-42 if you use org.apache.iceberg.aws.s3.S3FileIO, you don't need the AWS bundle. That's the recommended FileIO to use for AWS/S3.

Example setup: https://github.com/memiiso/debezium-server-iceberg/blob/f417423d1c338322fc599986b57bc999b81e6083/examples/conf/application.properties#L18-L22

Further details are in the Iceberg documentation.

prakash-42 commented 5 months ago

Thanks for your response, @ismailsimsek. The error went away after I switched to S3FileIO instead of org.apache.hadoop.fs.s3a.S3AFileSystem. I have, however, run into a different problem after this.

I am trying to set up this project with org.apache.iceberg.aws.glue.GlueCatalog as the catalog-impl. Here are my configuration properties:

# Iceberg sink config
debezium.sink.iceberg.table-prefix=debeziumcdc_
debezium.sink.iceberg.upsert=true
debezium.sink.iceberg.upsert-keep-deletes=true
debezium.sink.iceberg.write.format.default=parquet
debezium.sink.iceberg.catalog-name=mycatalog

# S3 config using Glue catalog And S3FileIO
debezium.sink.iceberg.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
debezium.sink.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO
debezium.sink.iceberg.warehouse=s3://poc_bucket/icebergcatalog
# debezium.sink.iceberg.type=iceberg # Gives error
debezium.sink.iceberg.catalog-type=hadoop
debezium.sink.iceberg.format-version=2

When I try to run the application, it fails on startup with the following error:

Caused by: org.apache.iceberg.exceptions.ValidationException: Invalid S3 URI, cannot determine scheme: file:/home/glue_user/workspace/spark-warehouse/debezium_offset_storage_custom_table/metadata/00000-2a2503fc-a6db-47f2-9ac9-ce21a29322cb.metadata.json
        at org.apache.iceberg.exceptions.ValidationException.check(ValidationException.java:49)
        at org.apache.iceberg.aws.s3.S3URI.<init>(S3URI.java:72)

I'm not sure what property I should set so that it creates paths starting with s3:// instead of file:/. (I thought that the debezium.sink.iceberg.warehouse property should control this, but now I'm not sure.) Can you suggest any tips for debugging this? Sorry for pestering you; I think this tool can greatly simplify our data lake's CDC process, which is why I wanted to set it up.

ismailsimsek commented 5 months ago

@prakash-42 you don't need the second line below. These two properties do the same thing: both set the catalog.

debezium.sink.iceberg.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
debezium.sink.iceberg.catalog-type=hadoop

Outside of that, the config looks correct to me.

Leaving the documentation for the AWS Iceberg integration here: https://iceberg.apache.org/docs/1.5.0/aws/#glue-catalog
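
For reference, applying the suggestion above to the configuration posted earlier would give roughly the following. This is only a sketch: every value is copied from the original post, the only change is dropping the catalog-type line, and the thread does not confirm whether this change alone resolves the file:/ path error.

```properties
# Iceberg sink config (values copied from the original post)
debezium.sink.iceberg.table-prefix=debeziumcdc_
debezium.sink.iceberg.upsert=true
debezium.sink.iceberg.upsert-keep-deletes=true
debezium.sink.iceberg.write.format.default=parquet
debezium.sink.iceberg.catalog-name=mycatalog

# S3 config using Glue catalog and S3FileIO
debezium.sink.iceberg.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
debezium.sink.iceberg.io-impl=org.apache.iceberg.aws.s3.S3FileIO
debezium.sink.iceberg.warehouse=s3://poc_bucket/icebergcatalog
# catalog-type=hadoop is removed here: per the reply above, catalog-impl already sets the catalog
debezium.sink.iceberg.format-version=2
```

The failing path in the stack trace belongs to debezium_offset_storage_custom_table, so if the error persists it may be worth checking whether the offset storage table is configured separately from the sink. That is an assumption based on the table name, not something the thread states.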