tabular-io / iceberg-kafka-connect

Example to use this tool with ADLS? #94

Open ajantha-bhat opened 11 months ago

ajantha-bhat commented 11 months ago

Thanks for the configuration examples with S3.

I saw that Iceberg has ADLSFileIO and that it can be configured. Do we just need to set it in the catalog properties and also configure all the properties from AzureProperties?

I didn't find a catalog-level example or test case with ADLS even in the Iceberg repo: https://github.com/apache/iceberg/issues/8662

bryanck commented 11 months ago

Yes, you'd configure it the same way as other FileIOs. You can set catalog properties and also use the default credential provider.
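For illustration, a minimal sketch of what that could look like, written in the same style as the integration-test config further down and assuming the same imports and constants (ImmutableMap, CatalogProperties, NESSIE_CATALOG_PORT). The SAS-token property name is only a placeholder; check the exact key against org.apache.iceberg.azure.AzureProperties for your Iceberg version.

static Map<String, Object> connectorAdlsCatalogProperties() {
    return ImmutableMap.<String, Object>builder()
        // Sketch only: point the catalog at ADLSFileIO instead of the Hadoop filesystem
        .put("iceberg.catalog.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
        .put("iceberg.catalog." + CatalogProperties.URI, "http://nessie:" + NESSIE_CATALOG_PORT + "/api/v1")
        .put("iceberg.catalog." + CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.azure.adlsv2.ADLSFileIO")
        .put("iceberg.catalog." + CatalogProperties.WAREHOUSE_LOCATION, "abfss://<container>@<account>.dfs.core.windows.net/warehouse/")
        // Optional and unverified property key: a SAS token via AzureProperties; without it,
        // ADLSFileIO falls back to the default Azure credential chain (per the comments below).
        // .put("iceberg.catalog.adls.sas-token.<account>", "<sas token>")
        .build();
}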

ajantha-bhat commented 11 months ago

Hi @bryanck,

I was trying the same integration tests with ADLS + Nessie catalog.

static Map<String, Object> connectorNessieCatalogProperties() {
    return ImmutableMap.<String, Object>builder()
        // Nessie catalog behind the connector
        .put("iceberg.catalog.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
        .put("iceberg.catalog." + CatalogProperties.URI, "http://nessie:" + NESSIE_CATALOG_PORT + "/api/v1")
        // wasbs warehouse accessed through HadoopFileIO with a storage account access key
        .put("iceberg.catalog." + CatalogProperties.WAREHOUSE_LOCATION, "wasbs://<container>@<account>.blob.core.windows.net/warehouse/")
        .put("iceberg.catalog.fs.azure.account.key.<account>.blob.core.windows.net", "<account key>")
        .put("iceberg.catalog.fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
        .build();
}

As shown in the config above, I used storage account access keys with HadoopFileIO, and I got the following error:

2023-10-04 20:17:33 Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found
2023-10-04 20:17:33     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2693)
2023-10-04 20:17:33     at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3573)
2023-10-04 20:17:33     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3608)
2023-10-04 20:17:33     at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
2023-10-04 20:17:33     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3712)
2023-10-04 20:17:33     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3663)
2023-10-04 20:17:33     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:557)
2023-10-04 20:17:33     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
2023-10-04 20:17:33     at org.apache.iceberg.hadoop.Util.getFs(Util.java:56)
2023-10-04 20:17:33     at org.apache.iceberg.hadoop.HadoopInputFile.fromLocation(HadoopInputFile.java:56)
2023-10-04 20:17:33     at org.apache.iceberg.hadoop.HadoopFileIO.newInputFile(HadoopFileIO.java:90)
2023-10-04 20:17:33     at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:266)
2023-10-04 20:17:33     at org.apache.iceberg.nessie.NessieTableOperations.loadTableMetadata(NessieTableOperations.java:86)

It needs the Hadoop Azure jars, and I'm not sure supplying an extra jar is a good idea for Docker-based Kafka Connect testing.

Have you tried an access key before without supplying an extra jar? Is there any other way? I saw ADLSFileIO, but it looks like it only works with a SAS token, and there is no catalog-level example for using it.

bryanck commented 11 months ago

You can use ADLSFileIO with the default credential provider if you're not using a SAS token; see https://learn.microsoft.com/en-us/java/api/com.azure.identity.defaultazurecredential?view=azure-java-stable for more info. When using the Hadoop FS, you can try the Hive distribution, which includes the Hadoop client.

bryanck commented 11 months ago

For example, you can set the Azure environment variables and those will get picked up. I'll add an example for GCP and Azure soon.
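For reference, a rough sketch of how those environment variables could be passed to the Kafka Connect container in a Testcontainers-based test. The container variable name is hypothetical; the variable names are the ones DefaultAzureCredential's environment credential reads.

// Sketch only: "kafkaConnect" is a hypothetical GenericContainer from the test setup.
// DefaultAzureCredential picks these up through its environment credential.
kafkaConnect
    .withEnv("AZURE_TENANT_ID", "<tenant id>")
    .withEnv("AZURE_CLIENT_ID", "<client id>")
    .withEnv("AZURE_CLIENT_SECRET", "<client secret>");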

ajantha-bhat commented 11 months ago

ok. Thanks.

liko9 commented 6 months ago

in progress: https://github.com/tabular-io/iceberg-kafka-connect/pull/193