projectnessie / iceberg-catalog-migrator

CLI tool to bulk migrate tables from one catalog to another without a data copy
Apache License 2.0
59 stars 13 forks

When I register tables from HADOOP to NESSIE, there is a com.amazonaws.AmazonClientException #50

Open sxh-lsc opened 1 year ago

sxh-lsc commented 1 year ago

My Hadoop warehouse is s3a://XXXXXX, and I add --source-catalog-hadoop-conf fs.s3a.access.key=$AWS_ACCESS_KEY_ID,fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY,fs.s3a.endpoint=$AWS_S3_ENDPOINT. It then fails with:

com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
        at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
        at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:604)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:960)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
        at org.apache.iceberg.hadoop.HadoopCatalog.isDirectory(HadoopCatalog.java:175)
        at org.apache.iceberg.hadoop.HadoopCatalog.isNamespace(HadoopCatalog.java:376)
        at org.apache.iceberg.hadoop.HadoopCatalog.listNamespaces(HadoopCatalog.java:306)
        at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.getAllNamespacesFromSourceCatalog(CatalogMigrator.java:202)
        at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.getMatchingTableIdentifiers(CatalogMigrator.java:97)
        at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:136)
        at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:38)
        at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
        at picocli.CommandLine.access$1500(CommandLine.java:148)
        at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2273)
        at picocli.CommandLine$RunLast.execute(CommandLine.java:2417)
        at picocli.CommandLine.execute(CommandLine.java:2170)
        at org.projectnessie.tools.catalog.migration.cli.CatalogMigrationCLI.main(CatalogMigrationCLI.java:48)
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
        at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
        at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:279)
        at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:75)
        at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:72)
        at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
        at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
        at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
        ... 23 more
Caused by: java.lang.RuntimeException: Invalid value for IsTruncated field: 
        true
        at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler.endElement(XmlResponsesSaxParser.java:647)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:610)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1718)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2883)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
        at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at java.xml/com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1216)
        at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
        ... 29 more
ajantha-bhat commented 1 year ago

Is the warehouse path for the source catalog the same as what was configured in the engine (e.g. Spark) when the tables were created with the Hadoop catalog?

I usually export the AWS credentials as environment variables:

export AWS_ACCESS_KEY_ID=xxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxx
export AWS_S3_ENDPOINT=xxxxxxx

and also configure the FileIO in the catalog properties: io-impl=org.apache.iceberg.aws.s3.S3FileIO

sxh-lsc commented 1 year ago

Yes, it is the same path as you said, but sorry, I don't quite understand what problem this would cause. All the env variables you mentioned are exported.

ajantha-bhat commented 1 year ago

> Yes, it is the same path as you said, but sorry, I don't quite understand what problem this would cause. All the env variables you mentioned are exported.

I am not sure what causes this, but I found a similar issue discussed here; it seems to be AWS-specific: https://knowledge.informatica.com/s/article/517098?language=en_US

Are you sure you have configured the FileIO in the catalog properties (io-impl=org.apache.iceberg.aws.s3.S3FileIO)?

sxh-lsc commented 1 year ago

> Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738) at

Yes, I am sure. Below is my command:

java -jar iceberg-catalog-migrator-cli-0.2.0.jar register --stacktrace \
  --source-catalog-type HADOOP \
  --source-catalog-properties warehouse=s3a://**/***/***,io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --source-catalog-hadoop-conf fs.s3a.access.key=$AWS_ACCESS_KEY_ID,fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY,fs.s3a.endpoint=$AWS_S3_ENDPOINT \
  --target-catalog-type NESSIE \
  --target-catalog-properties uri=http://l***:19120/api/v1/,ref=main,warehouse=s3a://***,io-impl=org.apache.iceberg.aws.s3.S3FileIO

The error messages look like an S3 list-objects call went wrong. I found some people use .withEncodingType("url") to fix it; maybe it is related to the AWS S3 version?
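If the endpoint's ListObjects XML is what trips up the SDK parser, one way to check outside the tool is to issue the same listing with the AWS CLI, with and without URL encoding of the response. This is only a diagnostic sketch: the bucket name is a placeholder, and the --encoding-type url flag is the CLI equivalent of the .withEncodingType("url") call mentioned above.

```shell
# Placeholders: substitute your real credentials, bucket, and endpoint.
export AWS_ACCESS_KEY_ID=xxxxxxx
export AWS_SECRET_ACCESS_KEY=xxxxxxx
export AWS_S3_ENDPOINT=xxxxxxx

# Plain ListObjects, similar to what the Hadoop S3A connector issues:
aws s3api list-objects \
  --bucket my-warehouse-bucket \
  --endpoint-url "$AWS_S3_ENDPOINT"

# Same call, asking the server to URL-encode keys in the XML response
# (the CLI counterpart of .withEncodingType("url") in the Java SDK):
aws s3api list-objects \
  --bucket my-warehouse-bucket \
  --endpoint-url "$AWS_S3_ENDPOINT" \
  --encoding-type url
```

If the second call succeeds where the first returns malformed XML, the problem is likely on the S3-compatible endpoint's side rather than in this tool.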

ajantha-bhat commented 1 year ago

Which version of Iceberg are you using? I will also try it locally.

As a workaround, you can pass the list of identifiers in the --identifiers option.
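For example, passing the identifiers explicitly avoids the namespace listing that triggers the failing S3 ListObjects call. In this sketch, db1.table1, db1.table2, the warehouse paths, and the Nessie URI are all hypothetical placeholders:

```shell
# db1.table1 and db1.table2 are hypothetical identifiers; use your own.
java -jar iceberg-catalog-migrator-cli-0.2.0.jar register --stacktrace \
  --identifiers db1.table1,db1.table2 \
  --source-catalog-type HADOOP \
  --source-catalog-properties warehouse=s3a://my-warehouse,io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --source-catalog-hadoop-conf fs.s3a.access.key=$AWS_ACCESS_KEY_ID,fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY,fs.s3a.endpoint=$AWS_S3_ENDPOINT \
  --target-catalog-type NESSIE \
  --target-catalog-properties uri=http://localhost:19120/api/v1/,ref=main,warehouse=s3a://my-warehouse,io-impl=org.apache.iceberg.aws.s3.S3FileIO
```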

sxh-lsc commented 1 year ago

> Which version of Iceberg are you using? I will also try it locally.
>
> As a workaround, you can pass the list of identifiers in the --identifiers option.

I use Iceberg v1.2.0. But it is already bundled in this tool, isn't it?