projectnessie / iceberg-catalog-migrator

CLI tool to bulk-migrate tables from one catalog to another without copying data
Apache License 2.0

Migrating from Hive to Nessie getting java.io.IOException: No FileSystem for scheme: hdfs #94

wilsonpenha commented 1 year ago

My environment: Hadoop 3.3.1, Hive 3.1.0, Iceberg 1.2.1, Spark 3.2.1, and Nessie server 0.59.0.

export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf

java -Djavax.net.ssl.trustStore=/etc/security/clientKeys/client-truststore.jks \
  -Djavax.net.ssl.trustStorePassword=password \
  -jar iceberg-catalog-migrator-cli-0.2.1-SNAPSHOT.jar register \
  --source-catalog-type HIVE \
  --source-catalog-properties warehouse=hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/,uri=thrift://hive-metastore:9083 \
  --identifiers hive_data.t_ers_event_perf,hive_data.T_KWH_MATCH_RECORD_PERF \
  --target-catalog-type NESSIE \
  --target-catalog-properties uri=http://localhost:19120/api/v1,ref=main,warehouse=hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive

$ cat logs/catalog_migration.log
2023-10-07 00:30:19,852 [main] INFO  o.apache.hadoop.hive.conf.HiveConf - Found configuration file file:/usr/lib/spark/conf/hive-site.xml
2023-10-07 00:30:20,186 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.tez.cartesian-product.enabled does not exist
2023-10-07 00:30:20,186 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.warehouse.external.dir does not exist
2023-10-07 00:30:20,186 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.heapsize does not exist
2023-10-07 00:30:20,186 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.materializedview.rewriting.incremental does not exist
2023-10-07 00:30:20,186 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.server2.webui.cors.allowed.headers does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.hook.proto.base-directory does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.load.data.owner does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.max-partitions-per-writers does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.service.metrics.codahale.reporter.classes does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.strict.managed.tables does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.ignore-absent-partitions does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.create.as.insert.only does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.server2.webui.enable.cors does not exist
2023-10-07 00:30:20,187 [main] WARN  o.apache.hadoop.hive.conf.HiveConf - HiveConf of name hive.metastore.db.type does not exist
2023-10-07 00:30:20,422 [main] WARN  o.a.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-10-07 00:30:20,440 [main] INFO  hive.metastore - Trying to connect to metastore with URI thrift://hive-metastore:9083
2023-10-07 00:30:20,498 [main] INFO  hive.metastore - Opened an SSL connection to metastore, current connections: 1
2023-10-07 00:30:20,851 [main] INFO  hive.metastore - Connected to metastore.
2023-10-07 00:30:21,116 [main] INFO  o.a.i.BaseMetastoreTableOperations - Refreshing table metadata from new version: hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/hive_data.db/T_ERS_EVENT_PERF/metadata/00003-f70dd253-791a-499e-9ebd-7a739a461960.metadata.json
2023-10-07 00:30:21,160 [main] WARN  org.apache.iceberg.util.Tasks - Retrying task after failure: Failed to get file system for path: hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/hive_data.db/T_ERS_EVENT_PERF/metadata/00003-f70dd253-791a-499e-9ebd-7a739a461960.metadata.json
org.apache.iceberg.exceptions.RuntimeIOException: Failed to get file system for path: hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/hive_data.db/T_ERS_EVENT_PERF/metadata/00003-f70dd253-791a-499e-9ebd-7a739a461960.metadata.json
    at org.apache.iceberg.hadoop.Util.getFs(Util.java:58)
    at org.apache.iceberg.hadoop.HadoopInputFile.fromLocation(HadoopInputFile.java:56)
    at org.apache.iceberg.hadoop.HadoopFileIO.newInputFile(HadoopFileIO.java:90)
    at org.apache.iceberg.TableMetadataParser.read(TableMetadataParser.java:266)
    at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$0(BaseMetastoreTableOperations.java:189)
    at org.apache.iceberg.BaseMetastoreTableOperations.lambda$refreshFromMetadataLocation$1(BaseMetastoreTableOperations.java:208)
    at org.apache.iceberg.util.Tasks$Builder.runTaskWithRetry(Tasks.java:413)
    at org.apache.iceberg.util.Tasks$Builder.runSingleThreaded(Tasks.java:219)
    at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:203)
    at org.apache.iceberg.util.Tasks$Builder.run(Tasks.java:196)
    at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:208)
    at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:185)
    at org.apache.iceberg.BaseMetastoreTableOperations.refreshFromMetadataLocation(BaseMetastoreTableOperations.java:180)
    at org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:176)
    at org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:97)
    at org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:80)
    at org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:47)
    at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.registerTableToTargetCatalog(CatalogMigrator.java:212)
    at org.projectnessie.tools.catalog.migration.api.CatalogMigrator.registerTable(CatalogMigrator.java:147)
    at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:159)
    at org.projectnessie.tools.catalog.migration.cli.BaseRegisterCommand.call(BaseRegisterCommand.java:38)
    at picocli.CommandLine.executeUserObject(CommandLine.java:2041)
    at picocli.CommandLine.access$1500(CommandLine.java:148)
    at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2461)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2453)
    at picocli.CommandLine$RunLast.handle(CommandLine.java:2415)
    at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2273)
    at picocli.CommandLine$RunLast.execute(CommandLine.java:2417)
    at picocli.CommandLine.execute(CommandLine.java:2170)
    at org.projectnessie.tools.catalog.migration.cli.CatalogMigrationCLI.main(CatalogMigrationCLI.java:48)
Caused by: java.io.IOException: No FileSystem for scheme: hdfs
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.iceberg.hadoop.Util.getFs(Util.java:56)
    ... 29 common frames omitted
2023-10-07 00:30:21,272 [main] WARN  org.apache.iceberg.util.Tasks - Retrying task after failure: Failed to get file system for path: hdfs://hadoopcluster/sin/ers/warehouse/tablespa..........

The same error repeats over and over.

ajantha-bhat commented 1 year ago

Failed to get file system for path: hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive/hive_data.db/T_ERS_EVENT_PERF/metadata/00003-f70dd253-791a-499e-9ebd-7a739a461960.metadata.json

Since you are using the HDFS file system, you can check whether any Hadoop configuration needs to be set. You can use the --source-catalog-hadoop-conf CLI option.
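
For example, a sketch only, assuming the option accepts comma-separated key=value pairs like the --*-catalog-properties options do; the exact keys depend on your cluster, and fs.hdfs.impl only helps if the HDFS client classes are actually on the class path:

java -jar iceberg-catalog-migrator-cli-0.2.1-SNAPSHOT.jar register \
  --source-catalog-type HIVE \
  --source-catalog-properties warehouse=hdfs://hadoopcluster/...,uri=thrift://hive-metastore:9083 \
  --source-catalog-hadoop-conf fs.defaultFS=hdfs://hadoopcluster,fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem \
  ...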

wilsonpenha commented 1 year ago

Well, this is my first time using it, so I don't know what to provide for this property. Both HADOOP_CONF_DIR and HIVE_CONF_DIR are already set, so could you please tell me what I should set?

ajantha-bhat commented 1 year ago

Pasting the solution from the Zulip chat discussion:

java -Djavax.net.ssl.trustStore=/etc/security/clientKeys/client-truststore.jks \
  -Djavax.net.ssl.trustStorePassword=admin1234 \
  -Dhadoop.configuration.addResources=$HADOOP_CONF_DIR/core-site.xml \
  -Dhadoop.configuration.addResources=$HADOOP_CONF_DIR/hdfs-site.xml \
  -Dhadoop.configuration.addResources=$HADOOP_CONF_DIR/hive-site.xml \
  -jar iceberg-catalog-migrator-cli-0.2.1-SNAPSHOT.jar \
  register \
  --source-catalog-type HIVE \
  --source-catalog-properties warehouse=hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive,uri=thrift://hadoopoozie1:9083 \
  --identifiers hive_data.t_ers_event_perf,hive_data.T_KWH_MATCH_RECORD_PERF \
  --target-catalog-type NESSIE \
  --target-catalog-properties uri=http://localhost:19120/api/v1,ref=main,warehouse=hdfs://hadoopcluster/sin/ers/warehouse/tablespace/external/hive
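
The -Dhadoop.configuration.addResources flags tell the tool to load the cluster's site files into its Hadoop Configuration, so the HDFS client can resolve the hdfs:// scheme and the hadoopcluster HA nameservice. A minimal standalone sketch of the same mechanism, useful for sanity-checking a client environment (the class below is hypothetical, not part of the CLI):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical standalone check, not part of the migrator CLI.
public class HdfsConfCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Load the cluster's site files so fs.defaultFS and the HA
    // nameservice mapping for "hadoopcluster" are visible.
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
    // Fails with "No FileSystem for scheme: hdfs" when hadoop-hdfs
    // is missing from the class path.
    FileSystem fs = new Path("hdfs://hadoopcluster/").getFileSystem(conf);
    System.out.println("Connected to: " + fs.getUri());
  }
}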

wilsonpenha commented 1 year ago

Hey, I forgot to mention the solution here. Would this mean a code-base change by someone, or do you guys want to test it more first?

ajantha-bhat commented 1 year ago

I think we can test it more by manually supplying the jar on the class path and using Hadoop 2.7.3.
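
For reference, manually supplying the jar could look like this (a sketch; the hadoop-hdfs jar path is illustrative, and the main class is the one from the stack trace above, since java -cp, unlike java -jar, requires naming it explicitly):

java -cp iceberg-catalog-migrator-cli-0.2.1-SNAPSHOT.jar:/path/to/hadoop-hdfs-2.7.3.jar \
  org.projectnessie.tools.catalog.migration.cli.CatalogMigrationCLI \
  register ...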

wilsonpenha commented 1 year ago

The problem with Hadoop 2.7.3 is that it uses sun.nio.ch.DirectBuffer.cleaner() from Java 1.8, which was removed in Java 11, causing an exception at the final stage of FileInputStream. I looked into the Java code and verified that it won't work with Java 11; you can see the stack trace in the Zulip chat. Another thing: we use Hadoop 3.3.1, so we could try Hadoop 3.0.0. Anyway, there could be different builds, one for Hadoop 2 and another for Hadoop 3, like Spark does. What do you think?

Copying from the Zulip chat: Awesome :+1: We can have a PR to add the hadoop-hdfs dependency (or the user can manually add the jar to the class path), and I am not sure about changing the Hadoop version to 3.3.1, because Iceberg expects to work with Hadoop 2.7.3, and that's why the Iceberg repo also keeps that version.

Hadoop 2.7.3 has an implementation that requires a Java 1.8 runtime, as the sun.misc.Cleaner form of sun.nio.ch.DirectBuffer.cleaner() was removed after Java 8; see the full stack trace above:

Exception in thread "main" java.lang.NoSuchMethodError: 'sun.misc.Cleaner sun.nio.ch.DirectBuffer.cleaner()'
    at org.apache.hadoop.crypto.CryptoStreamUtils.freeDB(CryptoStreamUtils.java:41)
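
For reference, the failing method in Hadoop 2.7.3 looks roughly like this (paraphrased, not verbatim):

// Roughly org.apache.hadoop.crypto.CryptoStreamUtils#freeDB in Hadoop 2.7.3.
public static void freeDB(java.nio.ByteBuffer buffer) {
  if (buffer instanceof sun.nio.ch.DirectBuffer) {
    // Compiled against Java 8, where cleaner() returns sun.misc.Cleaner;
    // on Java 9+ it returns jdk.internal.ref.Cleaner, so this call site
    // fails at runtime with the NoSuchMethodError above.
    final sun.misc.Cleaner bufferCleaner =
        ((sun.nio.ch.DirectBuffer) buffer).cleaner();
    bufferCleaner.clean();
  }
}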