numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hi, I am trying to use an S3 bucket (MinIO) for storing incremental updates, and when I run the Spark job a ClassNotFoundException occurs at the Hive sync stage #209

Open torvalds-dev-testbot[bot] opened 10 months ago

torvalds-dev-testbot[bot] commented 10 months ago


Describe the problem you faced

Hi, I am trying to use an S3 bucket (MinIO) for storing incremental updates. When I run the Spark job, a ClassNotFoundException occurs at the stage where the Hive sync is done:

`java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found`

The question is: where should the AWS-related JARs be defined so that Hive can resolve the missing class?

```
spark-submit \
  --jars /opt/hive/lib/aws-java-sdk-1.11.30.jar,/opt/hive/lib/hadoop-aws-2.7.3.jar \
  --packages org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.hadoop:hadoop-aws:2.7.3 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field END_STR \
  --target-base-path s3a://table-format-hudi/pgw_sessions_main_mor \
  --target-table pgw_sessions_main_mor \
  --props /var/demo/config/kafka-source.properties_vaz \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=END_STR:timestamp \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd HH:mm:ss.sss" \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat="yyyy/MM/dd" \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
  --hoodie-conf hoodie.datasource.write.operation=upsert \
  --enable-sync \
  --hoodie-conf hoodie.compact.inline=true \
  --hoodie-conf hoodie.compact.schedule.inline=false \
  --hoodie-conf hoodie.compact.inline.max.delta.commits=4 \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:10000 \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
  --hoodie-conf hoodie.datasource.hive_sync.enable=true \
  --hoodie-conf hoodie.datasource.hive_sync.auto_create_datab=true \
  --hoodie-conf hoodie.datasource.hive_sync.database=hudidatabase \
  --hoodie-conf hoodie.datasource.hive_sync.table=hudi_data_test \
  --hoodie-conf hoodie.datasource.hive_sync.partition_fields=['DATE_STRING'] \
  --source-limit 20000
```


torvalds-dev[bot] commented 10 months ago

torvalds.dev is analyzing the ticket

torvalds-dev[bot] commented 10 months ago

It seems like you're facing a ClassNotFoundException for org.apache.hadoop.fs.s3a.S3AFileSystem. This class lives in the hadoop-aws module, which provides the S3A connector used to talk to AWS S3 and S3-compatible stores such as MinIO.

The error indicates that the required JAR files are not on the classpath at the point where the class is loaded. You have already included hadoop-aws and aws-java-sdk in your spark-submit command, but Hive does not appear to be able to find them.

Here is a plan to resolve this issue:

  1. Verify the paths of the aws-java-sdk and hadoop-aws JAR files: make sure that /opt/hive/lib/aws-java-sdk-1.11.30.jar and /opt/hive/lib/hadoop-aws-2.7.3.jar are correct and that the files actually exist at those locations (see the sketch after this list).

  2. Add the JARs to the Hive classpath: you need to add these JAR files to the Hive classpath, which you can do by setting the HIVE_AUX_JARS_PATH environment variable, for example in the hive-env.sh file located in the conf directory of your Hive installation (see the sketch after this list).

  3. Restart the Hive service: after updating hive-env.sh, restart your Hive services (HiveServer2 and, if you run one, the metastore) so that the change takes effect.

  4. Re-run Spark job: Now, you can re-run your Spark job and it should be able to find the required classes.
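
As a sketch for steps 1 and 2, assuming the JARs really are under /opt/hive/lib and that your distribution reads hive-env.sh from $HIVE_HOME/conf (adjust paths and file names to your setup):

```sh
# Step 1: confirm the JARs actually exist at the expected locations.
ls -l /opt/hive/lib/aws-java-sdk-1.11.30.jar /opt/hive/lib/hadoop-aws-2.7.3.jar

# Step 2: expose them to Hive via HIVE_AUX_JARS_PATH in $HIVE_HOME/conf/hive-env.sh.
# Depending on the Hive version, this variable accepts a comma-separated list of JAR
# files or a directory containing them; both variants are shown below.
export HIVE_AUX_JARS_PATH=/opt/hive/lib/aws-java-sdk-1.11.30.jar,/opt/hive/lib/hadoop-aws-2.7.3.jar
# export HIVE_AUX_JARS_PATH=/opt/hive/lib
```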

Please note that the exact steps might vary depending on your environment and Hive installation.
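
Separately, because the target base path is an s3a:// location on MinIO, the S3A connector also needs endpoint and credential settings. Below is a minimal sketch using the standard Hadoop fs.s3a.* properties; the endpoint, access key, and secret key are placeholders to adapt to your MinIO deployment:

```sh
spark-submit \
  --jars /opt/hive/lib/aws-java-sdk-1.11.30.jar,/opt/hive/lib/hadoop-aws-2.7.3.jar \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio:9000 \
  --conf spark.hadoop.fs.s3a.access.key=MINIO_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=MINIO_SECRET_KEY \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE
# ...followed by the same HoodieDeltaStreamer arguments as in the original command.
```

The same fs.s3a.* properties can also be placed in the core-site.xml that HiveServer2 reads, so that Hive itself can resolve the s3a:// table location once the hadoop-aws and aws-java-sdk JARs are on its classpath.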