numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hi, I am trying to use an S3 bucket (MinIO) for storing incremental updates, and when I run the Spark job a ClassNotFoundException occurs at the Hive sync stage #209

Open torvalds-dev-testbot[bot] opened 10 months ago

torvalds-dev-testbot[bot] commented 10 months ago


Describe the problem you faced

Hi, I am trying to use an S3 bucket (MinIO) for storing incremental updates. When I run the Spark job, a ClassNotFoundException occurs at the stage where the Hive sync is done:

`java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found`

The question is: where should the AWS-related JARs be defined so that Hive can resolve the missing class?

```
spark-submit \
  --jars /opt/hive/lib/aws-java-sdk-1.11.30.jar,/opt/hive/lib/hadoop-aws-2.7.3.jar \
  --packages org.apache.spark:spark-avro_2.12:3.0.1,org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.hadoop:hadoop-aws:2.7.3 \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type MERGE_ON_READ \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field END_STR \
  --target-base-path s3a://table-format-hudi/pgw_sessions_main_mor \
  --target-table pgw_sessions_main_mor \
  --props /var/demo/config/kafka-source.properties_vaz \
  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=END_STR:timestamp \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd HH:mm:ss.sss" \
  --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat="yyyy/MM/dd" \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
  --hoodie-conf hoodie.datasource.write.operation=upsert \
  --enable-sync \
  --hoodie-conf hoodie.compact.inline=true \
  --hoodie-conf hoodie.compact.schedule.inline=false \
  --hoodie-conf hoodie.compact.inline.max.delta.commits=4 \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver:10000 \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
  --hoodie-conf hoodie.datasource.hive_sync.enable=true \
  --hoodie-conf hoodie.datasource.hive_sync.auto_create_datab=true \
  --hoodie-conf hoodie.datasource.hive_sync.database=hudidatabase \
  --hoodie-conf hoodie.datasource.hive_sync.table=hudi_data_test \
  --hoodie-conf hoodie.datasource.hive_sync.partition_fields=['DATE_STRING'] \
  --source-limit 20000
```


torvalds-dev[bot] commented 10 months ago

torvalds.dev is analyzing the ticket

torvalds-dev[bot] commented 10 months ago

It seems like you're facing a ClassNotFoundException for org.apache.hadoop.fs.s3a.S3AFileSystem. This class lives in the hadoop-aws module, which provides the S3A connector used to talk to AWS S3 and S3-compatible stores such as MinIO.

The error indicates that the required JAR files are not on the classpath at the point where the class is loaded. You have already included hadoop-aws and aws-java-sdk in your spark-submit command, but Hive does not appear to be able to find them.

Here is a plan to resolve this issue:

  1. Verify the paths of the aws-java-sdk and hadoop-aws JAR files: make sure that /opt/hive/lib/aws-java-sdk-1.11.30.jar and /opt/hive/lib/hadoop-aws-2.7.3.jar are correct and that the files actually exist at those locations (see the sketch after this list).

  2. Add the JARs to the Hive classpath: you need to add these JAR files to the Hive classpath, which you can do by setting the HIVE_AUX_JARS_PATH environment variable, for example in the hive-env.sh file located in the conf directory of your Hive installation (see the sketch after this list).

  3. Restart the Hive service: after updating hive-env.sh, restart your Hive services (HiveServer2 and, if you run one, the metastore) so that the change takes effect.

  4. Re-run Spark job: Now, you can re-run your Spark job and it should be able to find the required classes.
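
As a sketch for steps 1 and 2, assuming the JARs really are under /opt/hive/lib and that your distribution reads hive-env.sh from $HIVE_HOME/conf (adjust paths and file names to your setup):

```sh
# Step 1: confirm the JARs actually exist at the expected locations.
ls -l /opt/hive/lib/aws-java-sdk-1.11.30.jar /opt/hive/lib/hadoop-aws-2.7.3.jar

# Step 2: expose them to Hive via HIVE_AUX_JARS_PATH in $HIVE_HOME/conf/hive-env.sh.
# Depending on the Hive version, this variable accepts a comma-separated list of JAR
# files or a directory containing them; both variants are shown below.
export HIVE_AUX_JARS_PATH=/opt/hive/lib/aws-java-sdk-1.11.30.jar,/opt/hive/lib/hadoop-aws-2.7.3.jar
# export HIVE_AUX_JARS_PATH=/opt/hive/lib
```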

Please note that the exact steps might vary depending on your environment and Hive installation.
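
Separately, because the target base path is an s3a:// location on MinIO, the S3A connector also needs endpoint and credential settings. Below is a minimal sketch using the standard Hadoop fs.s3a.* properties; the endpoint, access key, and secret key are placeholders to adapt to your MinIO deployment:

```sh
spark-submit \
  --jars /opt/hive/lib/aws-java-sdk-1.11.30.jar,/opt/hive/lib/hadoop-aws-2.7.3.jar \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.endpoint=http://minio:9000 \
  --conf spark.hadoop.fs.s3a.access.key=MINIO_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=MINIO_SECRET_KEY \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=false \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE
# ...followed by the same HoodieDeltaStreamer arguments as in the original command.
```

The same fs.s3a.* properties can also be placed in the core-site.xml that HiveServer2 reads, so that Hive itself can resolve the s3a:// table location once the hadoop-aws and aws-java-sdk JARs are on its classpath.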