numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Documentation for running Hudi and Delta formats on the same Spark 3.x cluster #247

Open torvalds-dev-testbot[bot] opened 1 month ago

torvalds-dev-testbot[bot] commented 1 month ago

Describe the problem you faced

Is there any documentation available for running both Hudi and Delta formats on the same Spark 3.x cluster to avoid issues with dependent jars?

torvalds-dev-testbot[bot] commented 1 month ago

Relevant Sources:

https://hudi.apache.org/docs/faq_writing_tables

https://hudi.apache.org/docs/quick-start-guide

https://aws.amazon.com/blogs/big-data/part-1-getting-started-introducing-native-support-for-apache-hudi-delta-lake-and-apache-iceberg-on-aws-glue-for-apache-spark/

https://hudi.apache.org/docs/faq_integrations

https://apache-hudi.slack.com/archives/C4D716NPQ/p1712686953170449

torvalds-dev-testbot[bot] commented 1 month ago

Hello! 👋 Please wait while I generate the follow-up questions; this usually takes about 30 seconds.

torvalds-dev-testbot[bot] commented 1 month ago

Follow-up questions:

  1. Which specific versions of Spark and Hudi are you running on the cluster where Hudi and Delta are used together?
  2. What configurations do you currently have in place for running both formats on the Spark 3.x cluster, particularly around dependent jars?
  3. What workflow do you have in mind for integrating Hudi and Delta on the same Spark 3.x cluster so that it runs without jar-related issues?

torvalds-dev-testbot[bot] commented 1 month ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/247

torvalds-dev-testbot[bot] commented 1 month ago

Hello! 👋 Please wait while I search the knowledge base for an answer; this usually takes about 30 seconds.

torvalds-dev-testbot[bot] commented 1 month ago

To run both the Hudi and Delta formats on the same Spark 3.x cluster without dependent-jar conflicts, you can follow these steps:

  1. Disable Spark's dynamic allocation feature by setting "spark.dynamicAllocation.enabled": "false" in the Spark configuration, as shown in the session sketch after this list. (source: https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-transparent-tpc-ds-lakehouse-performance-benchmarks)

  2. Configure Hudi table creation to use Hive-style partitioning by setting hoodie.datasource.write.hive_style_partitioning: true. This creates partition folders laid out the way Hive expects; see the write sketch at the end of this answer. (source: https://hudi.apache.org/docs/faq_writing_tables)

  3. Register your Hudi table with the Apache Hive metastore for seamless integration, using the standalone Hive Sync tool, the Hudi Streamer tool, or datasource options (also covered in the write sketch below). (source: https://hudi.apache.org/docs/faq_integrations)
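
As a concrete starting point, here is a minimal sketch of a SparkSession wired for both formats. The version pairing is an assumption for illustration (Spark 3.4 with hudi-spark3.4-bundle_2.12:0.14.1 and delta-core_2.12:2.4.0); substitute the coordinates that match your cluster's Spark minor version.

```python
from pyspark.sql import SparkSession

# Sketch under assumed versions: Spark 3.4 + Hudi 0.14.1 + Delta 2.4.0.
# Pick a Hudi bundle and a Delta release built for the SAME Spark minor version.
spark = (
    SparkSession.builder
    .appName("hudi-and-delta")
    # Put both format bundles on the classpath; mismatched transitive jars
    # are the usual failure mode, so prefer the Spark-version-specific bundles.
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1,"
            "io.delta:delta-core_2.12:2.4.0")
    # Step 1: disable dynamic allocation.
    .config("spark.dynamicAllocation.enabled", "false")
    # Hudi requires Kryo serialization; Delta works with it as well.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Both SQL extensions can be registered side by side, comma-separated.
    .config("spark.sql.extensions",
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension,"
            "io.delta.sql.DeltaSparkSessionExtension")
    # Only one implementation can own spark_catalog; here Delta's catalog
    # takes it, and Hudi tables are still read/written via the DataSource API.
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```

Note the one point of real contention: spark.sql.catalog.spark_catalog. Hudi's quick start suggests its HoodieCatalog and Delta requires its DeltaCatalog, but only one class can hold that setting, so one of the two formats has to work through the DataSource API rather than catalog-level SQL DDL.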

By following the above steps, you can run both Hudi and Delta formats on the same Spark 3.x cluster while avoiding jar-dependency issues.
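
For steps 2 and 3, here is a hedged sketch of the Hudi write path; the table name, path, and the id/dt fields are illustrative assumptions, and df stands for any existing DataFrame created on the session above.

```python
# Illustrative names only: my_table, /tmp/hudi/my_table, and the id/dt
# fields are assumptions, not values taken from this issue.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "dt",
    # Step 2: Hive-style partition folders, e.g. dt=2024-01-01/.
    "hoodie.datasource.write.hive_style_partitioning": "true",
    # Step 3: sync the table to the Hive metastore via datasource options.
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "default",
    "hoodie.datasource.hive_sync.table": "my_table",
    "hoodie.datasource.hive_sync.partition_fields": "dt",
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/my_table"))
```

A Delta write on the same session is simply df.write.format("delta").save(path), so the two formats coexist table by table rather than competing for one configuration.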
