[SUPPORT] Need help with querying a Hudi table in AWS Glue using Spark SQL

torvalds-dev-testbot[bot] commented 4 months ago

Describe the problem you faced

I'm trying to query a Hudi table in AWS Glue using Spark SQL, but I'm unsure which configurations or steps are required.

Environment Description

Hudi version : 0.14.1
Spark version : 3.3

torvalds-dev-testbot[bot] commented 4 months ago

Relevant Sources:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html

https://medium.com/@devjain1299/apache-hudi-on-aws-glue-a-step-by-step-guide-503c34a9aa95

https://hudi.apache.org/docs/quick-start-guide

https://github.com/apache/hudi/issues/10358

torvalds-dev-testbot[bot] commented 4 months ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/227

torvalds-dev-testbot[bot] commented 4 months ago

To query a Hudi table in AWS Glue directly with Spark SQL, follow these steps:

1. Set the '--datalake-formats' job parameter to 'hudi' in your AWS Glue job configuration. This makes the Hudi libraries available to Apache Spark in the job. Source: AWS Glue Documentation
2. Create a job parameter with the key '--conf' and set its value to the Spark settings Hudi needs, such as 'spark.serializer=org.apache.spark.serializer.KryoSerializer' and 'spark.sql.hive.convertMetastoreParquet=false'. The first selects the Kryo serializer; the second stops Spark from substituting its built-in Parquet reader for tables defined in the metastore. Source: Medium Article - Apache Hudi on AWS Glue
3. If the table is registered with AWS Lake Formation, grant the AWS Glue job's IAM role the necessary permissions: SELECT to read, SUPER to write. AWS Glue 4.0 supports Lake Formation permissions for Hudi tables, so reading or writing a registered table needs no additional configuration. Source: AWS Glue Documentation
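The job parameters described above can be sketched as a plain Python dictionary. This is only an illustration: `build_hudi_job_arguments` and `SPARK_CONFS` are hypothetical names, not part of any AWS or Hudi API. The one detail worth noting is that several Spark settings share a single '--conf' key, with each extra setting chained onto the value after a literal ' --conf '.

```python
# Minimal sketch of the Glue job parameters for reading Hudi tables.
# SPARK_CONFS and build_hudi_job_arguments are hypothetical names chosen
# for this example; only the parameter keys and values come from the thread.

SPARK_CONFS = [
    "spark.serializer=org.apache.spark.serializer.KryoSerializer",
    "spark.sql.hive.convertMetastoreParquet=false",
]


def build_hudi_job_arguments(spark_confs=SPARK_CONFS):
    """Return Glue job parameters as a dict of key/value strings."""
    return {
        # Makes the Hudi libraries available to the job.
        "--datalake-formats": "hudi",
        # A job has one '--conf' key, so additional settings are
        # appended to its value, each preceded by ' --conf '.
        "--conf": " --conf ".join(spark_confs),
    }


if __name__ == "__main__":
    for key, value in build_hudi_job_arguments().items():
        print(f"{key} = {value}")
```

The resulting pairs can be entered as job parameters in the AWS Glue console, or passed as the `DefaultArguments` of a job definition if you create the job programmatically.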

In summary, by configuring the job parameters, including Hudi-specific settings, and ensuring appropriate Lake Formation permissions, you can successfully query a Hudi table in AWS Glue using Spark SQL directly.
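Once the job is configured, the query itself is ordinary Spark SQL. The sketch below assumes the Hudi table is already registered in the AWS Glue Data Catalog and that the job can see the catalog; the database and table names (`my_db`, `my_hudi_table`) and both helper functions are hypothetical. The `spark` argument is whatever Spark session the Glue script already holds.

```python
import re

# Hypothetical helpers for this example; neither is part of Glue or Hudi.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_query(database, table, limit=10):
    """Build a SELECT statement for a catalog table, rejecting odd names."""
    for name in (database, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return f"SELECT * FROM `{database}`.`{table}` LIMIT {int(limit)}"


def query_hudi_table(spark, database, table, limit=10):
    """Run the query on an existing Spark session and return the result."""
    return spark.sql(build_query(database, table, limit))


# Inside a Glue job script, usage might look like this (assumed names):
#   spark = glueContext.spark_session
#   query_hudi_table(spark, "my_db", "my_hudi_table").show()
```

Keeping the SQL construction separate from the Spark call makes the statement easy to inspect or log before it runs.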