Open torvalds-dev-testbot[bot] opened 3 months ago
Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/227
To query a Hudi table in AWS Glue using Spark SQL directly, follow these steps:
1. Specify 'hudi' as the value of the '--datalake-formats' job parameter in your AWS Glue job configuration. This tells AWS Glue to load its bundled Hudi libraries so that Apache Spark can read and write Hudi tables. Source: AWS Glue Documentation
2. Create a key named '--conf' for your AWS Glue job and set its value to 'spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false'. The first setting selects the Kryo serializer that Hudi expects; the second stops Spark from substituting its built-in Parquet reader, so the table is read through Hudi. Source: Medium Article - Apache Hudi on AWS Glue
3. If the table is registered with Lake Formation, grant the AWS Glue job IAM role the necessary permissions: SELECT to read, SUPER to write. Lake Formation permission support for Hudi tables is enabled by default in AWS Glue 4.0, so no additional configuration is needed. Source: AWS Glue Documentation
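The job parameters and the permission grant described in the steps above can be sketched in Python as plain dictionaries, ready to pass to boto3. This is a minimal sketch, not the only way to configure a job: the helper names `hudi_job_arguments` and `lake_formation_grant` are hypothetical, and it assumes the console's "Super" permission corresponds to `ALL` in the Lake Formation API.

```python
def hudi_job_arguments():
    """Return the AWS Glue job parameters that enable Hudi support."""
    spark_confs = [
        "spark.serializer=org.apache.spark.serializer.KryoSerializer",
        "spark.sql.hive.convertMetastoreParquet=false",
    ]
    return {
        "--datalake-formats": "hudi",
        # Glue takes a single --conf key, so further settings are
        # chained inside its value with " --conf ".
        "--conf": " --conf ".join(spark_confs),
    }


def lake_formation_grant(role_arn, database, table, write=False):
    """Build keyword arguments for boto3's lakeformation grant_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        # Assumption: "Super" in the console is ALL in the API.
        "Permissions": ["ALL"] if write else ["SELECT"],
    }
```

The first dictionary could be supplied as `DefaultArguments` to `boto3.client("glue").create_job(...)`, and the second unpacked into `boto3.client("lakeformation").grant_permissions(**...)`; the role ARN, database, and table names are whatever your environment uses.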
In summary, by configuring the job parameters, including Hudi-specific settings, and ensuring appropriate Lake Formation permissions, you can successfully query a Hudi table in AWS Glue using Spark SQL directly.
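Once the job is configured this way, the query itself is ordinary Spark SQL. The following is a rough sketch that assumes the Hudi table is already registered in the AWS Glue Data Catalog and that the job can use the catalog as its metastore; the helper names and the database and table names in the usage note are hypothetical placeholders.

```python
def build_select(database, table, limit=10):
    """Return a Spark SQL statement that reads a catalog-registered table."""
    return f"SELECT * FROM `{database}`.`{table}` LIMIT {int(limit)}"


def query_hudi_table(spark, database, table, limit=10):
    """Run the query on an existing SparkSession and return the DataFrame."""
    return spark.sql(build_select(database, table, limit))
```

Inside a Glue job script, the session would typically come from `GlueContext(SparkContext.getOrCreate()).spark_session`, after which `query_hudi_table(spark, "my_db", "my_hudi_table").show()` prints the first rows.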
Describe the problem you faced
I'm trying to query a Hudi table in AWS Glue using Spark SQL, but I'm unsure about the specific configurations or steps required for this.