numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Querying Hudi table in Glue using Spark SQL #237

Open torvalds-dev-testbot[bot] opened 3 months ago

torvalds-dev-testbot[bot] commented 3 months ago

Describe the problem you faced

I am seeking guidance on the special configuration or steps required to query a Hudi table in Glue using Spark SQL directly.


torvalds-dev-testbot[bot] commented 3 months ago

Relevant Sources:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html

https://medium.com/@devjain1299/apache-hudi-on-aws-glue-a-step-by-step-guide-503c34a9aa95

https://hudi.apache.org/docs/quick-start-guide

https://github.com/apache/hudi/issues/10358

torvalds-dev-testbot[bot] commented 3 months ago

Hello! 👋 Please wait while I generate the follow-up questions; this usually takes about 30 seconds.

torvalds-dev-testbot[bot] commented 3 months ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/237

torvalds-dev-testbot[bot] commented 3 months ago

To query a Hudi table in AWS Glue using Spark SQL directly, follow these steps:

  1. Set the --datalake-formats job parameter to "hudi", and add a --conf job parameter whose value sets the Spark properties Hudi needs to be handled correctly in AWS Glue.
  2. Lake Formation permission support is enabled by default in AWS Glue 4.0, so no additional configuration is needed to read from or write to Lake Formation-registered Hudi tables.
  3. To use a Hudi version that AWS Glue does not bundle, supply your own Hudi JAR files via the --extra-jars job parameter.
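
As a sketch, the job parameters from step 1 (and the optional step 3) look like this in the Glue job configuration. The S3 JAR path and bundle version are hypothetical placeholders:

```
Key:   --datalake-formats
Value: hudi

# Only if you need a Hudi version that Glue does not bundle:
Key:   --extra-jars
Value: s3://my-bucket/jars/hudi-spark3-bundle_2.12-0.14.1.jar
```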

For Spark configuration, set spark.serializer, spark.sql.hive.convertMetastoreParquet, and spark.sql.extensions so that Spark reads and writes Hudi tables correctly in Glue.
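
Concretely, these Spark properties are usually chained into the value of the --conf job parameter. This mirrors the AWS Glue Hudi documentation; shown on multiple lines for readability, the value is a single string:

```
Key:   --conf
Value: spark.serializer=org.apache.spark.serializer.KryoSerializer
       --conf spark.sql.hive.convertMetastoreParquet=false
       --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
```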

Further steps involve setting up a Glue ETL job with these configurations: connect to your data sources, convert the DynamicFrames to Spark DataFrames, specify the database and table names for the Hudi table, and set Hudi properties such as storage type, record key field, table name, operation type, and partition path field.
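
As an illustration, the Hudi write properties above are typically passed as a plain options map. The table, key, and field names here are hypothetical; the option keys are standard Hudi datasource options:

```python
# Hudi write options for an upsert into a copy-on-write table.
# "customer_hudi", "customer_id", "dt", and "updated_at" are hypothetical names.
hudi_options = {
    "hoodie.table.name": "customer_hudi",                      # table name
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",     # storage type (COPY_ON_WRITE or MERGE_ON_READ)
    "hoodie.datasource.write.recordkey.field": "customer_id",  # record key field
    "hoodie.datasource.write.partitionpath.field": "dt",       # partition path field
    "hoodie.datasource.write.operation": "upsert",             # operation type
    "hoodie.datasource.write.precombine.field": "updated_at",  # dedupe field for upserts
}

# Inside the Glue job, after converting the DynamicFrame to a Spark DataFrame:
# df.write.format("hudi").options(**hudi_options).mode("append") \
#     .save("s3://my-bucket/hudi/customer_hudi/")
```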

When integrating with the Glue Data Catalog, read Hudi tables with GlueContext.create_data_frame.from_catalog(), and make sure the appropriate Lake Formation permissions are granted: SELECT for read operations and SUPER for write operations.
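
A minimal sketch of that catalog read, assuming a Glue 4.0 job environment where awsglue is available and using hypothetical database/table names:

```python
# Reading a catalog-registered Hudi table in a Glue job (requires the
# awsglue runtime, so the Glue-specific lines are shown as comments here):
#
#   from awsglue.context import GlueContext
#   glue_context = GlueContext(spark.sparkContext)
#   df = glue_context.create_data_frame.from_catalog(
#       database="hudi_db", table_name="customer_hudi"
#   )
#   df.createOrReplaceTempView("customer_hudi")
#
# Once registered as a view, the table can be queried with Spark SQL directly:
query = "SELECT customer_id, dt FROM customer_hudi WHERE dt = '2024-01-01'"
# spark.sql(query).show()   # inside the Glue job, where `spark` is the session
```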

For detailed guides on setting up Hudi with AWS Glue, refer to the provided URLs.


torvalds-dev-testbot[bot] commented 3 months ago

Good answer.

torvalds-dev-testbot[bot] commented 3 months ago

Abhishek posted:

https://devs-numberlabs.slack.com/archives/C06C78R8KGW/p1722432674478679?thread_ts=1722445277609949&cid=C06C78R8KGW