numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] BigQuery support in Hudi using PySpark code #226

Open torvalds-dev-testbot[bot] opened 1 month ago

torvalds-dev-testbot[bot] commented 1 month ago

Describe the problem you faced

The user is looking for resources on implementing Hudi in PySpark with BigQuery sync.

torvalds-dev-testbot[bot] commented 1 month ago

Relevant Sources:

https://dzone.com/articles/delta-hudi-and-iceberg-the-data-lakehouse-trifecta

https://hudi.apache.org/docs/sql_queries

https://hudi.apache.org/docs/gcp_bigquery

https://hudi.apache.org/docs/syncing_metastore

https://github.com/apache/hudi/issues/7117

torvalds-dev-testbot[bot] commented 1 month ago

Based on the documentation provided, yes: Apache Hudi supports syncing tables to Google BigQuery, and the sync can be configured from PySpark code. Hudi tables are created and queried in BigQuery as external tables; this is supported for hive-style partitioned Copy-On-Write tables and for read-optimized queries on Merge-On-Read tables. As of Hudi 0.14.0, the BigQuerySyncTool can sync tables to BigQuery using manifests, which improves query performance and keeps schema evolution in sync.

To sync data to BigQuery with the manifest approach:

  1. Set "hoodie.gcp.bigquery.sync.use_bq_manifest_file" to true.
  2. On each sync, the BigQuerySyncTool writes a manifest file listing the table's current base files and updates the BigQuery table schema to match.
  3. The benefits are cost-efficient queries, a BigQuery schema that stays in sync with the Hudi table, improved query performance, and optimized partitioning.
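The steps above might be wired into a PySpark write roughly as follows. This is a minimal sketch, not a verified configuration: the helper names `bigquery_sync_options` and `write_with_sync` are hypothetical, the project, dataset, table, and bucket values are placeholders, and the `hoodie.gcp.bigquery.sync.*` option names are written as I understand them from the Hudi BigQuery documentation linked under Relevant Sources, so check them against the docs for your Hudi version. It also assumes the Hudi Spark and GCP bundles are on the Spark classpath, and it omits the record key and precombine options that a real write needs.

```python
from typing import Dict

# Sync tool class that Hudi invokes after a successful write (assumed class name).
BIGQUERY_SYNC_TOOL = "org.apache.hudi.gcp.bigquery.BigQuerySyncTool"


def bigquery_sync_options(
    project_id: str,
    dataset: str,
    table: str,
    base_path: str,
    partition_field: str,
) -> Dict[str, str]:
    """Build Hudi write options that enable manifest-based BigQuery sync."""
    return {
        "hoodie.table.name": table,
        # BigQuery external tables expect hive-style partition folders.
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Run the BigQuery sync tool as part of the write.
        "hoodie.datasource.meta.sync.enable": "true",
        "hoodie.meta.sync.client.tool.class": BIGQUERY_SYNC_TOOL,
        "hoodie.gcp.bigquery.sync.project_id": project_id,
        "hoodie.gcp.bigquery.sync.dataset_name": dataset,
        "hoodie.gcp.bigquery.sync.table_name": table,
        "hoodie.gcp.bigquery.sync.base_path": base_path,
        "hoodie.gcp.bigquery.sync.source_uri": f"{base_path}/{partition_field}=*",
        "hoodie.gcp.bigquery.sync.source_uri_prefix": f"{base_path}/",
        "hoodie.gcp.bigquery.sync.partition_fields": partition_field,
        # Step 1 from the list above: use the manifest approach.
        "hoodie.gcp.bigquery.sync.use_bq_manifest_file": "true",
    }


def write_with_sync(df, options: Dict[str, str], base_path: str) -> None:
    """Append a Spark DataFrame to the Hudi table; the sync runs after the commit."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    opts = bigquery_sync_options(
        project_id="my-project",
        dataset="my_dataset",
        table="trips",
        base_path="gs://my-bucket/hudi/trips",
        partition_field="dt",
    )
    for key in sorted(opts):
        print(f"{key}={opts[key]}")
```

With a real SparkSession and DataFrame, `write_with_sync(df, opts, "gs://my-bucket/hudi/trips")` would perform the write. Keeping the options in a plain dictionary makes them easy to inspect and to merge with the record key and precombine settings your table requires.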

To summarize, Apache Hudi integrates with Google BigQuery through external tables, syncs data using manifest files, and keeps the BigQuery schema consistent with the Hudi table, which improves query performance.

torvalds-dev-testbot[bot] commented 1 month ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/226
