[SUPPORT] BigQuery support in Hudi using PySpark code

torvalds-dev-testbot[bot] commented 4 months ago

Describe the problem you faced

The user is looking for resources on using Hudi from PySpark with BigQuery sync.

torvalds-dev-testbot[bot] commented 4 months ago

Relevant Sources:

https://dzone.com/articles/delta-hudi-and-iceberg-the-data-lakehouse-trifecta

https://hudi.apache.org/docs/sql_queries

https://hudi.apache.org/docs/gcp_bigquery

https://hudi.apache.org/docs/syncing_metastore

https://github.com/apache/hudi/issues/7117

torvalds-dev-testbot[bot] commented 4 months ago

Yes. According to the Hudi documentation, Hudi tables written from PySpark can be synced to Google BigQuery. BigQuery reads them as external tables; this is supported for hive-style partitioned Copy-On-Write tables and for read-optimized queries on Merge-On-Read tables. As of Hudi 0.14.0, BigQuerySyncTool can sync tables using manifest files, which improves query performance and keeps schema changes in sync.

For syncing data to BigQuery with the manifest approach:

Set "hoodie.gcp.bigquery.sync.use_bq_manifest_file" to true.
The BigQuerySyncTool creates manifest files representing current base files and updates the BigQuery table schema accordingly.
Benefits include cost-efficient queries, synced schemas, improved query performance, and optimized partitioning.
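
As a rough illustration of the manifest approach from PySpark, here is a minimal sketch of the write options. Only "hoodie.gcp.bigquery.sync.use_bq_manifest_file" comes from the answer above; the other option keys and the BigQuerySyncTool class name are written from memory of the Hudi docs and should be checked against https://hudi.apache.org/docs/gcp_bigquery for your Hudi version. The helper names, bucket, project, dataset, and field names are all hypothetical, and the sketch assumes the Hudi Spark and GCP bundles are already on the Spark classpath.

```python
def build_bigquery_sync_options(table_name, project_id, dataset_name,
                                dataset_location, base_path,
                                partition_field, record_key, precombine_field):
    """Build Hudi write options that enable manifest-based BigQuery sync.

    Hypothetical helper: option keys other than use_bq_manifest_file are
    assumptions to verify against the Hudi BigQuery docs.
    """
    base = base_path.rstrip("/")
    return {
        # Standard Hudi write options
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # The answer above notes that hive-style partitioning is required
        "hoodie.datasource.write.hive_style_partitioning": "true",
        # Run a meta sync after each commit, using the BigQuery sync tool
        "hoodie.datasource.meta.sync.enable": "true",
        "hoodie.meta.sync.client.tool.class":
            "org.apache.hudi.gcp.bigquery.BigQuerySyncTool",
        # BigQuery target (assumed key names)
        "hoodie.gcp.bigquery.sync.project_id": project_id,
        "hoodie.gcp.bigquery.sync.dataset_name": dataset_name,
        "hoodie.gcp.bigquery.sync.dataset_location": dataset_location,
        "hoodie.gcp.bigquery.sync.table_name": table_name,
        "hoodie.gcp.bigquery.sync.base_path": base,
        "hoodie.gcp.bigquery.sync.source_uri": f"{base}/{partition_field}=*",
        "hoodie.gcp.bigquery.sync.source_uri_prefix": f"{base}/",
        "hoodie.gcp.bigquery.sync.partition_fields": partition_field,
        # The setting named in the answer above
        "hoodie.gcp.bigquery.sync.use_bq_manifest_file": "true",
    }


def write_with_bigquery_sync(df, base_path, options):
    """Append a Spark DataFrame to a Hudi table with the given options."""
    df.write.format("hudi").options(**options).mode("append").save(base_path)


# Example values; every name here is a placeholder.
options = build_bigquery_sync_options(
    table_name="trips",
    project_id="my-gcp-project",
    dataset_name="analytics",
    dataset_location="US",
    base_path="gs://my-bucket/trips/",
    partition_field="region",
    record_key="trip_id",
    precombine_field="updated_at",
)
```

With a DataFrame `df` in hand, `write_with_bigquery_sync(df, "gs://my-bucket/trips", options)` would write the table and then trigger the sync, provided the cluster has credentials for both GCS and BigQuery.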

In short, Hudi integrates with BigQuery through external tables, and the manifest-based sync keeps the BigQuery table's file list and schema aligned with the Hudi table.

torvalds-dev-testbot[bot] commented 4 months ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/226
