[SUPPORT] BigQuery support in Hudi using PySpark code

numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.

https://hudi.apache.org/

Apache License 2.0

[SUPPORT] BigQuery support in Hudi using PySpark code #255

Open torvalds-dev-testbot[bot] opened 2 months ago

torvalds-dev-testbot[bot] commented 2 months ago

Describe the problem you faced

I am looking for resources on how to implement Hudi in PySpark for BigQuery sync. I cannot find any relevant information on this.

torvalds-dev-testbot[bot] commented 2 months ago

Relevant Sources:

https://dzone.com/articles/delta-hudi-and-iceberg-the-data-lakehouse-trifecta

https://hudi.apache.org/docs/sql_queries

https://hudi.apache.org/docs/gcp_bigquery

https://hudi.apache.org/docs/syncing_metastore

https://github.com/apache/hudi/issues/7117

torvalds-dev-testbot[bot] commented 2 months ago

Yes. A Hudi table written with PySpark can be queried through the Spark datasource like any other source, provided Hive sync is not enabled.
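
As a rough illustration of that write-then-query path, here is a minimal sketch. The table name `trips` and the field names `uuid`, `ts`, and `city` are hypothetical, and `write_and_read_back` assumes a SparkSession that was started with the Hudi Spark bundle on its classpath, so it is defined here but not called.

```python
from typing import Dict


def hudi_write_options(table_name: str, record_key: str,
                       precombine_field: str, partition_field: str) -> Dict[str, str]:
    """Build the core Hudi datasource options for an upsert write."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": "upsert",
    }


def write_and_read_back(spark, df, base_path: str, options: Dict[str, str]):
    """Write df as a Hudi table, then load it back via the Spark datasource.

    Assumes `spark` is a SparkSession with the Hudi bundle available.
    """
    df.write.format("hudi").options(**options).mode("append").save(base_path)
    return spark.read.format("hudi").load(base_path)


# Hypothetical table and column names, for illustration only.
opts = hudi_write_options("trips", "uuid", "ts", "city")
```

With a real session, `write_and_read_back(spark, df, "gs://<bucket>/trips", opts)` would return a DataFrame that can be filtered or registered as a temp view like any other source.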

torvalds-dev-testbot[bot] commented 2 months ago

Yes, as per the documentation snippets provided, Hudi supports querying a dataset written via PySpark using the Spark datasource if Hive sync is not enabled.
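
For the BigQuery sync part of the original question, the sketch below collects the options that would be passed alongside the usual Hudi write options in a PySpark job. The option keys and the sync tool class name are written as I recall them from the Hudi BigQuery page listed under Relevant Sources (https://hudi.apache.org/docs/gcp_bigquery); treat them as assumptions and check them against that page for your Hudi version. The project, dataset, bucket, and field names are hypothetical, and the job is assumed to have the Hudi GCP bundle on its classpath.

```python
from typing import Dict


def bigquery_sync_options(project_id: str, dataset: str, location: str,
                          table: str, base_path: str,
                          partition_field: str) -> Dict[str, str]:
    """Collect Hudi options that request a BigQuery sync after each write.

    Keys are assumptions taken from the Hudi BigQuery docs; verify them
    against the Hudi release you run.
    """
    base = base_path.rstrip("/")
    return {
        # Turn on meta sync and select the BigQuery sync tool.
        "hoodie.datasource.meta.sync.enable": "true",
        "hoodie.meta.sync.client.tool.class":
            "org.apache.hudi.gcp.bigquery.BigQuerySyncTool",
        # Where the external table should be created in BigQuery.
        "hoodie.gcp.bigquery.sync.project_id": project_id,
        "hoodie.gcp.bigquery.sync.dataset_name": dataset,
        "hoodie.gcp.bigquery.sync.dataset_location": location,
        "hoodie.gcp.bigquery.sync.table_name": table,
        # Where the Hudi table lives and how it is partitioned.
        "hoodie.gcp.bigquery.sync.base_path": base,
        "hoodie.gcp.bigquery.sync.source_uri": f"{base}/{partition_field}=*",
        "hoodie.gcp.bigquery.sync.source_uri_prefix": f"{base}/",
        "hoodie.gcp.bigquery.sync.partition_fields": partition_field,
        # Table layout settings the docs ask for on the write side.
        "hoodie.datasource.write.hive_style_partitioning": "true",
        "hoodie.datasource.write.drop.partition.columns": "true",
        "hoodie.partition.metafile.use.base.format": "true",
    }


# Hypothetical values, for illustration only.
bq_opts = bigquery_sync_options(
    "my-project", "my_dataset", "US", "trips", "gs://my-bucket/trips/", "city"
)
```

These would be merged with the regular write options, for example `df.write.format("hudi").options(**write_opts, **bq_opts).mode("append").save(base_path)`, where `write_opts` holds the table name, record key, and partition path settings.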

torvalds-dev-testbot[bot] commented 2 months ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/255