numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] How to store schema in schema registry for capturing nested arrays and JSONs with column names for Hudi table onboarding from Mongo data source? #253

Open torvalds-dev-testbot[bot] opened 3 months ago

torvalds-dev-testbot[bot] commented 3 months ago


Describe the problem you faced

Can someone suggest how to store the schema in a schema registry so that it captures all complex nested arrays and JSON documents with column names for Hudi table onboarding when the source is MongoDB (e.g., when the data has a static schema)?


torvalds-dev-testbot[bot] commented 3 months ago

Relevant Sources:

https://hudi.apache.org/docs/faq_storage

https://hudi.apache.org/docs/schema_evolution

https://aws.amazon.com/blogs/big-data/automate-schema-evolution-at-scale-with-apache-hudi-in-aws-glue/

https://github.com/apache/hudi/issues/10895

torvalds-dev-testbot[bot] commented 3 months ago

Hello! 👋 Please wait while I generate the follow-up questions; this usually takes about 30 seconds.

torvalds-dev-testbot[bot] commented 3 months ago

Based on the provided documentation snippets, to store the schema in a schema registry so that it captures all complex nested arrays and JSON documents with column names when onboarding a Hudi table from a Mongo source with static-schema data, you can follow these steps:

  1. Enable schema reconciliation with the config hoodie.datasource.write.reconcile.schema, so that incoming batches are reconciled against the current table schema.
  2. Configure the schema-evolution-on-write scenarios you need, such as adding nullable fields or promoting field datatypes.
  3. Use the Hudi utilities to ingest the data from Kafka with Confluent Kafka and the Schema Registry set up, modeling nested Mongo documents and arrays as named Avro records and arrays so that every column is captured (see the schema sketch after this list).
  4. Use the AWS Glue Data Catalog with Apache Hudi to register the tables and interact with the data stored in S3.
  5. Use Spark to write the data to a Hudi table with options such as table type, record key field, table name, and partitioning (see the write sketch below).
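
As a rough sketch of step 3, here is one way to model nested Mongo documents and arrays as an Avro schema with explicitly named fields, which is what the schema registry would store. The record and field names (Order, items, address, and so on) are hypothetical placeholders, not values from this thread; the schema is parsed locally with Avro's Schema.Parser to validate it before registering:

```scala
import org.apache.avro.Schema

object NestedSchemaSketch {
  // Hypothetical schema: nested Mongo documents become named Avro records and
  // Mongo arrays become Avro arrays, so every column keeps an explicit name.
  val avroSchemaJson: String =
    """{
      |  "type": "record",
      |  "name": "Order",
      |  "namespace": "com.example.mongo",
      |  "fields": [
      |    {"name": "orderId", "type": "string"},
      |    {"name": "items", "type": {
      |      "type": "array",
      |      "items": {
      |        "type": "record",
      |        "name": "Item",
      |        "fields": [
      |          {"name": "sku",   "type": "string"},
      |          {"name": "qty",   "type": "int"},
      |          {"name": "price", "type": ["null", "double"], "default": null}
      |        ]
      |      }
      |    }},
      |    {"name": "address", "type": {
      |      "type": "record",
      |      "name": "Address",
      |      "fields": [
      |        {"name": "city", "type": "string"},
      |        {"name": "zip",  "type": ["null", "string"], "default": null}
      |      ]
      |    }}
      |  ]
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    // Parse and pretty-print to confirm the schema is well-formed before
    // registering it with the schema registry.
    val schema = new Schema.Parser().parse(avroSchemaJson)
    println(schema.toString(true))
  }
}
```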

By following these steps and configurations, you can store a schema in the registry that captures complex nested data structures with named columns when onboarding data from a Mongo source into Hudi tables.
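
And a minimal sketch of step 5: a Spark datasource write with schema reconciliation enabled. The table name, key fields, partition field, and base path below are assumptions for illustration, not values from this thread; `df` is assumed to already hold the Mongo-sourced records:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object HudiWriteSketch {
  def writeOrders(df: DataFrame): Unit = {
    df.write
      .format("hudi")
      .option("hoodie.table.name", "orders")
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
      .option("hoodie.datasource.write.recordkey.field", "orderId")
      .option("hoodie.datasource.write.precombine.field", "updatedAt")
      .option("hoodie.datasource.write.partitionpath.field", "address.city")
      // Reconcile each incoming batch against the table schema so that, e.g.,
      // missing nullable columns are tolerated instead of failing the write.
      .option("hoodie.datasource.write.reconcile.schema", "true")
      .mode(SaveMode.Append)
      .save("s3://your-bucket/hudi/orders")
  }
}
```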


torvalds-dev-testbot[bot] commented 3 months ago

I'm not satisfied with the answer. Can you take another look at the answer you generated?

torvalds-dev-testbot[bot] commented 3 months ago

Ticket created successfully. Here is the link to the GitHub issue: https://github.com/numberlabs-developers/hudi/issues/253