numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hello! :wave: We have a CoW table which is using the BLOOM index type #207

Open torvalds-dev-testbot[bot] opened 10 months ago

torvalds-dev-testbot[bot] commented 10 months ago


Describe the problem you faced

Hello! :wave: We have a CoW table which is using the BLOOM index type. We started seeing slower Spark jobs when doing upserts to the table. We are using Hudi 0.12.2.

The following options are enabled for metadata:

    • "hoodie.metadata.enable": "true"
    • "hoodie.metadata.index.async": "true"
    • "hoodie.metadata.index.bloom.filter.enable": "true"
    • "hoodie.metadata.index.column.stats.enable": "true"

One thing we discovered is that in the Spark UI we have long gaps between jobs. During those gaps, we see log messages like:

24/01/17 09:36:11 INFO S3NativeFileSystem: Opening 's3://hudi-s3-bucket/mytable/.hoodie/metadata/.hoodie/20231101223454865001.commit' for reading

It seems that we are loading all the commits since the table was created. Is that expected behaviour? Are there any ways to improve it?
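For reference, the options listed above would typically be passed as Hudi write options on the upsert. The sketch below is a hypothetical illustration (table name, path, and DataFrame are placeholders, not taken from the issue):

```python
# Hudi write options matching the configuration described in the issue.
# "mytable" and the S3 path are placeholders for illustration only.
hudi_options = {
    "hoodie.table.name": "mytable",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "BLOOM",
    # Metadata-table options the reporter has enabled:
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.async": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
}

# With a SparkSession and a DataFrame `df` in scope, the upsert would be:
# df.write.format("hudi").options(**hudi_options) \
#   .mode("append").save("s3://hudi-s3-bucket/mytable")
```

Note that "hoodie.metadata.index.async": "true" moves index building off the write path but means the indexes are built by a background process, which can interact with how much of the metadata timeline a reader has to scan.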






torvalds-dev[bot] commented 10 months ago

torvalds.dev is analyzing the ticket

torvalds-dev[bot] commented 10 months ago

Plan

  1. Understand the issue: The user is seeing slower Spark jobs when upserting to a CoW table that uses the BLOOM index type on Hudi 0.12.2. Metadata-table options are enabled, including async indexing, bloom filter indexes, and column stats indexes. The Spark UI shows long gaps between jobs, and during those gaps the logs show every commit file under the metadata table's timeline being opened, suggesting all commits since table creation are being loaded. The user asks whether this is expected behaviour and how to improve it.

  2. Suggested actions:

    • Consider upgrading to a newer Hudi release if available, since metadata-table and index performance improvements are regularly made in later versions.
    • Review the Spark job configuration and resources (for example, executor memory and parallelism) to rule out resource pressure as a contributing factor.
    • If the issue persists, reach out to the Hudi community or support channels for more specific guidance.
  3. Relevant files: The file rfc/rfc-37/rfc-37.md is relevant to this issue. It proposes a metadata-table-based Bloom Index intended to speed up lookups in the existing bloom index, which bears directly on the slow upserts observed here. The user should read it to understand the proposed changes and how they might affect this use case.
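As a sketch of how the RFC-37 approach is switched on: in addition to enabling the metadata table and its bloom filter index, the bloom index itself must be told to look up filters from the metadata table rather than reading each data file's footer. The config keys below are an assumption and should be checked against the configuration reference for your Hudi version:

```python
# Hypothetical sketch: options to make BLOOM index lookups use the
# metadata table's bloom filter index (per RFC-37). Verify each key
# against the docs for your Hudi version before relying on it.
bloom_from_metadata = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    # Direct the BLOOM index to read bloom filters from the metadata table:
    "hoodie.bloom.index.use.metadata": "true",
}
```

When this path is active, the index lookup can avoid opening one file footer per data file, which is often the dominant cost of bloom index tagging on large S3-backed tables.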