torvalds-dev-testbot[bot] opened 10 months ago
torvalds.dev is analyzing the ticket
Understand the issue: The user is seeing slower Spark jobs when upserting to a CoW table that uses the BLOOM index type in Hudi 0.12.2. They have enabled several metadata-table options and have noticed long gaps between jobs in the Spark UI. During these gaps, the logs show that all commits since the table was created are being loaded. They are unsure whether this is expected behavior and are looking for ways to improve it.
Suggested actions:
Relevant files: The file rfc/rfc-37/rfc-37.md is relevant to this issue. It proposes a metadata-based Bloom index intended to boost the performance of the existing bloom index, which could address the slower Spark jobs reported here. The user should read this file to understand the proposed changes and how they might affect their use case.
Tips before filing an issue
Have you gone through our FAQs?
Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced

Hello! :wave: We have a CoW table using the BLOOM index type. We started seeing slower Spark jobs when doing upserts to the table. We are using Hudi 0.12.2.

The following metadata options are enabled:

"hoodie.metadata.enable": "true"
"hoodie.metadata.index.async": "true"
"hoodie.metadata.index.bloom.filter.enable": "true"
"hoodie.metadata.index.column.stats.enable": "true"

One thing we discovered is that the Spark UI shows long gaps between jobs. During those gaps, the log contains messages similar to:

24/01/17 09:36:11 INFO S3NativeFileSystem: Opening 's3://hudi-s3-bucket/mytable/.hoodie/metadata/.hoodie/20231101223454865001.commit' for reading

It seems that we are loading all the commits made since the table was created. Is that expected behaviour? Are there any ways to improve it?
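For context, the setup described above corresponds to a writer configuration along these lines. This is a minimal sketch, not the reporter's actual code: the table name, base path, and key/precombine fields are hypothetical placeholders; only the four metadata options, the BLOOM index type, and the CoW/upsert combination come from the report.

```python
# Sketch of the reported Hudi writer setup (placeholders marked below).
hudi_options = {
    "hoodie.table.name": "mytable",                        # placeholder
    "hoodie.datasource.write.recordkey.field": "id",       # placeholder
    "hoodie.datasource.write.precombine.field": "ts",      # placeholder
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.index.type": "BLOOM",
    # Metadata-table settings exactly as listed in the report:
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.async": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
}

# In a Spark session this would typically be applied as:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

Laying the options out this way makes it easy to confirm which metadata indexes are active when comparing against the gaps seen in the Spark UI.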
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Environment Description
Hudi version : 0.12.2
Spark version :
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) :
Additional context
Stacktrace