numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Hello! :wave: We have a CoW table which is using the BLOOM index type #203

Open torvalds-dev-testbot[bot] opened 10 months ago

torvalds-dev-testbot[bot] commented 10 months ago


**Describe the problem you faced**

Hello! :wave:

We have a CoW table which is using the BLOOM index type. We started seeing slower Spark jobs when doing upserts to the table. We are using Hudi 0.12.2.

The following metadata options are enabled:

```
"hoodie.metadata.index.async": "true"
"hoodie.metadata.index.bloom.filter.enable": "true"
"hoodie.metadata.index.column.stats.enable": "true"
```
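
For reference, a minimal sketch (assuming a PySpark writer; the table name, path, and key fields are hypothetical) of how these options might be passed on an upsert:

```python
# Sketch: passing the metadata-index options on a Hudi upsert.
# Table name, path, and the BLOOM index setting below mirror the issue;
# everything else is an illustrative assumption.
hudi_options = {
    "hoodie.table.name": "mytable",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "BLOOM",
    "hoodie.metadata.index.async": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
}

# With a SparkSession and a DataFrame `df` available, the write would look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://hudi-s3-bucket/mytable")
```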

One thing we discovered is that in the Spark UI we have long gaps between jobs.
During those gaps, the log shows messages similar to:
`24/01/17 09:36:11 INFO S3NativeFileSystem: Opening 's3://hudi-s3-bucket/mytable/.hoodie/metadata/.hoodie/20231101223454865001.commit' for reading`
It seems that we are loading all the commits since the table was created.
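
As a quick sanity check on how many commit files the metadata timeline actually holds, something like the following could be run against a locally synced copy of the `.hoodie/metadata/.hoodie` directory (the helper and the demo directory here are hypothetical, not part of Hudi):

```python
import os
import tempfile

def count_commit_files(timeline_dir: str) -> int:
    """Count completed .commit files in a Hudi timeline directory."""
    return sum(1 for name in os.listdir(timeline_dir) if name.endswith(".commit"))

# Demo with a throwaway directory standing in for .hoodie/metadata/.hoodie:
with tempfile.TemporaryDirectory() as d:
    for ts in ("20231101223454865001", "20231102010000000001"):
        open(os.path.join(d, ts + ".commit"), "w").close()
    open(os.path.join(d, "hoodie.properties"), "w").close()
    print(count_commit_files(d))  # prints 2
```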

Is that an expected behaviour?
Are there any ways to improve that?
torvalds-dev[bot] commented 10 months ago

torvalds.dev is analyzing the ticket

torvalds-dev[bot] commented 10 months ago

Plan

  1. Read the text from the markdown file rfc/rfc-37/rfc-37.md.
  2. Summarize the extracted text.
  3. Understand the issue described by the user.
  4. Suggest relevant actions that should be taken.
  5. Provide the names of files that are relevant to the issue.

Execution Steps

  1. Read the text from the markdown file rfc/rfc-37/rfc-37.md:
# rfc-37.md is plain markdown, not a PDF, so it can be read directly
with open('rfc/rfc-37/rfc-37.md') as f:
    text = f.read()
  2. Summarize the extracted text:
# `languagetools` is a placeholder for whatever summarization utility is available
from languagetools import summarizer

summary = summarizer.summarize(text)
  3. Understand the issue described by the user:

The user is facing a problem with a Copy-On-Write (CoW) table that uses the BLOOM index type. They have noticed slower Spark jobs when performing upsert operations on the table. They are using Hudi version 0.12.2 and have enabled certain options for metadata. They have observed long gaps between jobs in the Spark UI and during these gaps, they see messages in the log indicating that all the commits since the table was created are being loaded. They are asking if this is expected behavior and if there are ways to improve it.

  4. Suggest relevant actions that should be taken:

Based on the issue described, the user may be facing performance issues because every commit since the table was created is being read from the metadata table's timeline. This could be due to the number of commits that have accumulated. Some potential actions that could be taken include:

  * Verify that archival is running for the table (and its metadata table) so that old commits leave the active timeline; `hoodie.keep.min.commits` and `hoodie.keep.max.commits` control the archival window.
  * Tighten `hoodie.cleaner.commits.retained` if such a long retention is not required.
  * Consider upgrading from 0.12.2, since later Hudi releases include metadata table performance improvements.
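
For instance, a hedged sketch of archival/cleaner settings that bound the active timeline (the values are illustrative placeholders, not tuned recommendations):

```python
# Sketch: example archival/cleaner settings to bound the active timeline.
# The specific numbers are illustrative, not recommendations for this table.
archival_options = {
    "hoodie.keep.min.commits": "20",
    "hoodie.keep.max.commits": "30",
    "hoodie.cleaner.commits.retained": "10",
}
# Hudi requires hoodie.keep.min.commits > hoodie.cleaner.commits.retained,
# and hoodie.keep.max.commits >= hoodie.keep.min.commits.
```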

  5. Provide the names of files that are relevant to the issue:

The file rfc/rfc-37/rfc-37.md appears relevant: it describes the metadata-based Bloom index. Reviewing it may help the user understand how the Bloom index works in Hudi and whether any known issues or considerations affect their use case.