Open torvalds-dev-testbot[bot] opened 10 months ago
rfc/rfc-37/rfc-37.md
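from pathlib import Path
# rfc-37.md is a Markdown file, not a PDF, so read it as plain text.
text = Path('rfc/rfc-37/rfc-37.md').read_text(encoding='utf-8')
# `languagetools` is the module named in the original snippet; its API is not verified here.
from languagetools import summarizer
summary = summarizer.summarize(text)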
The user is facing a problem with a Copy-On-Write (CoW) table that uses the BLOOM index type. They have noticed slower Spark jobs when performing upsert operations on the table. They are using Hudi version 0.12.2 and have enabled certain options for metadata. They have observed long gaps between jobs in the Spark UI and during these gaps, they see messages in the log indicating that all the commits since the table was created are being loaded. They are asking if this is expected behavior and if there are ways to improve it.
Based on the issue described, the slowdown may come from loading every commit since the table was created, a cost that would grow with the size of the table and the number of commits. Some potential actions that could be taken include:
The file rfc/rfc-37/rfc-37.md seems to be relevant to the issue as it contains information about the Metadata-based Bloom Index, which could potentially be related to the user's issue. The user might find it helpful to review this file to understand more about how the Bloom Index works in Hudi and if there are any known issues or considerations that could be affecting their use case.
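For orientation, here is a minimal sketch of what a CoW upsert with the BLOOM index and the metadata-based bloom lookup might look like as Hudi write options. The helper name `bloom_upsert_options`, the table name, and the field names are hypothetical, and the option values are illustrative assumptions rather than the settings from the ticket. The option keys should be checked against the Hudi configuration reference for the version in use.

```python
# Hypothetical sketch: values are illustrative assumptions, not the
# reporter's actual settings. Verify each key against the Hudi docs
# for your version before relying on it.
def bloom_upsert_options(table_name: str, record_key: str, precombine: str) -> dict:
    """Build write options for a Copy-On-Write upsert that uses the BLOOM index."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.index.type": "BLOOM",
        # Metadata table plus the bloom filter partition described in RFC-37.
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.index.bloom.filter.enable": "true",
        # Ask the bloom index to read filters from the metadata table.
        "hoodie.bloom.index.use.metadata": "true",
    }


options = bloom_upsert_options("my_table", "id", "ts")

# With a SparkSession, a DataFrame `df`, and a table path `base_path`:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Keeping the options in a plain dict makes it easy to diff the configuration between a fast run and a slow run before changing anything on the table.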
Tips before filing an issue
Have you gone through our FAQs?
Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
Hello! :wave:
We have a CoW table which uses the BLOOM index type. We started seeing slower Spark jobs when upserting into the table. We are using Hudi 0.12.2.
The following options are enabled for metadata: