numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT]MOR Hudi failed to upsert after upgrading EC2 instance. #92

Open numberlabs-developers opened 10 months ago

numberlabs-developers commented 10 months ago

Describe the problem you faced

We encountered errors in our production pipeline yesterday after increasing our EC2 instance type from c7g.4xlarge to c7g.8xlarge.

To Reproduce

Steps to reproduce the behavior: Upgrade the EC2 instance to c6i.8xlarge and increase Spark memory.
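For context, a Spark memory increase of this kind is usually made through `spark-submit` configuration flags. The values below are purely illustrative; the actual change made here is in the attached Case files, not reproduced in the ticket:

```shell
# Hypothetical sketch only -- the real configuration change is in the
# attached Case 1.txt / Case 2.txt. Scaling executor memory up to match
# a larger instance might look like:
spark-submit \
  --conf spark.executor.memory=16g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.driver.memory=8g \
  hudi_upsert_job.py   # hypothetical job script name
```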

Expected behavior

Job runs successfully with improved performance.

Environment Description

Hudi version : 0.13.0

Spark version : 3.3

Storage (HDFS/S3/GCS..) : S3

Running on Docker? (yes/no) : No

Additional context

Once we reverted the EC2 upgrade, the job started working properly again.

Stacktrace

Please find attached the change we made and the errors we got:

Case 1.txt Case 2.txt

torvalds-dev-testbot[bot] commented 10 months ago

torvalds.dev is analyzing the ticket

torvalds-dev-testbot[bot] commented 10 months ago

The issue mentioned here is related to the Apache Hudi project. Apache Hudi is a framework for managing large analytical datasets on distributed storage systems. It offers several capabilities, including upsert support, atomic data publishing, rollback support, and snapshot isolation.

Looking at the problem description, the user is encountering errors in a specific production pipeline job after increasing the Amazon EC2 instance type from c7g.4xlarge to c7g.8xlarge. They also increased the Spark memory as part of this change.

The user expected the job to run successfully with improved performance. However, that was not the case. When they reverted the EC2 upgrade, the job started working again, which indicates the issue is likely related to the change in EC2 instance type, the increase in Spark memory, or a combination of both.

The provided stack traces (Case 1.txt and Case 2.txt) would be crucial in troubleshooting the issue. Depending on the details in these stack traces, the user might need to investigate a potential memory allocation issue, examine differences between the two EC2 instance types, or review the job's Spark and Hudi configuration on the new instance.

Relevant files for this issue include the two attached stack trace files (Case 1.txt and Case 2.txt). Additionally, given that the problem is related to the Apache Hudi project, the project's README.md file could provide helpful context. Furthermore, any configuration files or logs related to the problematic job could also be useful in diagnosis.

As for the relevant actions:

  1. Analyze the stack trace files to pinpoint the error.
  2. Investigate differences in performance and configuration between the c7g.4xlarge and c7g.8xlarge EC2 instances.
  3. Verify that the increased Spark memory is appropriately allocated and does not exceed the limits of the new EC2 instance type.
  4. Check the job's Spark and Hudi configuration settings against the resources of the new instance.
  5. If the issue still persists, consider reaching out to the Apache Hudi community for support.
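Step 3 above can be sketched as a quick arithmetic check: the total of all executors' heap plus off-heap overhead (plus some headroom for the OS) must fit within the instance's physical memory. The instance sizes and the 10% overhead default below are assumptions for illustration, not values taken from the attached files:

```python
def fits_instance(instance_mem_gib, num_executors, executor_mem_gib,
                  overhead_fraction=0.10, os_reserve_gib=2):
    """Rough check that a Spark memory layout fits on one instance.

    Mirrors Spark's default spark.executor.memoryOverhead of
    max(384 MiB, 10% of executor memory). os_reserve_gib is headroom
    left for the OS and other daemons (an assumed value).
    """
    overhead = max(0.375, executor_mem_gib * overhead_fraction)
    needed = num_executors * (executor_mem_gib + overhead) + os_reserve_gib
    return needed <= instance_mem_gib

# A c7g.8xlarge has 64 GiB of memory:
# 6 executors x 8 GiB heap (+0.8 GiB overhead each) + 2 GiB reserve = 54.8 GiB
print(fits_instance(64, 6, 8))   # fits
# 8 executors x 8.8 GiB + 2 GiB reserve = 72.4 GiB, which does not fit
print(fits_instance(64, 8, 8))
```

If the increased Spark memory settings overshoot a budget like this, YARN or the OS can kill executors, which would explain a job that fails only on the resized instance.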

Remember to always take a backup before making any major change in your production environment.