Open: numberlabs-developers opened this issue 10 months ago
Analysis of the ticket by torvalds.dev:
This issue concerns the Apache Hudi project. Apache Hudi is a framework for managing large analytical datasets stored in distributed storage systems. Its capabilities include upsert support, atomic data publishing, rollback support, and snapshot isolation.
According to the problem description, the user is encountering errors in a specific job in their production pipeline after upgrading the Amazon EC2 instance type from c7g.4xlarge to c7g.8xlarge. They also increased the Spark memory as part of the same change.
The user expected the job to run successfully with improved performance, but it failed instead. When they reverted the EC2 upgrade, the job started working again. This suggests the issue is related to the change in instance type, the increase in Spark memory, or a combination of both.
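Two variables changed at once. One way to tell them apart is to rerun the job with each change applied separately before combining them. The following is a minimal sketch of that run order, assuming the job can be rerun on demand. Only the instance types come from the issue. The `isolation_runs` helper and the `baseline`/`increased` labels are hypothetical placeholders for the user's actual Spark memory settings.

```python
from itertools import product

# Variables that were changed together in the issue. The instance types come from the issue;
# "baseline"/"increased" are hypothetical placeholders for the user's real Spark memory settings.
INSTANCE_TYPES = ["c7g.4xlarge", "c7g.8xlarge"]
SPARK_MEMORY = ["baseline", "increased"]
KNOWN_GOOD = ("c7g.4xlarge", "baseline")


def isolation_runs(known_good=KNOWN_GOOD):
    """Hypothetical helper: list every combination, ordered by how many variables differ from the known-good run."""
    runs = list(product(INSTANCE_TYPES, SPARK_MEMORY))

    def changed(run):
        # Count how many settings in this run differ from the known-good configuration.
        return sum(a != b for a, b in zip(run, known_good))

    # sorted() is stable, so runs with the same number of changes keep their product() order.
    return sorted(runs, key=changed)


for instance_type, memory in isolation_runs():
    print(f"run job on {instance_type} with {memory} Spark memory")
```

If the single-change runs succeed and only the combined run fails, that points to an interaction between the larger instance and the new memory settings rather than to either change alone.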
The attached stack traces (Case 1.txt and Case 2.txt) are the key evidence for troubleshooting. Depending on what they show, the user may need to investigate a memory allocation problem, differences between the two EC2 instance types, or the job's configuration for the Spark and Hudi versions in use (Spark 3.3, Hudi 0.13.0).
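A rough executor-sizing calculation can make the memory-allocation angle concrete. It shows how per-executor memory works out on each instance size. This is a minimal illustration, not the user's configuration. The vCPU and memory figures are the published EC2 specs for the instance types named in the issue. The helper `plan_executors` and its parameters `cores_per_executor`, `reserved_cores`, and `reserved_gib` are hypothetical assumptions. The overhead rule follows Spark's default executor memory overhead of max(384 MiB, 10% of executor memory).

```python
from dataclasses import dataclass

# Published EC2 specs (vCPUs, memory in GiB) for the instance types named in the issue.
INSTANCE_SPECS = {
    "c7g.4xlarge": (16, 32),
    "c7g.8xlarge": (32, 64),
    "c6i.8xlarge": (32, 64),
}

# Spark's default executor overhead: max(384 MiB, 0.10 * executor memory).
OVERHEAD_FACTOR = 0.10
MIN_OVERHEAD_GIB = 384 / 1024


@dataclass
class ExecutorPlan:
    executors_per_node: int
    heap_gib: float
    overhead_gib: float


def plan_executors(instance_type, cores_per_executor=5, reserved_cores=1, reserved_gib=4.0):
    """Hypothetical helper: split one node's cores and memory into equally sized executors.

    reserved_cores and reserved_gib are assumed allowances for the OS and node daemons.
    """
    vcpus, mem_gib = INSTANCE_SPECS[instance_type]
    executors = (vcpus - reserved_cores) // cores_per_executor
    if executors < 1:
        raise ValueError("cores_per_executor exceeds the cores available on this instance")
    budget = (mem_gib - reserved_gib) / executors  # heap + overhead per executor
    heap = budget / (1 + OVERHEAD_FACTOR)
    if OVERHEAD_FACTOR * heap < MIN_OVERHEAD_GIB:
        # The 384 MiB floor dominates, so the heap gets whatever is left.
        heap = budget - MIN_OVERHEAD_GIB
    overhead = max(MIN_OVERHEAD_GIB, OVERHEAD_FACTOR * heap)
    return ExecutorPlan(executors, round(heap, 2), round(overhead, 2))


for name in ("c7g.4xlarge", "c7g.8xlarge"):
    print(name, plan_executors(name))
```

Under these assumptions, doubling the instance size mostly doubles the number of executors per node rather than the heap per executor. Whether an increased executor memory setting fits therefore depends on how many executors are packed onto each node.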
The relevant files are the two attached stack traces (Case 1.txt and Case 2.txt) and any configuration files or logs for the failing job, in particular the Spark memory settings that were changed alongside the instance upgrade. The Apache Hudi project's README.md may also provide general context.
As for the relevant actions:
Remember to always take a backup before making any major change to your production environment.
Describe the problem you faced
We encountered errors in the production pipeline yesterday when we upgraded our EC2 instance type from c7g.4xlarge to c7g.8xlarge.
To Reproduce
Steps to reproduce the behavior: Upgrade the EC2 instance to c6i.8xlarge and increase Spark memory.
Expected behavior
Job runs successfully with improved performance.
Environment Description
Hudi version : 0.13.0
Spark version : 3.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : No
Additional context
Once we reverted the EC2 upgrade, the job started working properly again.
Stacktrace
Please find attached the change we made and the errors we got:
Case 1.txt Case 2.txt