numberlabs-developers / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT]AWS EMR 6.7.0 & HUDI MoR Write #76

Open numberlabs-developers opened 9 months ago

numberlabs-developers commented 9 months ago

I'm getting

23/08/29 08:32:51 ERROR Client: Application diagnostics message: User class threw exception:
java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
    at org.apache.hudi.MergeOnReadSnapshotRelation.$anonfun$buildSplits$2(MergeOnReadSnapshotRelation.scala:127)
    at scala.Option.map(Option.scala:230)
    at org.apache.hudi.MergeOnReadSnapshotRelation.$anonfun$buildSplits$1(MergeOnReadSnapshotRelation.scala:125)

I'm using EMR 6.7.0 and these libraries in my .jar (program):

implementation "org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.1"
implementation "org.scala-lang:scala-library:2.12.8"
implementation "org.apache.spark:spark-core_2.12:3.2.1"
implementation "org.apache.spark:spark-sql_2.12:3.2.1"
implementation "org.apache.hadoop:hadoop-aws:3.2.1"

It is interesting because all I changed is the input that the program reads (and, of course, the code that reads the MoR table ;) ), and the program writes to MoR.

If it reads CoW files from S3, it works. If it reads MoR files from S3, it throws the exception above. Any clue? I've seen people suggesting EMR 6.9.0... https://github.com/apache/hudi/issues/8903#issuecomment-1624977292 but I'd like to see if the issue can be resolved on EMR 6.7.0, so that I don't have to upgrade all the libraries in my project :/
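As a packaging aside on the dependency list above (a general note, not a confirmed root cause for this error): on EMR the cluster already provides Spark and Hadoop at runtime, so those artifacts should not be bundled into the application jar. In Gradle that means declaring them `compileOnly`, roughly:

```groovy
dependencies {
    // Shipped with the application jar
    implementation "org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.1"
    implementation "org.scala-lang:scala-library:2.12.8"

    // Provided by the EMR runtime: compile against these versions,
    // but don't package them, to avoid mixing two Spark builds
    compileOnly "org.apache.spark:spark-core_2.12:3.2.1"
    compileOnly "org.apache.spark:spark-sql_2.12:3.2.1"
    compileOnly "org.apache.hadoop:hadoop-aws:3.2.1"
}
```

Note this alone may not fix the NoSuchMethodError if EMR's Spark build differs from open-source Spark 3.2.1 in the PartitionedFile constructor; in that case the Hudi bundle that ships with the EMR release (rather than the open-source bundle) is the safer match.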

Environment Description

Hudi version : 0.11.1

Spark version : 3.2.1

Hadoop version : 3.2.1

Storage (HDFS/S3/GCS..) : S3

Running on Docker? (yes/no) : no

torvalds-dev-testbot[bot] commented 9 months ago

torvalds.dev is analyzing the ticket

torvalds-dev-testbot[bot] commented 9 months ago

The user is experiencing a java.lang.NoSuchMethodError when working with Hudi Merge-on-Read (MoR) tables on AWS EMR 6.7.0. The error occurs when reading MoR tables from S3, but not when reading Copy-on-Write (CoW) tables. The user is trying to avoid upgrading to EMR 6.9.0, which has been suggested as a solution in a GitHub thread.

Their environment includes Hudi 0.11.1, Spark 3.2.1, and Hadoop 3.2.1, with data stored on S3 and the job not running on Docker.

Relevant files provided by the user:

  1. packaging/hudi-utilities-slim-bundle/README.md: This readme file contains instructions on using the hudi-utilities-slim-bundle, which is meant to be used together with Hudi Spark bundle if using hudi-utilities-bundle solely introduces problems for a specific Spark version. It includes examples of running Hudi Streamer for Spark versions 2.4.7, 3.1.2, and 3.2.0.

  2. README.md: This is the main readme file for Apache Hudi, providing a general overview of Hudi and its capabilities, along with instructions for building from source, running tests, and a quick start guide.

Possible actions to take:

  1. Review the methods involved in the failing operation, focusing on org.apache.spark.sql.execution.datasources.PartitionedFile.<init> and org.apache.hudi.MergeOnReadSnapshotRelation.$anonfun$buildSplits, as they appear in the error's stack trace.
  2. Check if the versions of Hudi, Spark, and Hadoop in use are compatible with each other and with EMR 6.7.0.
  3. Test the write operation with different versions of Hudi, Spark, or Hadoop to see if the issue persists.
  4. Review the usage of the hudi-utilities-slim-bundle, as detailed in its readme file, to ensure it's being used correctly.
  5. If possible, test with EMR 6.9.0 (in a development or test environment, not production) to confirm if upgrading EMR does in fact resolve the issue.
  6. Reach out to the Hudi community or AWS support for further help diagnosing and resolving the issue.
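Regarding item 1 above: the string after PartitionedFile.<init> in the error is a JVM method descriptor encoding the constructor signature that Hudi 0.11.1 was compiled against. It can be decoded mechanically; here is a small self-contained sketch (a hypothetical helper, not part of Hudi or Spark) that turns such a descriptor into readable parameter types:

```python
# Hypothetical helper: decode a JVM method descriptor (as seen in a
# NoSuchMethodError) into readable Java type names.

PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}

def _decode_one(desc, i):
    """Decode a single type starting at index i; return (name, next_index)."""
    dims = 0
    while desc[i] == "[":          # leading '[' marks an array dimension
        dims += 1
        i += 1
    if desc[i] == "L":             # object type: Lpkg/Class;
        end = desc.index(";", i)
        name = desc[i + 1:end].replace("/", ".")
        i = end + 1
    else:                          # single-letter primitive
        name = PRIMITIVES[desc[i]]
        i += 1
    return name + "[]" * dims, i

def decode_method_descriptor(desc):
    """Return (parameter_types, return_type) for a descriptor '(...)R'."""
    assert desc.startswith("(")
    i, params = 1, []
    while desc[i] != ")":
        name, i = _decode_one(desc, i)
        params.append(name)
    ret, _ = _decode_one(desc, i + 1)
    return params, ret

params, ret = decode_method_descriptor(
    "(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V"
)
print(params)  # the constructor shape Hudi expects: (InternalRow, String, long, long, String[])
print(ret)
```

Decoded, the descriptor says Hudi is looking for PartitionedFile(InternalRow, String, long, long, String[]). If the PartitionedFile class in the Spark build on the cluster does not expose a constructor with exactly that shape, this NoSuchMethodError is the result, which is why comparing the EMR Spark build against the open-source Spark version the Hudi bundle targets is the first thing to check.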

The user might also want to follow the Apache Hudi project on GitHub, where they can get updates on issues and pull requests that could be relevant to their problem.