scalapb / sparksql-scalapb

SparkSQL utils for ScalaPB
Apache License 2.0

NoSuchMethodError in sparksql35-scalapb0_11 after update #385

Open leoeareis opened 6 months ago

leoeareis commented 6 months ago

Hi! I upgraded my project from Spark 3.4.1 to Spark 3.5.0 and moved the ScalaPB dependency from sparksql34-scalapb0_11 to sparksql35-scalapb0_11. After this upgrade, I hit this error:

 java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke.<init>(Ljava/lang/Class;Lorg/apache/spark/sql/types/DataType;Ljava/lang/String;Lscala/collection/Seq;Lscala/collection/Seq;ZZZ)V
    at scalapb.spark.ToCatalystHelpers.fieldToCatalyst(ToCatalystHelpers.scala:165)
    at scalapb.spark.ToCatalystHelpers.fieldToCatalyst$(ToCatalystHelpers.scala:107)
    at scalapb.spark.Implicits$$anon$1.fieldToCatalyst(TypedEncoders.scala:123)
    at scalapb.spark.ToCatalystHelpers.$anonfun$messageToCatalyst$2(ToCatalystHelpers.scala:39)
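For context on what this error means (a generic illustration in plain Scala, not Spark code: `Ctor8` and `hasCtorOfArity` are made-up names): the JVM links constructor calls by their exact descriptor, so a library compiled against one `StaticInvoke` constructor signature throws `NoSuchMethodError` at call time when the runtime ships a class whose constructor has a different parameter list, even though each side compiles fine in isolation.

```scala
// Stand-in class: imagine this is the runtime's copy, compiled with an
// 8-parameter constructor (like OSS Spark 3.5's StaticInvoke).
class Ctor8(a: Int, b: Int, c: Int, d: Int, e: Int, f: Int, g: Int, h: Int)

object LinkageDemo {
  // A caller whose bytecode references a different arity fails to link at
  // runtime; here we observe the mismatch reflectively instead of crashing.
  def hasCtorOfArity(arity: Int): Boolean =
    classOf[Ctor8].getConstructors.exists(_.getParameterCount == arity)

  def main(args: Array[String]): Unit = {
    println(hasCtorOfArity(8)) // true: this constructor exists at runtime
    println(hasCtorOfArity(9)) // false: a caller compiled against a 9-arg
                               // variant would throw NoSuchMethodError
  }
}
```

This is why the jar can work on one Spark build and fail on another (e.g. a Databricks runtime with a patched `StaticInvoke`): the descriptor baked into the library's bytecode no longer exists in the runtime's class.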

I run my jobs in a Databricks environment on Runtime 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12), and my UDF that performs the protobuf decoding is defined as:

import org.apache.log4j.Logger
import org.apache.spark.sql.Column
import scalapb.spark.ProtoSQL
import example.root.root.{Event => RootEvent}
import scalapb.spark.Implicits.{messageTypedEncoder, typedEncoderToEncoder}

import scala.util.{Failure, Success, Try}

object ProtobufExample extends Serializable {

  val logger: Logger = org.apache.log4j.LogManager.getLogger(this.getClass.getSimpleName)
  val rootDecoderUdf: Column => Column = ProtoSQL.udf(decodeRootEvent _)

  def decodeRootEvent(input: Array[Byte]): Option[RootEvent] = {
    val result = Try {
      RootEvent.parseFrom(input)
    }
    result match {
      case Success(value) => Some(value)
      case Failure(e) =>
        logger.error("Decode error", e)
        None
    }
  }
}

Could you help me figure out this error?

thesamet commented 6 months ago

Let's try to isolate the problem (is it related to the Databricks environment?) and make it reproducible so I can confirm that a given solution works. Can you try to reproduce the problem outside of the Databricks environment?

It would also be really helpful if you could prepare and share a minimal project (it can be based on https://github.com/thesamet/sparksql-scalapb-test) and try to reproduce the issue both inside and outside Databricks. Since it would also include the specific protos that cause the failure, it might point in another direction.
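A minimal reproduction project along these lines needs little more than the two dependencies named in this thread. A hypothetical `build.sbt` sketch (artifact names are taken from this thread; the Scala and library version numbers are assumptions, so check Maven Central for current ones):

```scala
// Hypothetical minimal build.sbt for reproducing the issue.
// Versions below are assumptions, not pinned by this thread.
ThisBuild / scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // "provided" so the cluster's own Spark (OSS 3.5.0 or a DBR build) is used,
  // which is exactly the variable being tested here.
  "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided",
  "com.thesamet.scalapb" %% "sparksql35-scalapb0_11" % "1.0.4"
)
```

Marking Spark as `provided` matters for this bug: the failure depends on which Spark classes are on the runtime classpath, so bundling Spark into the assembly jar would mask the difference between environments.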

anamariavisan commented 2 months ago

Hello @thesamet! I tried to update a service that uses the sparksql35-scalapb0_11_2.12 dependency to Databricks 14.2 and above, and I got the following error:

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke.<init>(Ljava/lang/Class;Lorg/apache/spark/sql/types/DataType;Ljava/lang/String;Lscala/collection/Seq;Lscala/collection/Seq;ZZZ)V
    at frameless.TypedEncoder$$anon$1.toCatalyst(TypedEncoder.scala:69)
    at frameless.RecordEncoder.$anonfun$toCatalyst$2(RecordEncoder.scala:155)
    at scala.collection.immutable.List.map(List.scala:293)
    at frameless.RecordEncoder.toCatalyst(RecordEncoder.scala:153)
    at frameless.TypedExpressionEncoder$.apply(TypedExpressionEncoder.scala:28)
    at scalapb.spark.Implicits.typedEncoderToEncoder(TypedEncoders.scala:119)
    at scalapb.spark.Implicits.typedEncoderToEncoder$(TypedEncoders.scala:116)
    at scalapb.spark.Implicits$.typedEncoderToEncoder(TypedEncoders.scala:122)

This doesn't happen locally. Following your suggestion, I forked https://github.com/thesamet/sparksql-scalapb-test/tree/master to see whether the problem is related to the Databricks environment. The code can be found here: https://github.com/anamariavisan/sparksql-scalapb-test. To build the app I ran these commands:

curl -s "https://get.sdkman.io" | bash
sdk install java 11.0.24-zulu
sdk install sbt 1.6.2
sbt assembly

And to test it locally:

sdk install spark 3.5.0

spark-submit \
  --jars . \
  --class myexample.RunDemo \
  target/scala-2.12/sparksql-scalapb-test-assembly-1.0.0.jar

To test it in Databricks, I created a job and uploaded the library target/scala-2.12/sparksql-scalapb-test-assembly-1.0.0.jar, with myexample.RunDemo as the main class. The job worked when submitted locally, but on Databricks 14.2 and above it failed with:

java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.objects.StaticInvoke.<init>(Ljava/lang/Class;Lorg/apache/spark/sql/types/DataType;Ljava/lang/String;Lscala/collection/Seq;Lscala/collection/Seq;ZZZ)V
    at scalapb.spark.ToCatalystHelpers.fieldToCatalyst(ToCatalystHelpers.scala:165)
    at scalapb.spark.ToCatalystHelpers.fieldToCatalyst$(ToCatalystHelpers.scala:107)
    at scalapb.spark.ProtoSQL$$anon$1$$anon$2.fieldToCatalyst(ProtoSQL.scala:84)
    at scalapb.spark.ToCatalystHelpers.$anonfun$messageToCatalyst$2(ToCatalystHelpers.scala:39)

I searched for a fix and found these issues describing the same problem:

I also left a comment on this issue on the frameless repo https://github.com/typelevel/frameless/issues/787.

What is your planned course of action for sparksql-scalapb on this matter?

thesamet commented 2 months ago

This is not actionable by sparksql-scalapb until there's a fix for frameless on Spark 3.5 and DBR 14.2.

chris-twiner commented 2 months ago

> This is not actionable by sparksql-scalapb until there's a fix for frameless on Spark 3.5 and DBR 14.2.

FYI: the second stack trace is internal to sparksql-scalapb and stems from its own use of Spark internal APIs rather than from frameless itself. The shim-based solution proposed for frameless (#787) could also be leveraged for sparksql-scalapb's API usage (tested across all DBRs supported for frameless usage, at least).
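The shim idea can be sketched in plain Scala (illustrative only: `Target` and `CtorShim` are made-up stand-ins, not the actual shim library's API or Spark's `StaticInvoke`). Instead of baking one constructor descriptor into the caller's bytecode, the caller selects a compatible constructor reflectively at runtime, so the same jar links against both the OSS Spark 3.5 signature and a patched DBR one:

```scala
// Stand-in with two constructor arities, mimicking StaticInvoke before and
// after a runtime-internal signature change.
class Target(val tag: String, val extraFlag: Boolean) {
  def this(tag: String) = this(tag, false) // "older" signature
}

object CtorShim {
  // Pick whichever constructor matches the argument count we actually have
  // and invoke it reflectively, rather than emitting a direct `new` call
  // whose descriptor may not exist at runtime.
  def construct(args: AnyRef*): Target = {
    val ctor = classOf[Target].getConstructors
      .find(_.getParameterCount == args.length)
      .getOrElse(sys.error(s"no ${args.length}-arg constructor on Target"))
    // Constructor.newInstance unboxes wrapper arguments to primitives,
    // so java.lang.Boolean.TRUE satisfies a scala.Boolean parameter.
    ctor.newInstance(args: _*).asInstanceOf[Target]
  }
}
```

Usage: `CtorShim.construct("x")` resolves the 1-arg constructor, while `CtorShim.construct("x", java.lang.Boolean.TRUE)` resolves the 2-arg one. The trade-off of this kind of shim is a small reflective dispatch cost at construction time in exchange for one binary that tolerates multiple runtime signatures.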