jpmml / jpmml-evaluator-spark

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)
GNU Affero General Public License v3.0

Support for spark version 3.5.0? #48

Open khanjandharaiya opened 4 months ago

khanjandharaiya commented 4 months ago

Hey there! I am using the latest version 1.3.0 of jpmml-evaluator-spark, but after upgrading to the latest Spark version 3.5.0, I am getting this error:

untyped Scala UDF

ERROR org.apache.spark.ml.util.Instrumentation - org.apache.spark.sql.AnalysisException: [UNTYPED_SCALA_UDF] You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`.
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive.
3. set "spark.sql.legacy.allowUntypedScalaUDF" to "true" and use this API with caution.
    at org.apache.spark.sql.errors.QueryCompilationErrors$.usingUntypedScalaUDFError(QueryCompilationErrors.scala:3157)
    at org.apache.spark.sql.functions$.udf(functions.scala:8299)
    at org.jpmml.evaluator.spark.PMMLTransformer.transform(PMMLTransformer.scala:99)
    at org.apache.spark.ml.PipelineModel.$anonfun$transform$4(Pipeline.scala:311)
    at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:146)
    at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:139)
    at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
    at org.apache.spark.ml.PipelineModel.$anonfun$transform$3(Pipeline.scala:311)
    at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
    at org.apache.spark.ml.PipelineModel.$anonfun$transform$2(Pipeline.scala:310)
    at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:146)
    at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:139)
    at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
    at org.apache.spark.ml.PipelineModel.$anonfun$transform$1(Pipeline.scala:308)
    at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
    at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:307)

After setting "spark.sql.legacy.allowUntypedScalaUDF" to "true", it works fine.
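For reference, this is the workaround I am using. It is only a sketch of the legacy-UDF escape hatch that the error message itself suggests (option 3); the app name is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: re-enable untyped Scala UDFs so PMMLTransformer.transform()
// can keep calling the deprecated udf(f, dataType) API on Spark 3.x.
// Spark warns about this flag because untyped UDFs may silently turn
// null inputs into primitive default values (e.g. 0 for Int).
val spark = SparkSession.builder()
  .appName("pmml-evaluator-example") // placeholder name
  .config("spark.sql.legacy.allowUntypedScalaUDF", "true")
  .getOrCreate()
```

The same flag can also be passed on the command line, e.g. `spark-submit --conf spark.sql.legacy.allowUntypedScalaUDF=true`.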

Will there be an update from your side to solve this?

I found a related closed issue for Spark version 3.1.1: https://github.com/jpmml/jpmml-evaluator-spark/issues/43