jpmml / jpmml-evaluator-spark

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)
GNU Affero General Public License v3.0

support for spark 3.x ? #43

Closed lcx517 closed 2 years ago

lcx517 commented 2 years ago

Hi @vruusmann,

Is there any plan to support evaluator in spark 3.x?

Or maybe I could attempt to build jpmml-evaluator-spark on spark 3.x on my own?

Thank you

vruusmann commented 2 years ago

> Is there any plan to support evaluator in spark 3.x?

The underlying JPMML-Evaluator library and this JPMML-Evaluator-Spark wrapper library are both written in the Java language, and should therefore be totally agnostic towards Scala and Apache Spark ML versions.

According to the GitHub log, I haven't touched this codebase for three years. I wonder what has changed or broken API-wise in this timeframe?

> Or maybe I could attempt to build jpmml-evaluator-spark on spark 3.x on my own?

I haven't marked this codebase as "Archived", so I do have some interest in reviving it. But it's not a high-priority item for me personally.

Please try to deploy the current version on your target Apache Spark ML version (3.2.X perhaps?), and report back all the issues that you're experiencing. Also, if you can suggest immediate fixes to those issues, please do share those as well.

lcx517 commented 2 years ago

I rebuilt the project against Spark 3.1.1, and pmml-evaluator now runs successfully on it. I have opened a pull request, https://github.com/jpmml/jpmml-evaluator-spark/pull/44, for this version.

I encountered several compatibility problems. The last one has not been solved yet.

  1. untyped Scala UDF
    
    ERROR Instrumentation: org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
    1. use typed Scala UDF APIs(without return type parameter), e.g. udf((x: Int) => x)
    2. use Java UDF APIs, e.g. udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType), if input types are all non primitive
    3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution
        at org.apache.spark.sql.functions$.udf(functions.scala:5021)
        at org.jpmml.evaluator.spark.PMMLTransformer.transform(PMMLTransformer.scala:99)
        at org.apache.spark.ml.PipelineModel.$anonfun$transform$4(Pipeline.scala:311)
        at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:146)
        at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:139)
        at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
        at org.apache.spark.ml.PipelineModel.$anonfun$transform$3(Pipeline.scala:311)
        at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
        at org.apache.spark.ml.PipelineModel.$anonfun$transform$2(Pipeline.scala:310)
        at org.apache.spark.ml.MLEvents.withTransformEvent(events.scala:146)
        at org.apache.spark.ml.MLEvents.withTransformEvent$(events.scala:139)
        at org.apache.spark.ml.util.Instrumentation.withTransformEvent(Instrumentation.scala:42)
        at org.apache.spark.ml.PipelineModel.$anonfun$transform$1(Pipeline.scala:308)
        at org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
        at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:307)

    The Spark migration guide explains the cause:
    > In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default.

    After some searching, I worked around it by adding:

    sparkSession.sql("set spark.sql.legacy.allowUntypedScalaUDF=true")

  2. When an output column name contains ".", the transform function runs escapeColumnName(name) and wraps the column name in backquotes, which may cause an error like:

    org.apache.spark.sql.AnalysisException: No such struct field `probability(0.0)` in y, pmml(prediction), prediction, probability(0.0), probability(1.0)

    My workaround is to not add backquotes to the column name, and to replace them with underscores instead. This modification is not in my pull request, since I don't have a better idea for this problem yet.

vruusmann commented 2 years ago
> 1. untyped Scala UDF

In the JPMML-Evaluator 1.6.X development branch, the signature of the main evaluation method was changed to:

Map<String, ?> evaluate(Map<String, ?> arguments);

The value type of both the arguments map and the results map is java.lang.Object. In Java land, it is impossible to insert a primitive value (e.g. int, double) into such a Map. I don't know whether it is possible in Scala land.

The main point is that the UDF should keep null references unchanged (instead of replacing them with primitive-like 0 or 0.0 values), because the JPMML-Evaluator uses the null reference for denoting missing values.

Ideally, the Apache Spark UDF could have a signature that states: "send Map<String, Object> in, and get Map<String, Object> back. If there are any null values in the arguments or results maps, keep them as-is".
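To illustrate the contract described above, here is a minimal plain-Java sketch. It does not use the Apache Spark or JPMML-Evaluator APIs; the class NullPreservingSketch and its methods passThrough and toPrimitive are hypothetical names made up for this illustration. It only shows why an Object-valued map can carry a missing value as null, while a primitive-typed parameter cannot.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an evaluation step; not the JPMML-Evaluator API.
final class NullPreservingSketch {

    // Copies arguments into a results map without touching null values,
    // mirroring the "keep them as-is" contract.
    static Map<String, Object> passThrough(Map<String, ?> arguments) {
        Map<String, Object> results = new HashMap<>();
        for (Map.Entry<String, ?> entry : arguments.entrySet()) {
            results.put(entry.getKey(), entry.getValue()); // null stays null
        }
        return results;
    }

    // What a primitive-typed closure effectively sees: null collapses to 0.0.
    static double toPrimitive(Double boxed) {
        return (boxed != null) ? boxed : 0.0d;
    }

    public static void main(String[] args) {
        Map<String, Object> arguments = new HashMap<>();
        arguments.put("x1", 1.5d);
        arguments.put("x2", null); // missing value

        Map<String, Object> results = passThrough(arguments);
        // The missing value is still detectable in the Object-valued map
        System.out.println(results.get("x2") == null);
        // After conversion to a primitive, "missing" and 0.0 look the same
        System.out.println(toPrimitive(null));
    }
}
```

Once the value has been turned into a primitive, there is no way to tell a real 0.0 from a missing input, which is why the boxed, nullable form needs to survive the whole round trip.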

> When output columns String contains ".", transform function will run escapeColumnName(name) and add back quote to column name.

The org.jpmml.evaluator.ModelEvaluatorBuilder class has a setResultMapper(org.jpmml.evaluator.ResultMapper) method, which lets you "customize" result field names on the fly.

In the current case, you could replace the problematic dot character (.) with some other character, such as the underscore character (_), or delete it altogether:

ModelEvaluatorBuilder evaluatorBuilder = new ModelEvaluatorBuilder(...)
  .setResultMapper(new ResultMapper(){
    @Override
    public String apply(String pmmlName){
      return pmmlName.replace(".", "_");
    }
  });
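The effect of such a mapper on result field names can be tried out without any JPMML or Spark dependency. The following is only a plain-Java sketch: FieldRenameSketch, DOT_TO_UNDERSCORE and renameKeys are hypothetical names, and a java.util.function.Function stands in for the ResultMapper shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration; FieldRenameSketch and renameKeys are hypothetical names.
final class FieldRenameSketch {

    // Same idea as the mapper above: swap the dot for an underscore.
    static final Function<String, String> DOT_TO_UNDERSCORE = name -> name.replace(".", "_");

    // Rebuilds a results map with every key passed through the mapper.
    // Values, including nulls, are carried over unchanged.
    static Map<String, Object> renameKeys(Map<String, ?> results, Function<String, String> mapper) {
        Map<String, Object> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, ?> entry : results.entrySet()) {
            renamed.put(mapper.apply(entry.getKey()), entry.getValue());
        }
        return renamed;
    }

    public static void main(String[] args) {
        Map<String, Object> results = new LinkedHashMap<>();
        results.put("prediction", 1.0d);
        results.put("probability(0.0)", 0.25d);
        results.put("probability(1.0)", 0.75d);

        // prints [prediction, probability(0_0), probability(1_0)]
        System.out.println(renameKeys(results, DOT_TO_UNDERSCORE).keySet());
    }
}
```

With names like these, the column no longer contains a dot, so it should not need backquote escaping on the Spark side. Note that a replacement of this kind can make two distinct field names collide (for example "a.b" and "a_b"), so it is worth checking the model's output field names first.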

IIRC, the whole model evaluator builder pattern wasn't properly integrated into this codebase (three years ago). I should do it here and now.

vruusmann commented 2 years ago

This issue is by no means done (aka closed) - I haven't written a single line of code yet!

lcx517 commented 2 years ago

Oh, sorry. I'm looking forward to your new version~