jpmml / jpmml-evaluator-spark

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)
GNU Affero General Public License v3.0

Add support for Spark 2.0 #4

Closed. viirya closed this 7 years ago

viirya commented 7 years ago

This PR adds support for Spark 2.0. I have tested it with the Spark 2.0.2 release, like this:

bin/spark-submit --master local --class org.jpmml.spark.EvaluationExample pmml-spark-example/target/example-1.0-SNAPSHOT.jar DecisionTreeIris.pmml Iris.csv /tmp/DecisionTreeIris

And it works.

viirya commented 7 years ago

ping @vruusmann Could you take a look at this and see whether it is good to merge? Thanks.

vruusmann commented 7 years ago

I don't want to jump to Spark 2.0.X until there is an official release for Spark 1.5.X/1.6.X available. By "official release" I mean something that has a stable API (in terms of TransformerBuilder functionality) and has been pushed to the Maven Central repository. The goal is to minimize the difference between the Spark 1.5.X/1.6.X and 2.0.X codebases, so that it is easier to keep them in sync for extended periods of time.

Some things that need more attention/work:

  1. The internals of the TransformerBuilder class are too complex. Once the JPMML-Evaluator dependency has been upgraded to 1.3.3 (which relaxes the visibility of org.jpmml.evaluator.ModelField subclasses), it will be possible to collapse many of the org.jpmml.spark.*ColumnProducer classes.
  2. The returned Transformer instance should implement some kind of interface HasModelFields, which would let you easily query the names/types of input and result columns. For example, HasModelFields#getInputCols() and HasModelFields#getResultCols(). You can use this information to check if the Transformer object is "logically compatible" with the argument DataFrame object or not.
  3. The classes PMMLTransformer and ColumnPruner should be translated from Java to Scala (IIRC, the latest versions of Apache Spark should already have a reusable ColumnPruner transformation built in). This involves tweaking the Apache Maven build (e.g. to invoke the scalac compiler).
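
To make item 2 more concrete, here is a minimal sketch of how such an interface might be used for a "logical compatibility" check. Only the names HasModelFields, getInputCols() and getResultCols() come from the list above; the return type (List<String>), the helper methods, the class name ModelFieldsSketch and the Iris column names are assumptions made for illustration, not an existing JPMML-Evaluator-Spark API. The sketch works on plain column-name lists so that it runs without a Spark dependency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface, named after the suggestion in item 2.
// The List<String> return type is an assumption.
interface HasModelFields {
    List<String> getInputCols();
    List<String> getResultCols();
}

public class ModelFieldsSketch {

    // Returns the input columns that the model requires but the data set does not provide.
    static List<String> missingInputCols(HasModelFields model, List<String> dataCols) {
        List<String> missing = new ArrayList<>(model.getInputCols());
        missing.removeAll(dataCols);
        return missing;
    }

    // A data set is treated as "logically compatible" when no required input column is missing.
    static boolean isCompatible(HasModelFields model, List<String> dataCols) {
        return missingInputCols(model, dataCols).isEmpty();
    }

    // Illustrative stand-in for a transformer; the column names are made up for this example.
    static HasModelFields irisModel() {
        return new HasModelFields() {
            @Override
            public List<String> getInputCols() {
                return Arrays.asList("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width");
            }

            @Override
            public List<String> getResultCols() {
                return Arrays.asList("Species");
            }
        };
    }

    public static void main(String[] args) {
        HasModelFields model = irisModel();
        List<String> complete = Arrays.asList("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width");
        List<String> partial = Arrays.asList("Sepal_Length", "Sepal_Width");

        System.out.println("complete data set compatible: " + isCompatible(model, complete));
        System.out.println("partial data set compatible: " + isCompatible(model, partial));
        System.out.println("missing from partial data set: " + missingInputCols(model, partial));
    }
}
```

With Spark, the list of available column names would be taken from the argument DataFrame (for example via its columns() method) before calling the check.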

I can easily do the first two items. However, I will have difficulties with the third item, because my working experience with Scala is very minimal. If you want to help keep things moving, then you could submit another PR in that area.

viirya commented 7 years ago

OK, got it. I'll close this now.