RandomForestClassificationModels use incorrect multipleModelMethod

fractaloop commented 6 years ago

When serializing a RandomForestClassificationModel (inside a PipelineModel), the resulting PMML uses average for the Segmentation instead of majorityVote.

Ref: RandomForestClassifier.scala

vruusmann commented 6 years ago

Apache Spark includes two random forest implementations:
1) Class RandomForest in MLlib: https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/mllib/tree/RandomForest.html
2) Classes RandomForestClassifier and RandomForestRegressor in ML: https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/ml/classification/RandomForestClassifier.html and https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/ml/regression/RandomForestRegressor.html

As the name suggests, the JPMML-SparkML library targets the "ML" implementation. All JPMML-SparkML converters are fully covered by integration tests, and they reproduce Apache Spark ML predictions within a 1e-15 error margin.

If you are looking to export the "MLlib" implementation, then you should simply use Apache Spark's built-in PMMLExportable trait.

fractaloop commented 6 years ago

I am only referring to the DataFrame-based models in org.apache.spark.ml that JPMML-SparkML is designed for. Why does the SparkML source say it's majority vote, but average seems to work?

vruusmann commented 6 years ago

Why does the SparkML source say it's majority vote, but average seems to work?

Maybe you're being misled by the "votes" variable name.

The expression votes(i) += classCounts(i) / classCounts.sum sums the probability distributions of the individual trees; the summed distribution is finally divided by the number of trees. This can and should be interpreted as the "average" of the probability distributions.

In that sense, the PMML language is capturing the intent of the algorithm better than Scala code.
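To make the difference concrete, here is a minimal sketch in plain Scala (no Spark dependency) that contrasts the two aggregation schemes. It is not Spark's actual code: the object ForestAggregation, its method names, and the input layout (one array of leaf class counts per tree) are assumptions made for illustration. Only the accumulation line mirrors the expression quoted above.

```scala
// Hypothetical helper, not part of Spark or JPMML-SparkML.
object ForestAggregation {

  // "average": normalize each tree's leaf class counts into a probability
  // distribution, sum the distributions, then divide by the number of trees.
  def averageProbabilities(leafCounts: Seq[Array[Double]]): Array[Double] = {
    val numClasses = leafCounts.head.length
    val votes = new Array[Double](numClasses)
    for (classCounts <- leafCounts) {
      val total = classCounts.sum
      if (total != 0) {
        var i = 0
        while (i < numClasses) {
          votes(i) += classCounts(i) / total // same shape as the quoted expression
          i += 1
        }
      }
    }
    votes.map(_ / leafCounts.size)
  }

  // "majorityVote": each tree casts one vote for its most frequent class,
  // and the class with the most votes wins (ties are broken arbitrarily).
  def majorityVote(leafCounts: Seq[Array[Double]]): Int = {
    val winners = leafCounts.map(counts => counts.indexOf(counts.max))
    winners.groupBy(identity).maxBy(_._2.size)._1
  }
}
```

For three trees with leaf counts (6, 4), (6, 4) and (1, 9), the per-tree distributions are (0.6, 0.4), (0.6, 0.4) and (0.1, 0.9). Averaging gives roughly (0.43, 0.57), so class 1 wins, whereas a hard majority vote picks class 0 because two of the three trees prefer it. The two methods can therefore disagree, which is why the choice of multipleModelMethod matters.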

jpmml / jpmml-sparkml, issue #36