Closed fractaloop closed 6 years ago
Apache Spark includes two random forest implementations:
1) Class RandomForest in MLlib: https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/mllib/tree/RandomForest.html
2) Classes RandomForestClassifier and RandomForestRegressor in ML: https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/ml/classification/RandomForestClassifier.html and https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/ml/regression/RandomForestRegressor.html
As the name suggests, the JPMML-SparkML library targets the "ML" implementation. All JPMML-SparkML converters are fully covered by integration tests, and they reproduce Apache Spark ML predictions within a 1e-15 error margin.
If you are looking to export the "MLlib" implementation, then you should simply use Apache Spark's built-in PMMLExportable trait.
I am only referring to the DataFrame-based models in org.apache.spark.ml that JPMML-SparkML is designed for. Why does the SparkML source say it's majority vote, but average seems to work?
Why does the SparkML source say it's majority vote, but average seems to work?
Maybe you're being misled by the "votes" variable name.
The expression votes(i) += classCounts(i) / classCounts.sum sums the per-tree probability distributions; that sum is finally divided by the number of trees. This can/should be interpreted as the "average" of the probability distributions.
In that sense, the PMML language is capturing the intent of the algorithm better than Scala code.
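To make the difference concrete, here is a minimal plain-Scala sketch (no Spark dependency) that contrasts the accumulation pattern from the expression quoted above with a hard majority vote. The object name VoteSketch, both helper functions, and the leaf counts in the usage note are hypothetical and chosen for illustration only; this is not Spark's or JPMML-SparkML's actual code.

```scala
// Illustrative sketch only: VoteSketch and its helpers are hypothetical names,
// not part of Apache Spark or JPMML-SparkML.
object VoteSketch {

  // Sum each tree's normalized class distribution (the same pattern as
  // votes(i) += classCounts(i) / classCounts.sum), then divide by the tree count.
  def averageDistributions(leafCounts: Seq[Array[Double]]): Array[Double] = {
    val numClasses = leafCounts.head.length
    val votes = new Array[Double](numClasses)
    for (classCounts <- leafCounts) {
      val total = classCounts.sum
      for (i <- 0 until numClasses) {
        votes(i) += classCounts(i) / total
      }
    }
    votes.map(_ / leafCounts.size)
  }

  // Hard majority vote: every tree casts a single vote for its most frequent class.
  def majorityVote(leafCounts: Seq[Array[Double]]): Int = {
    val winners = leafCounts.map(c => c.indexOf(c.max))
    winners.groupBy(identity).maxBy(_._2.size)._1
  }
}
```

For example, with three trees whose reached leaves hold the made-up counts Seq(Array(6.0, 4.0), Array(6.0, 4.0), Array(0.0, 10.0)), the averaged distribution is roughly (0.4, 0.6), so class 1 wins, whereas a hard majority vote would pick class 0 because two of the three trees favor it. The two aggregation methods are therefore not interchangeable, which is why the PMML method name matters.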
When serializing a RandomForestClassificationModel (inside a PipelineModel), the resulting PMML uses average for the Segmentation instead of majorityVote.
Ref: RandomForestClassifier.scala