Open sctincman opened 6 years ago
The JPMML-SparkML library assumes that the label column of classification models is a "native" categorical label (in PMML, corresponds to a DataDictionary/DataField
element), not a "transformed" categorical label (corresponds to a TransformationDictionary/DerivedField element).
I've been taking it for granted, and forgot to actually implement this "native" vs "transformed" check around
It's possible to make your example work by applying the Binarizer transformation to the dataset outside of the pipeline, and then treating its output column "DepDelay_Bin" as a "native" categorical label:
binarizer = Binarizer(threshold=15.0, inputCol="DepDelay_Double", outputCol="DepDelay_Bin")
data2007 = binarizer.transform(data2007) # THIS!
stringIndexer = StringIndexer(inputCol="DepDelay_Bin", outputCol="DepDelay_Bin_Label") # THIS!
featuresAssembler = VectorAssembler(inputCols=["Month", "CRSDepTime", "Distance"], outputCol="features")
rfc3 = RandomForestClassifier(labelCol="DepDelay_Bin_Label", featuresCol="features", numTrees=3, maxDepth=5, seed=10305)
pipelineRF3 = Pipeline(stages=[stringIndexer, featuresAssembler, rfc3]) # THIS: start the pipeline with StringIndexer not Binarizer
model3 =
from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, data2007, model3)
Technically, it shouldn't be much work to make JPMML-SparkML work with "transformed" labels, so keeping this issue open to track progress towards this functionality.
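For readers unsure what the precomputed "DepDelay_Bin" column contains, here is a minimal plain-Python sketch of the thresholding step. It assumes Spark ML's Binarizer semantics (values strictly greater than the threshold become 1.0, everything else 0.0); the `binarize` helper and the sample delay values are made up for illustration and are not part of JPMML-SparkML.

```python
def binarize(value, threshold=15.0):
    """Illustrative stand-in for Binarizer: > threshold -> 1.0, otherwise 0.0."""
    return 1.0 if value > threshold else 0.0

# Hypothetical departure delays in minutes (not taken from the real dataset).
dep_delays = [3.0, 15.0, 42.0]

# This is the kind of column that the workaround materializes *before* the
# pipeline starts, so that the converter sees it as an ordinary input column.
dep_delay_bin = [binarize(v) for v in dep_delays]
print(dep_delay_bin)
```

Because the column already exists in the dataset when the pipeline is fitted, the pipeline itself only needs to index it, which is what starting the stages with StringIndexer achieves.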
Looks like it can be closed for current version:
Binarizer binarizer = new Binarizer()
StringIndexer labelIndexer = new StringIndexer()
VectorAssembler vectorAssembler = new VectorAssembler()
.setInputCols(new String[]{
RandomForestClassifier classifier = new RandomForestClassifier()
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{binarizer, labelIndexer, vectorAssembler, classifier});
PipelineModel model =;
PMMLBuilder builder = new PMMLBuilder(schema, model);
final PMML build =;
JAXBUtil.marshalPMML(build, new StreamResult(System.out));
Looks like it can be closed for current version
Nope, I'd like to be able to use Sepal_Length_Binar_ as the label column here.
Can someone help me with this error: AttributeError: 'Pipeline' object has no attribute '_transfer_param_map_to_java'? I get it when I try to execute PMMLBuilder():
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1')
paramGrid = (ParamGridBuilder()
.addGrid(dt.maxDepth, [1, 2, 6])
.addGrid(dt.maxBins, [570, 570])
stages += [dt]
pipeline = Pipeline(stages=stages)
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3)
cvModel =
train_dataset = cvModel.transform(dataSet)
pmmlBuilder = PMMLBuilder(spark, dataSet, cvModel) \
.putOption(dt, "compact", True)
I cannot find any fix for this. What am I doing wrong?
AttributeError: 'Pipeline' object has no attribute '_transfer_param_map_to_java' error
This is clearly a low-level PySpark error, which has got nothing to do with PySpark2PMML or JPMML-SparkML.
Maybe your PySpark and Apache Spark versions are out of sync.
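A quick way to rule out a version mismatch is to compare the major.minor parts of the two version strings. The sketch below is only an illustration: `major_minor` and `versions_in_sync` are hypothetical helper names, and comparing major.minor is an assumed rule of thumb, not an official compatibility policy.

```python
def major_minor(version):
    """Return (major, minor) as integers from a string such as '2.4.5'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])

def versions_in_sync(pyspark_version, spark_version):
    """True when both version strings share the same major.minor prefix."""
    return major_minor(pyspark_version) == major_minor(spark_version)

# In a live session the two inputs would typically come from
# pyspark.__version__ and spark.version (spark being the SparkSession).
print(versions_in_sync("2.4.5", "2.4.0"))
print(versions_in_sync("2.3.1", "2.4.0"))
```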
@vruusmann Thank you. My PySpark and Apache Spark versions are up to date. The problem was that you must pass the fitted pipeline's best model to PMMLBuilder; in my case, passing cvModel.bestModel did the trick.
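The fix described above can be sketched as a small helper that unwraps a tuning result before it is handed to the converter. This is a hedged illustration: `unwrap_best_model` is a hypothetical name, and the stub classes only stand in for a fitted CrossValidatorModel and PipelineModel so that the snippet runs without Spark. The only real attribute relied upon is `bestModel`.

```python
def unwrap_best_model(model):
    """Follow .bestModel links until reaching a model that has none."""
    while hasattr(model, "bestModel"):
        model = model.bestModel
    return model

# Stubs for illustration only; real code would use the fitted Spark objects.
class FakePipelineModel:
    pass

class FakeCrossValidatorModel:
    def __init__(self, best_model):
        self.bestModel = best_model

fitted_pipeline = FakePipelineModel()
cv_model = FakeCrossValidatorModel(fitted_pipeline)

# The cross-validation wrapper is peeled off; a plain model passes through.
print(unwrap_best_model(cv_model) is fitted_pipeline)
print(unwrap_best_model(fitted_pipeline) is fitted_pipeline)
```

With real objects, the call from the question would then become PMMLBuilder(spark, dataSet, unwrap_best_model(cvModel)), which is equivalent to passing cvModel.bestModel directly.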
@vruusmann Sorry for the off-topic; I will delete the question. But now I run into another issue: when I try to call buildFile on the pmmlBuilder object, it fails with: format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling o57101.buildFile. : java.lang.IllegalArgumentException: Expected 3 target categories, got 2 target category, raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.IllegalArgumentException: 'Expected 3 target categories, got 2 target categories'. I cannot understand why. Do you have a clue?
Running Spark 2.1.2, using jpmml-sparkml 1.2.7.
While attempting to run the following pyspark in order to convert a simple pipeline with a
model with either toPMMLByteArray
, I'm receiving a NullPointerException. Following #22, I attempted to use the different Indexers on the feature and label columns to try to hint that these are categorical, but this resulted in the same error. Further, when I print the final tree, I do not see categorical feature declarations.
Dataset used, and tree output attached. rfc.txt