jpmml / pyspark2pmml

Python library for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

PyPMML is hiding secondary result fields #37

Closed jtzhang17 closed 2 years ago

jtzhang17 commented 2 years ago

I have a PySpark XGBoost pipelineModel, and it was saved as PMML in the following way:

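from pyspark.ml import Pipeline
from pyspark2pmml import PMMLBuilder

# sc is the active SparkContext, df is the training DataFrame
pipelineModel = Pipeline(stages=pipeline_stages).fit(df)
pmml_builder = PMMLBuilder(sc, df, pipelineModel)
pmml_builder.buildFile("trained_xgb_model.pmml")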

The saved PMML model was loaded using the pypmml-spark package, and a testing data set was applied to the loaded model. However, the final results always contain one prediction column, but never include the probability or rawPrediction columns.

from pypmml_spark import ScoreModel

model = ScoreModel.fromFile(model_name)
df_pred = model.transform(df_test)
df_pred.show(5)

Can someone share an example in which a model saved by pyspark2pmml produces the probability column in the model evaluation results?

vruusmann commented 2 years ago

The saved PMML model was loaded using the pypmml-spark package

The PyPMML library suppresses secondary result fields by default. I have zero control over this behaviour.

Can someone share an example in which a model saved by pyspark2pmml produces the probability column in the model evaluation results?

Please take your issue to the PyPMML project. It does not belong here.

jtzhang17 commented 2 years ago

Could you please explain a bit more about what it means that the PyPMML library suppresses secondary result fields by default? Do you mean that the probability column is a secondary result? What does "secondary result" mean? Thanks!

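
For readers with the same question: in PMML, additional values such as per-class probabilities are declared as OutputField elements inside a model's Output element. "Secondary result fields" most likely refers to these, though that reading is an assumption and is not spelled out in this thread. One way to check whether an exported file declares probability outputs, independent of any scoring library, is to list those elements with the Python standard library. The sketch below runs on a hand-written, hypothetical PMML fragment; the field names in a real pyspark2pmml export may differ.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def list_output_fields(pmml_text):
    """Return (name, feature, value) for every OutputField in a PMML document."""
    root = ET.fromstring(pmml_text)
    fields = []
    for elem in root.iter():
        if local_name(elem.tag) == "OutputField":
            fields.append((elem.get("name"), elem.get("feature"), elem.get("value")))
    return fields


# Hypothetical fragment for illustration only, not taken from a real export.
SAMPLE_PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <MiningModel functionName="classification">
    <Output>
      <OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
      <OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
    </Output>
  </MiningModel>
</PMML>"""

for name, feature, value in list_output_fields(SAMPLE_PMML):
    print(name, feature, value)
```

To check a real export, pass the text of the saved file (for example trained_xgb_model.pmml from the original post) to list_output_fields. If probability fields are listed there, that suggests the file itself carries them and the missing columns come from the scoring side.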