jpmml / pyspark2pmml

Python library for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Read in model without spark context #36

Closed DuongVu39 closed 2 years ago

DuongVu39 commented 2 years ago

I created a SparkML pipeline and saved it out as instructed in the README example. When reading the PMML model back in, will I be able to do so without a spark context?

The example from README.md requires the spark context, as shown below:

from pyspark2pmml import PMMLBuilder

classifierModel = pipelineModel.stages[1]

pmmlBuilder = PMMLBuilder(sc, df, pipelineModel) \
    .putOption(classifierModel, "compact", False) \
    .putOption(classifierModel, "estimate_featureImportances", True)

pmmlBuilder.buildFile("DecisionTreeIris.pmml")
vruusmann commented 2 years ago

When reading the pmml model in, will I be able to do so without spark context?

The spark context is typically needed for loading the pipeline object from file into memory:

SparkSession sparkSession = ...;

MLReader<PipelineModel> mlReader = new PipelineModel.PipelineModelReader();
// THIS!
mlReader.session(sparkSession);

PipelineModel pipelineModel = mlReader.load(tmpPipelineDir.getAbsolutePath());

After that, the underlying Java converter component org.jpmml.sparkml.PMMLBuilder has no further use for it.

The example from README.md requires the spark context, as shown below

This example is about using JPMML-SparkML in a PySpark environment. If you check the source code of the pyspark2pmml.PMMLBuilder class, you can see that the spark context is used only for obtaining a handle to the active JVM runtime (sc._jvm).

TLDR: Figure out your exact application scenario - are you working in Java/Scala, or Python/PySpark? Do you already have an Apache Spark instance running somewhere at the time of conversion?

You can always start a local/temporary spark context if nothing else works.