jpmml / pyspark2pmml

Python library for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

AttributeError: 'PMMLPipeline' object has no attribute '_to_java' #11

Closed okoutb closed 6 years ago

okoutb commented 6 years ago

I trained a RandomForestRegressor and want to export it to PMML, but I get an error:


from pyspark.ml import Pipeline
from pyspark.ml.linalg import VectorUDT
from pyspark.ml.regression import RandomForestRegressor
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark2pmml import PMMLBuilder

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.load(tsv_data_path, format="csv", sep=",", header="true")
df_n = df.rdd.map(lambda row: LabeledPoint(row[label_idx], row[:label_idx-1])).toDF(["features", "label"])
as_ml = udf(lambda v: v.asML() if v is not None else None, VectorUDT())
result = df_n.withColumn("features", as_ml("features"))
(trainingData1, testData1) = result.randomSplit([0.8, 0.2])

rf = RandomForestRegressor(numTrees=2, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=2, maxBins=32)
pipelineSpark = Pipeline(stages = [rf])
pipelineModelSpark = pipelineSpark.fit(trainingData1)

pmmlBuilder = PMMLBuilder(spark, trainingData1, pipelineModelSpark).putOption(regressor, "compact", True)

And the error is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-64-b1c78ce59837> in <module>()
      1 from pyspark2pmml import PMMLBuilder
      2 
----> 3 pmmlBuilder = PMMLBuilder(spark, trainingData1, pipelineModelSpark).putOption(regressor, "compact", True)

~/.local/lib/python3.6/site-packages/pyspark2pmml/__init__.py in __init__(self, sc, df, pipelineModel)
     12                 javaSchema = javaDf.schema.__call__()
     13                 javaPipelineModel = pipelineModel._to_java()
---> 14                 javaPmmlBuilder = sc._jvm.org.jpmml.sparkml.PMMLBuilder(javaSchema, javaPipelineModel)
     15                 if(not isinstance(javaPmmlBuilder, JavaObject)):
     16                         raise RuntimeError("JPMML-SparkML not found on classpath")

TypeError: 'JavaPackage' object is not callable

What could the reason be?

vruusmann commented 6 years ago

What could the reason be?

Very difficult to say what's going on, because your sample code doesn't match the exception.

AttributeError: 'PMMLPipeline' object has no attribute '_to_java'

Where is class PMMLPipeline defined? There is such a class in the sklearn2pmml package, but you cannot use a Scikit-Learn wrapper class with Apache Spark.

PMMLBuilder(spark, trainingData1, pipeline)

The third argument must be of type org.apache.spark.ml.PipelineModel. Therefore, replace pipeline with pipelineModel there.

The README file of this project provides a complete and correct example. Please run it first, in order to get a better understanding of how things should be put together.

okoutb commented 6 years ago

@vruusmann Sorry, I made a mistake when copying. Please see my edit with the correct error.

vruusmann commented 6 years ago

TypeError: 'JavaPackage' object is not callable

It typically means that the Java class org.jpmml.sparkml.PMMLBuilder is not available on your PySpark session's classpath.
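One way to check this before constructing a PMMLBuilder is to look at what Py4J returns for the class name. When Py4J cannot load a dotted name as a class, it hands back a JavaPackage placeholder, and calling that placeholder is what raises the TypeError above. The following is a minimal sketch, not part of PySpark2PMML: the helper name jvm_class_available is hypothetical, and it compares type names instead of importing py4j so that it stays self-contained.

```python
def jvm_class_available(jvm, class_name):
    """Return True if `class_name` resolves to a Java class on the JVM view.

    Hypothetical helper (not part of PySpark2PMML). `jvm` is expected to be
    a Py4J JVM view, such as `spark.sparkContext._jvm`.
    """
    obj = jvm
    # Walk the dotted name one segment at a time, as Py4J attribute access does.
    for part in class_name.split("."):
        obj = getattr(obj, part)
    # Py4J returns a JavaClass for a loadable class and a JavaPackage
    # placeholder for anything it could not resolve as a class.
    return type(obj).__name__ == "JavaClass"
```

With a live session this could be called as jvm_class_available(spark.sparkContext._jvm, "org.jpmml.sparkml.PMMLBuilder"). A False result suggests that the JPMML-SparkML JAR was not added when the PySpark session was started.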

However, I'm not sure that "typically" applies in this case, because the state of your current PySpark session is completely messed up (JPMML-SparkML and JPMML-SkLearn classes are probably mixed together).

Again, please do exactly as the PySpark2PMML package README file tells you to do, starting with a fresh PySpark session. As long as you haven't got this basic exercise right, there is absolutely no point in trying anything more complex.