jpmml / pyspark2pmml

Python library for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

py4j.Py4JException: Constructor org.jpmml.sparkml.PMMLBuilder does not exist #13

Closed DotaArtist closed 5 years ago

DotaArtist commented 6 years ago

I don't know why the constructor org.jpmml.sparkml.PMMLBuilder does not exist.

spark-submit \
    --jars jpmml-sparkml-executable-1.4.5.jar \

py4j.Py4JException: Constructor org.jpmml.sparkml.PMMLBuilder([class org.apache.spark.sql.types.StructType, class org.apache.spark.ml.classification.LogisticRegression]) does not exist
        at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
        at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
        at py4j.Gateway.invoke(Gateway.java:235)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)
vruusmann commented 6 years ago

Your code is looking for a constructor PMMLBuilder(StructType, LogisticRegression) (note the second argument, a LogisticRegression), which indeed does not exist.

However, there is a constructor PMMLBuilder(StructType, PipelineModel) (note the second argument, a PipelineModel). Fit a Pipeline and pass the resulting PipelineModel instead of the bare estimator.

DotaArtist commented 6 years ago

Solved. Thanks very much for your timely reply! @vruusmann

DotaArtist commented 6 years ago

Another error happened when I use a PipelineModel:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark2pmml import PMMLBuilder

parsed_data = lib_svm.rdd.map(lambda line: (line[0], process_libsvm(line[1])))\
        .map(lambda x: (x[0], Vectors.sparse(x[1][0], x[1][1], x[1][2])))
training = spark.createDataFrame(parsed_data, ["label", "features"])
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[lr])
pipelineModel = pipeline.fit(training)
pmmlBuilder = PMMLBuilder(spark.sparkContext, training, pipelineModel).putOption(lr, "compact", True)
pmmlBuilder.buildFile("lr.pmml")

The type of the features column is a sparse vector (Vectors.sparse).
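For context, the process_libsvm helper is not shown in the thread; a minimal sketch of what such a helper might look like, assuming it parses a libsvm-style "index:value" feature string into the (size, indices, values) triple that Vectors.sparse() expects (the function name and the feature-space size of 12000 are assumptions based on the code and training.show() output here):

```python
def process_libsvm(features_str, size=12000):
    """Parse a libsvm-style feature string, e.g. "0:1.0 3:2.5",
    into the (size, indices, values) triple consumed by Vectors.sparse()."""
    indices = []
    values = []
    for pair in features_str.strip().split():
        index, value = pair.split(":")
        indices.append(int(index))
        values.append(float(value))
    return size, indices, values
```

For example, process_libsvm("0:1.0 3:2.5") returns (12000, [0, 3], [1.0, 2.5]).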

The output of training.show() is:
+-----+--------------------+
|label|            features|
+-----+--------------------+
|    0|(12000,[0,1,2,3,8...|
|    1|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,6...|
|    0|(12000,[0,1,8,9,1...|
|    0|(12000,[0,1,2,3,6...|
|    0|(12000,[0,1,2,3,8...|
|    0|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,8...|
|    0|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,4...|
|    1|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,4...|
|    0|(12000,[0,1,2,3,4...|
+-----+--------------------+
Traceback (most recent call last):
  File "/root/process_libsvm_ml.py", line 70, in load_lib_svm_v5
    pmmlBuilder.buildFile("lr.pmml")
  File "/root/.local/lib/python2.7/site-packages/pyspark2pmml/__init__.py", line 28, in buildFile
    javaFile = self.javaPmmlBuilder.buildFile(javaFile)
  File "/opt/spark-hive-2.1.0/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/spark-hive-2.1.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 79, in deco
pyspark.sql.utils.IllegalArgumentException: u'Expected string, integral, double or boolean type, got vector type'

I guess PipelineModel cannot support the vector type, but ml.classification.LogisticRegression can:

lr = LogisticRegression(maxIter=10, regParam=0.01)
model1 = lr.fit(training)

@vruusmann @yap770813901