jpmml / pyspark2pmml

Python library for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

IllegalArgumentException: 'Expected string, integral, double or boolean type, got vector type' #18

Closed. yashwanthmadaka24 closed this issue 5 years ago

yashwanthmadaka24 commented 5 years ago
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

df = spark.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(input_dir+'stroke_100K_1.csv')
classifier = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=25, minInstancesPerNode=30, impurity="gini")
pipeline = Pipeline(stages=[classifier])
pipelineModel = pipeline.fit(trainingData)

from pyspark2pmml import PMMLBuilder
pmmlBuilder = PMMLBuilder(sc, trainingData, pipelineModel) \
    .putOption(classifier, "compact", True)
pmmlBuilder.buildFile("DecisionTree.pmml")

The above block of code throws the following error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling o494.buildFile.
: java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type
    at org.jpmml.sparkml.SparkMLEncoder.createDataField(SparkMLEncoder.java:169)
    at org.jpmml.sparkml.SparkMLEncoder.getFeatures(SparkMLEncoder.java:76)
    at org.jpmml.sparkml.ModelConverter.encodeSchema(ModelConverter.java:146)
    at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:169)
    at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:116)
    at org.jpmml.sparkml.PMMLBuilder.buildFile(PMMLBuilder.java:263)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

During handling of the above exception, another exception occurred:

IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-57-8881338fc369> in <module>()
      3 pmmlBuilder = PMMLBuilder(sc, trainingData, pipelineModel)      .putOption(classifier, "compact", True)
      4 
----> 5 pmmlBuilder.buildFile("DecisionTree.pmml")

~/.local/lib/python3.6/site-packages/pyspark2pmml/__init__.py in buildFile(self, path)
     24         def buildFile(self, path):
     25                 javaFile = self.sc._jvm.java.io.File(path)
---> 26                 javaFile = self.javaPmmlBuilder.buildFile(javaFile)
     27                 return javaFile.getAbsolutePath()
     28 

/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: 'Expected string, integral, double or boolean type, got vector type'

According to the documentation (https://spark.apache.org/docs/1.5.2/ml-decision-tree.html), a decision tree takes a "labelCol" parameter of type double and a "featuresCol" parameter of vector type. My trainingData has exactly this schema. Is there any support for vector columns?

vruusmann commented 5 years ago

the parameters given to a decision tree are "labelCol" which is of the type "double" and "featuresCol" which is a vector type.

JPMML-SparkML supports only a subset of Apache Spark ML types.

Specifically, the vector type is not supported. A vector column must be expanded into scalar columns, for example, by applying the VectorIndexer transformation.

chinyii commented 3 years ago

@yashwanthmadaka24 Hello! I know this thread is kind of old, but do you happen to remember how you solved this particular issue? I tried VectorIndexer but even after that it shows up as a vector. It is most likely that I did it wrongly, so I just want to know how you did it in particular.

wjunneng commented 2 years ago

@yashwanthmadaka24 Hello! I know this thread is kind of old, but do you happen to remember how you solved this particular issue? I tried VectorIndexer but even after that it shows up as a vector. It is most likely that I did it wrongly, so I just want to know how you did it in particular.

I solved it by casting the input columns to double: input col -> cast('double')