jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type #26

Closed by JuMan0603 7 years ago

JuMan0603 commented 7 years ago

Hello,

I ran into a problem when converting a model with JPMML-SparkML. This is my data source: val trainingDataFrame = spark.read.format("libsvm").load(libsvmDataPath).toDF("label", "features"). The schema of "trainingDataFrame" contains a VectorUDT column, so calling ConverterUtil.toPMML(newSchema, loadedModel) throws java.lang.IllegalArgumentException. Here is the code:

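    import javax.xml.transform.stream.StreamResult

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{PCA, VectorIndexer}
    import org.jpmml.model.JAXBUtil
    import org.jpmml.sparkml.ConverterUtil

    // Loading LibSVM data yields a "features" column of vector type (VectorUDT)
    val training = spark.read.format("libsvm").load(libsvmDataPath).toDF("label", "features")

    val vi = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexed")
      .setMaxCategories(693)

    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(3)

    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
      .setProbabilityCol("myProbability")

    val pipeline = new Pipeline().setStages(Array(vi, pca, lr))

    // Fit the pipeline and persist the fitted model
    val model = pipeline.fit(training)

    model.write.overwrite().save(modelSavePath)

    training.show(10)
    println("==========================")
    println("traing dataframe's schema is:  " + training.schema.mkString)
    println("==========================")

    // The schema passed here still contains the vector column, which triggers the exception
    val schema = training.schema
    val pmml = ConverterUtil.toPMML(schema, model)
    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))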

The full output, including the stack trace, is:

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|
+-----+--------------------+
only showing top 10 rows

==========================
traing dataframe's schema is:   
StructField(label,DoubleType,true)StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)
==========================
Exception in thread "main" java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type
    at org.jpmml.sparkml.SparkMLEncoder.createDataField(SparkMLEncoder.java:160)
    at org.jpmml.sparkml.SparkMLEncoder.getFeatures(SparkMLEncoder.java:73)
    at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(VectorIndexerModelConverter.java:56)
    at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:47)
    at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:75)
    at com.myhexin.oryx.batchlayer.TestPMML$.trainModel(TestPMML.scala:138)
    at com.myhexin.oryx.batchlayer.TestPMML$.main(TestPMML.scala:29)
    at com.myhexin.oryx.batchlayer.TestPMML.main(TestPMML.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

How can I work around VectorUDT columns not being supported?

vruusmann commented 7 years ago

Duplicate of https://github.com/jpmml/jpmml-sparkml/issues/2, https://github.com/jpmml/jpmml-sparkml/issues/18 and https://github.com/jpmml/jpmml-sparkml/issues/21

The resolution is still the same: from the PMML perspective, vector columns are under-specified and cannot/won't be supported.

val trainingDataFrame = spark.read.format("libsvm").load(libsvmDataPath).toDF("label", "features")

LibSVM is a vector-oriented data format. Please load your dataset from a non-vector-oriented data format (such as CSV).
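
For readers landing here, the advice above could look roughly like the following. This is only a sketch under assumptions, not the reporter's actual fix: the object name CsvTrainingSketch, the helper buildPipeline, the CSV path argument, and the presence of a header row with a "label" column are all hypothetical. The idea is that the input DataFrame holds only scalar columns, and the vector column is produced inside the pipeline by a VectorAssembler stage. The VectorIndexer and PCA stages from the original post are left out for brevity.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object CsvTrainingSketch {

  // Builds a pipeline whose first stage assembles scalar columns into a vector,
  // so the vector column is created inside the pipeline rather than in the input schema.
  def buildPipeline(featureCols: Array[String]): Pipeline = {
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)

    new Pipeline().setStages(Array(assembler, lr))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-training-sketch").getOrCreate()

    // Hypothetical input: a CSV file with a header row, a numeric "label" column,
    // and one numeric column per feature.
    val csvDataPath = args(0)
    val training = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(csvDataPath)

    // Every column except the label is treated as a feature (an assumption of this sketch).
    val featureCols = training.columns.filter(_ != "label")
    val model = buildPipeline(featureCols).fit(training)

    // The input schema now lists only scalar columns, no VectorUDT.
    println(training.schema.mkString)
    println("fitted stages: " + model.stages.length)

    spark.stop()
  }
}
```

With the data loaded this way, training.schema and the fitted model can be passed to ConverterUtil.toPMML as in the original post. Whether every stage of a given pipeline converts cleanly depends on the JPMML-SparkML version in use.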

JuMan0603 commented 7 years ago

I solved it, thank you!

damonjun512 commented 5 years ago

Can you show the details?