jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type #26

Closed JuMan0603 closed 7 years ago

JuMan0603 commented 7 years ago


I ran into a problem when converting a model with JPMML-SparkML. This is my data source: val trainingDataFrame = spark.read.format("libsvm").load(libsvmDataPath).toDF("label", "features") The schema of "trainingDataFrame" contains the VectorUDT type, so calling ConverterUtil.toPMML(newSchema, loadedModel) throws java.lang.IllegalArgumentException. Here is the code:

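    import javax.xml.transform.stream.StreamResult
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{PCA, VectorIndexer}
    import org.jpmml.model.JAXBUtil
    import org.jpmml.sparkml.ConverterUtil

    // `spark` is assumed to be the active SparkSession
    val training = spark.read.format("libsvm").load(libsvmDataPath).toDF("label", "features")

    val vi = new VectorIndexer()
    val pca = new PCA()
    val lr = new LogisticRegression()
    val pipeline = new Pipeline().setStages(Array(vi, pca, lr))
    val model = pipeline.fit(training)

    println("traing dataframe's schema is:  " + training.schema.mkString)
    val schema = training.schema
    // Fails here: the schema contains the vector-typed "features" column
    val pmml = ConverterUtil.toPMML(schema, model)
    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))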

The full output, including the stack trace, is:

|label|            features|
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|
|  1.0|(692,[152,153,154...|
|  1.0|(692,[151,152,153...|
|  0.0|(692,[129,130,131...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[99,100,101,...|
|  0.0|(692,[154,155,156...|
|  0.0|(692,[127,128,129...|
only showing top 10 rows

traing dataframe's schema is:   
Exception in thread "main" java.lang.IllegalArgumentException: Expected string, integral, double or boolean type, got vector type
    at org.jpmml.sparkml.SparkMLEncoder.createDataField(
    at org.jpmml.sparkml.SparkMLEncoder.getFeatures(
    at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(
    at org.jpmml.sparkml.FeatureConverter.registerFeatures(
    at org.jpmml.sparkml.ConverterUtil.toPMML(
    at com.myhexin.oryx.batchlayer.TestPMML$.trainModel(TestPMML.scala:138)
    at com.myhexin.oryx.batchlayer.TestPMML$.main(TestPMML.scala:29)
    at com.myhexin.oryx.batchlayer.TestPMML.main(TestPMML.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(
    at java.lang.reflect.Method.invoke(
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:743)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

What should I do to solve this VectorUDT unsupported problem?

vruusmann commented 7 years ago

Duplicate of, and

The resolution is still the same: from the PMML perspective, vector columns are under-specified and cannot/won't be supported.

val trainingDataFrame = spark.read.format("libsvm").load(libsvmDataPath).toDF("label", "features")

LibSVM is a vector-oriented data format. Please load your dataset from some non vector-oriented data format (such as CSV).
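
A minimal sketch of that suggestion, assuming Spark 2.x and a JPMML-SparkML version that exposes ConverterUtil.toPMML(schema, pipelineModel) as used in the code above. The CSV file, its header row, the "label" column name and the object name CsvToPmmlSketch are hypothetical and only serve the illustration. The point is that the DataFrame passed to the converter holds scalar columns only, and the vector column is assembled inside the pipeline by VectorAssembler. Whether a numeric label is accepted as-is may depend on the library version.

```scala
import javax.xml.transform.stream.StreamResult

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.jpmml.model.JAXBUtil
import org.jpmml.sparkml.ConverterUtil

object CsvToPmmlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CsvToPmmlSketch").getOrCreate()

    // Hypothetical input: a CSV file with a header row, a numeric "label"
    // column, and one scalar (non-vector) column per feature.
    val csvDataPath = args(0)
    val training = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(csvDataPath)

    // Every column except the label is treated as a feature.
    val featureColumns = training.columns.filter(_ != "label")

    // The vector column is created as a pipeline stage, so the input schema
    // itself contains no VectorUDT column.
    val assembler = new VectorAssembler()
      .setInputCols(featureColumns)
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

    // Pass the schema of the scalar-column DataFrame to the converter.
    val pmml = ConverterUtil.toPMML(training.schema, model)
    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))

    spark.stop()
  }
}
```

With this layout each PMML input field corresponds to a named scalar column, which is the information a bare vector column does not carry.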

JuMan0603 commented 7 years ago

I solved it, thank you!

damonjun512 commented 5 years ago

Can you show the details?