jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Support multi-column transformation modes (in newer Apache Spark versions) #78

Closed brightzhong closed 4 years ago

brightzhong commented 5 years ago

My code is:

    import java.io.FileOutputStream

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoderEstimator, QuantileDiscretizer, VectorAssembler}
    import org.jpmml.model.MetroJAXBUtil
    import org.jpmml.sparkml.PMMLBuilder

    // all_f, all_dis_f and all_vec_f are sequences of column names (strings), defined elsewhere.
    // dataSet is the training DataFrame, also defined elsewhere.

    // Features for XGBoost: the raw feature columns
    val vectorAssembler4gbdt = new VectorAssembler()
      .setInputCols(Array(all_f: _*))
      .setOutputCol("feat4gbdt")

    // Features for LR: the one-hot encoded columns
    val vectorAssembler4lr = new VectorAssembler()
      .setInputCols(Array(all_vec_f: _*))
      .setOutputCol("feat4lr")

    // Discretizer (multi-column mode)
    val discretizer = new QuantileDiscretizer()
      .setInputCols(Array(all_f: _*))
      .setOutputCols(Array(all_dis_f: _*))
      .setNumBuckets(50)

    // One-hot encoder
    val encoder = new OneHotEncoderEstimator()
      .setHandleInvalid("keep")
      .setInputCols(Array(all_dis_f: _*))
      .setOutputCols(Array(all_vec_f: _*))

    // Pipeline model
    println(" pipeline...")
    val preProDataPipeline = new Pipeline()
      .setStages(Array(discretizer, encoder, vectorAssembler4gbdt, vectorAssembler4lr))

    val pipelineModel = preProDataPipeline.fit(dataset = dataSet)

    val pipePathPmml = "xxpathxxx"
    val pMMLBuilder = new PMMLBuilder(dataSet.schema, pipelineModel)

    // The error is thrown here
    MetroJAXBUtil.marshalPMML(pMMLBuilder.build(), new FileOutputStream(pipePathPmml))

Error:

ERROR ApplicationMaster: User class threw exception: java.util.NoSuchElementException: Failed to find a default value for inputCol
java.util.NoSuchElementException: Failed to find a default value for inputCol
    at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
    at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779)
    at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
    at org.apache.spark.ml.param.Params$class.$(params.scala:786)
    at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
    at org.apache.spark.ml.param.shared.HasInputCol$class.getInputCol(sharedParams.scala:221)
    at org.apache.spark.ml.feature.Bucketizer.getInputCol(Bucketizer.scala:48)
    at org.jpmml.sparkml.feature.BucketizerConverter.encodeFeatures(BucketizerConverter.java:48)
    at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
    at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:110)
    at com.tencent.tdw.spark.CTRModel.sparkGbdtLr$.main(sparkGbdtLr.scala:601)
    at com.tencent.tdw.spark.CTRModel.sparkGbdtLr.main(sparkGbdtLr.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:727)

I get the above exception when saving the PMML to a file path. Any ideas on how to resolve this? Please comment if further details are needed.

brightzhong commented 5 years ago

Spark version: 2.3.1; PMML version: 1.4.11

vruusmann commented 5 years ago

java.util.NoSuchElementException: Failed to find a default value for inputCol

You're using QuantileDiscretizer in multi-column mode (QuantileDiscretizer#setInputCols), which isn't currently supported. Please switch to single-column mode (QuantileDiscretizer#setInputCol) for the time being.
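
As a rough sketch of what switching to single-column mode could look like: create one QuantileDiscretizer per feature instead of one multi-column stage. The column names below ("f1", "f2", "f3" and the "_dis" suffix) are hypothetical placeholders for the all_f and all_dis_f columns in the report, and the snippet assumes it is pasted into spark-shell (or wrapped in an object). Each fitted stage should then be a Bucketizer with inputCol set, which is the parameter the converter in the stack trace is asking for.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.QuantileDiscretizer

// Hypothetical column names standing in for all_f and all_dis_f from the report.
val inputCols: Seq[String] = Seq("f1", "f2", "f3")
val outputCols: Seq[String] = inputCols.map(_ + "_dis")

// One single-column QuantileDiscretizer per feature (setInputCol / setOutputCol),
// replacing the single multi-column stage (setInputCols / setOutputCols).
val discretizers: Seq[PipelineStage] = inputCols.zip(outputCols).map {
  case (in, out) =>
    new QuantileDiscretizer()
      .setInputCol(in)
      .setOutputCol(out)
      .setNumBuckets(50)
}

// Only the discretizer stages are shown; append the remaining stages
// (encoder, assemblers) to this sequence before fitting.
val pipeline = new Pipeline().setStages(discretizers.toArray[PipelineStage])
```

The output column names stay the same as in multi-column mode, so downstream stages that read those columns should not need to change. Whether the other stages in the pipeline convert cleanly on a given JPMML-SparkML version is a separate question that this sketch does not address.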

sunyichao commented 5 years ago

How can this be solved?

Jacquelin803 commented 4 years ago

java.util.NoSuchElementException: Failed to find a default value for inputCol

You're using QuantileDiscretizer in multi-column mode (QuantileDiscretizer#setInputCols), which isn't currently supported. Please switch to single-column mode (QuantileDiscretizer#setInputCol) for the time being.

Thanks for the reminder. Do you mean that Spark 3.0 supports multi-column mode (QuantileDiscretizer#setInputCols)? I have tried both MLeap and JPMML, but both of them failed; JPMML produced this same error.

sunyichao commented 4 years ago

Thank you.

vruusmann commented 4 years ago

I have tried both MLeap and JPMML, but both of them failed; JPMML produced this same error.

Bullshit. Recent JPMML-SparkML versions work just fine; see the above commit for when the support was introduced.