jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

StringIndexer, VectorAssembler and XGBoostRegressor pipeline cannot be saved as a PMML file #99

Closed cug-wuyu closed 4 years ago

cug-wuyu commented 4 years ago

When I save a pipeline that contains StringIndexer (used to label-encode the categorical features), VectorAssembler and XGBoostRegressor as a PMML file, the program prints the following error:

java.lang.IllegalArgumentException: Field fea_rong_0 has data type string
    at org.jpmml.converter.PMMLEncoder.toContinuous(PMMLEncoder.java:208)
    at org.jpmml.converter.CategoricalFeature.toContinuousFeature(CategoricalFeature.java:57)
    at org.jpmml.converter.Feature.toContinuousFeature(Feature.java:53)
    at org.jpmml.sparkml.xgboost.BoosterUtil$1.apply(BoosterUtil.java:69)
    at org.jpmml.sparkml.xgboost.BoosterUtil$1.apply(BoosterUtil.java:57)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.jpmml.converter.Schema.toTransformedSchema(Schema.java:97)
    at org.jpmml.sparkml.xgboost.BoosterUtil.encodeBooster(BoosterUtil.java:80)
    at org.jpmml.sparkml.xgboost.XGBoostRegressionModelConverter.encodeModel(XGBoostRegressionModelConverter.java:40)
    at org.jpmml.sparkml.xgboost.XGBoostRegressionModelConverter.encodeModel(XGBoostRegressionModelConverter.java:28)
    at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:171)
    at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:120)
    at com.rong360.jianhang.spark.newSpark.RunXGBRegression$.main(RunXGBRegression.scala:47)
    at com.rong360.jianhang.spark.newSpark.RunXGBRegression.main(RunXGBRegression.scala)

vruusmann commented 4 years ago

java.lang.IllegalArgumentException: Field fea_rong_0 has data type string
    at org.jpmml.converter.PMMLEncoder.toContinuous(PMMLEncoder.java:208)
    at org.jpmml.converter.CategoricalFeature.toContinuousFeature(CategoricalFeature.java:57)

You're trying to use a categorical/string feature in a context that requires a continuous/numeric feature.

In the current case, please encode categorical features using the OneHotEncoder (or similar) transformation (the data flow for string columns should be StringIndexer -> OneHotEncoder -> VectorAssembler).
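To make the difference between the two encodings concrete, here is a minimal, library-free sketch of that data flow. It is not Spark's implementation and uses no Spark API: the helper names `fit_string_index`, `one_hot` and `assemble`, the sample rows, and the alphabetical tie-break are all assumptions made for illustration. The indexing step mimics the frequency-descending order that StringIndexer uses by default; the one-hot step keeps one column per category, whereas Spark's OneHotEncoder drops the last category by default.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an integer, most frequent label first.

    Hypothetical helper: ties are broken alphabetically (an assumption).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, size):
    """Return `size` indicator values with a single 1.0 at position `index`."""
    return [1.0 if position == index else 0.0 for position in range(size)]


def assemble(categorical_vector, numeric_values):
    """Concatenate encoded categorical columns and numeric columns into one row."""
    return list(categorical_vector) + [float(value) for value in numeric_values]


# Made-up sample data: one string column and one numeric column.
rows = [("red", 1.5), ("blue", 0.2), ("red", 3.0), ("green", 2.1)]

# Step 1 (indexing): strings become integer codes.
index_of = fit_string_index(color for color, _ in rows)

# Step 2 (one-hot) and step 3 (assembly): each code becomes indicator columns,
# which are then joined with the numeric column.
features = [
    assemble(one_hot(index_of[color], len(index_of)), [amount])
    for color, amount in rows
]

for row in features:
    print(row)
```

With indexing alone, the model would receive the codes 0, 1 and 2 as if they were magnitudes on a numeric scale. After the one-hot step each category has its own indicator column, so no ordering between categories is implied.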

This is not a bug. In fact, the JPMML-SparkML library helped to reveal an invalid Apache Spark ML pipeline here.

cug-wuyu commented 4 years ago

@vruusmann Thanks, but in issue #73 the pipeline structure (StringIndexer, VectorAssembler, random forest) is similar to mine. Why was that pipeline converted to a PMML file successfully while mine failed?

vruusmann commented 4 years ago

@cug-wuyu The pipeline presented in issue #73 is also invalid - categorical features have NOT been properly prepared there.

Apache Spark ML lets you do stupid things. JPMML-SparkML informs you about the most critical mistakes (e.g. improper encoding of categorical features), hoping that you'll appreciate it and fix your mistake.

Sure, the exception message could be more informative.