jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Unable to export spark ml pipeline having vector indexer #73

Closed Megaman21 closed 5 years ago

Megaman21 commented 5 years ago

Hi, I have created a dummy data set and am trying to export a Spark ML pipeline to a PMML file. The pipeline contains a string indexer, vector assembler, vector indexer and random forest classifier. The pipeline is created and fit successfully, but when it attempts to build the PMML file, I get the following exception:

Py4JJavaError: An error occurred while calling o17027.buildFile.
: java.lang.IllegalArgumentException: Field Type has valid values [A, C, B]
    at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:230)
    at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(VectorIndexerModelConverter.java:79)
    at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
    at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:110)
    at org.jpmml.sparkml.PMMLBuilder.buildFile(PMMLBuilder.java:262)
    at sun.reflect.GeneratedMethodAccessor291.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Here "Type" is my categorical column. When I remove the vector indexer from the pipeline, the PMML file is created successfully. Any idea why this is happening? Following is my dataset and code:

+----+--------+-------+---+
|Type|long_num|lat_num|hit|
+----+--------+-------+---+
|   A|    22.0|   89.5|1.0|
|   A|    32.0|   64.0|1.0|
|   B|    11.0|   32.0|0.0|
|   C|    42.0|   11.0|1.0|
|   C|    76.0|   56.0|0.0|
+----+--------+-------+---+

Code:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer
from pyspark2pmml import PMMLBuilder

# df is the DataFrame shown above; sc is the active SparkContext
string_indexer = StringIndexer(inputCol="Type", outputCol="typeIndex")
assembler = VectorAssembler(inputCols=["typeIndex", "long_num", "lat_num"], outputCol="features")
vector_indexer = VectorIndexer(inputCol="features", outputCol="features_indexed", maxCategories=3)
rf = RandomForestClassifier(labelCol="hit", featuresCol="features_indexed", numTrees=10)
pipeline = Pipeline(stages=[string_indexer, assembler, vector_indexer, rf])
model = pipeline.fit(df)
pmmlBuilder = PMMLBuilder(sc, df, model).putOption(None, sc._jvm.org.jpmml.sparkml.model.HasTreeOptions.OPTION_COMPACT, True)
pmmlBuilder.buildFile("RandomForestFraud3.pmml")

vruusmann commented 5 years ago

java.lang.IllegalArgumentException: Field Type has valid values [A, C, B]

You're defining the value space of Type column in two places - first using StringIndexer, and then using VectorIndexer. This exception is raised, because these two definitions are in conflict with one another (presumably, StringIndexer orders category values by popularity, whereas VectorIndexer orders them by something else - perhaps lexicographically).
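To make the suspected mismatch concrete, here is a small pure-Python sketch of the two orderings applied to the Type column from the question. It is only an illustration: the function names are hypothetical, this is not Spark's or JPMML-SparkML's actual code, and the alphabetical tie-break for equally frequent values is an assumption.

```python
from collections import Counter


def frequency_desc_labels(values):
    """Order distinct values by descending frequency (hypothetical helper).

    Ties are broken alphabetically; that tie-break is an assumption made
    for this illustration.
    """
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))


def lexicographic_labels(values):
    """Order distinct values alphabetically (hypothetical helper)."""
    return sorted(set(values))


# The Type column from the dataset in the question.
types = ["A", "A", "B", "C", "C"]

by_frequency = frequency_desc_labels(types)
by_name = lexicographic_labels(types)

print(by_frequency)  # ['A', 'C', 'B']
print(by_name)       # ['A', 'B', 'C']
```

The frequency-based order matches the `[A, C, B]` reported in the exception message, while an alphabetical order would put the same three values in a different sequence, so two definitions built this way cannot both describe the same field.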

If you're already applying StringIndexer to a column, then there is no point in re-indexing it a second time.
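A minimal sketch of the pipeline from the question with the VectorIndexer stage dropped, which is the configuration the reporter says exports successfully. The function name is hypothetical, and the sketch assumes a working PySpark installation with an active Spark session when the function is called.

```python
def build_pipeline_without_vector_indexer():
    """Return an unfitted Pipeline like the one in the question, minus VectorIndexer.

    Hypothetical helper for illustration. Imports are kept inside the function
    so that defining it does not itself require PySpark to be installed.
    """
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    # StringIndexer alone defines the value space of the Type column.
    string_indexer = StringIndexer(inputCol="Type", outputCol="typeIndex")
    assembler = VectorAssembler(
        inputCols=["typeIndex", "long_num", "lat_num"],
        outputCol="features",
    )
    # The classifier reads the assembled vector directly, with no re-indexing.
    rf = RandomForestClassifier(labelCol="hit", featuresCol="features", numTrees=10)
    return Pipeline(stages=[string_indexer, assembler, rf])
```

Fitting and exporting would then proceed as in the question, for example `model = build_pipeline_without_vector_indexer().fit(df)` followed by `PMMLBuilder(sc, df, model).buildFile("RandomForestFraud3.pmml")`, where `df` and `sc` are the DataFrame and SparkContext from the original post.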

Not fixing anything in the JPMML-SparkML library here - I think it's correct behaviour to reject a nonsensical pipeline (although the exception message could be a bit more informative).