Unable to export spark ml pipeline having vector indexer

Hi, I have created a dummy data set and am trying to export a Spark ML pipeline to a PMML file. The pipeline contains a StringIndexer, a VectorAssembler, a VectorIndexer and a RandomForestClassifier. The pipeline is created and fit successfully, but when it attempts to build the PMML file I get the following exception:

Py4JJavaError: An error occurred while calling o17027.buildFile.
: java.lang.IllegalArgumentException: Field Type has valid values [A, C, B]
    at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:230)
    at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(VectorIndexerModelConverter.java:79)
    at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
    at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:110)
    at org.jpmml.sparkml.PMMLBuilder.buildFile(PMMLBuilder.java:262)
    at sun.reflect.GeneratedMethodAccessor291.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Here "Type" is my categorical column. When I remove the VectorIndexer from the pipeline, the PMML file is created successfully. Any idea why this is happening? The following is my data set and code:

+----+--------+-------+---+
|Type|long_num|lat_num|hit|
+----+--------+-------+---+
|   A|    22.0|   89.5|1.0|
|   A|    32.0|   64.0|1.0|
|   B|    11.0|   32.0|0.0|
|   C|    42.0|   11.0|1.0|
|   C|    76.0|   56.0|0.0|
+----+--------+-------+---+

Code:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer
from pyspark2pmml import PMMLBuilder

# sc is the SparkContext and df is the data set shown above
string_indexer = StringIndexer(inputCol="Type", outputCol="typeIndex")
assembler = VectorAssembler(inputCols=["typeIndex", "long_num", "lat_num"], outputCol="features")
vector_indexer = VectorIndexer(inputCol="features", outputCol="features_indexed", maxCategories=3)
rf = RandomForestClassifier(labelCol="hit", featuresCol="features_indexed", numTrees=10)
pipeline = Pipeline(stages=[string_indexer, assembler, vector_indexer, rf])
model = pipeline.fit(df)
pmmlBuilder = PMMLBuilder(sc, df, model).putOption(None, sc._jvm.org.jpmml.sparkml.model.HasTreeOptions.OPTION_COMPACT, True)
pmmlBuilder.buildFile("RandomForestFraud3.pmml")

java.lang.IllegalArgumentException: Field Type has valid values [A, C, B]

You're defining the value space of the Type column in two places: first using StringIndexer, and then using VectorIndexer. The exception is raised because these two definitions conflict with one another (presumably, StringIndexer orders category values by popularity, whereas VectorIndexer orders them by something else, perhaps lexicographically).
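To see how two orderings of the same column can disagree, here is a minimal plain-Python sketch (no Spark involved) using the Type values from the data set above. The `frequency_order` helper is hypothetical and only imitates StringIndexer's default "most frequent first" behaviour; breaking ties alphabetically is an assumption of this sketch. The lexicographic ordering is the guess made above, not confirmed VectorIndexer behaviour.

```python
from collections import Counter

# The "Type" column from the data set in the question.
types = ["A", "A", "B", "C", "C"]


def frequency_order(values):
    """Most frequent value first; ties broken alphabetically (assumption)."""
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))


def lexicographic_order(values):
    """Plain alphabetical order of the distinct values."""
    return sorted(set(values))


by_frequency = frequency_order(types)
by_name = lexicographic_order(types)

print(by_frequency)             # ['A', 'C', 'B']
print(by_name)                  # ['A', 'B', 'C']
print(by_frequency == by_name)  # False
```

The frequency-based order comes out as [A, C, B], the same list that appears in the exception message, while the alphabetical order is [A, B, C]. Two encoders that each assume a different one of these orderings cannot agree on which index means which category.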

If you're already applying StringIndexer to a column, then there is no point in re-indexing it a second time.

Not fixing anything in the JPMML-SparkML library here. I think it's correct behaviour to reject a nonsensical pipeline (although the exception message could be a bit more informative).

jpmml / jpmml-sparkml

Unable to export spark ml pipeline having vector indexer #73