jpmml / jpmml-sparkml

Java library and command-line application for converting Apache Spark ML pipelines to PMML
GNU Affero General Public License v3.0

Handling columns with null values #44

Open malathit opened 6 years ago

malathit commented 6 years ago
Exception in thread "main" java.lang.IllegalArgumentException: Field a1 has valid values [b, a]
    at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:189)
    at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(VectorIndexerModelConverter.java:98)
    at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
    at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:96)
    at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:68)

I get the above exception when the column has null values. Any ideas on how to resolve this? Please comment if further details are needed.

vruusmann commented 6 years ago

> I get the above exception when the column has null values. Any ideas on how to resolve this?

Apply org.apache.spark.ml.feature.Imputer to this column first?

What is your Apache Spark version? How does Apache Spark handle columns with missing values - AFAIK it should also crash sooner or later.

malathit commented 6 years ago

Hi,

Thanks for the quick reply. AFAIK the org.apache.spark.ml.feature.Imputer class can be used only on float or double columns. The column that gives me the error is of String type.

I am using Apache Spark 2.2.0.

malathit commented 6 years ago

> How does Apache Spark handle columns with missing values - AFAIK it should also crash sooner or later.

In Apache Spark, null values are handled by the StringIndexer setHandleInvalid method with the value set to "keep". Let me put together simplified code that reproduces the issue and share it.
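As a rough illustration of the "keep" behaviour mentioned above, here is a plain-Scala sketch that does not call Spark at all. It assumes that a fitted StringIndexer orders labels by descending frequency and that "keep" sends invalid values (nulls included) to one extra index equal to the number of labels. The object and method names are hypothetical, and the alphabetical tie-breaking is an assumption made only to keep the sketch deterministic.

```scala
// Simplified stand-in for a fitted StringIndexer; this is not Spark code.
object KeepIndexingSketch {

  // Learn the labels from the non-null values, most frequent first.
  // Ties are broken alphabetically here (an assumption of this sketch).
  def fitLabels(values: Seq[Option[String]]): Vector[String] =
    values.flatten
      .groupBy(identity)
      .toVector
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)

  // "keep" semantics: a known label maps to its position in `labels`;
  // anything else (a null or an unseen value) maps to labels.size.
  def indexKeep(labels: Vector[String], value: Option[String]): Double =
    value
      .map(v => labels.indexOf(v))
      .filter(_ >= 0)
      .getOrElse(labels.size)
      .toDouble

  def main(args: Array[String]): Unit = {
    val column = Seq(Some("b"), Some("a"), None, Some("b"))
    val labels = fitLabels(column)
    println(labels)                            // the two learned labels
    println(column.map(indexKeep(labels, _)))  // the null lands on index 2.0
  }
}
```

If that assumption holds, the indexed column carries three distinct indices while only two labels ("b" and "a") were learned, which may be related to the "Field a1 has valid values [b, a]" message, though that is only a guess.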

malathit commented 6 years ago

[image: random-forest]

@vruusmann This is the code, and it produces the issue.

vruusmann commented 6 years ago

@malathit90 Sorry, I don't have time to debug images.

malathit commented 6 years ago

Here is the snippet giving the error @vruusmann


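```scala
import java.io.FileOutputStream

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.jpmml.model.MetroJAXBUtil
import org.jpmml.sparkml.ConverterUtil

// `df` and `zeroFilledData` are DataFrames defined elsewhere (not shown).

// Index the String column "a1", keeping invalid (here: null) values instead of failing.
val a1Idx = new StringIndexer().setInputCol("a1").setOutputCol("a1Indexed").setHandleInvalid("keep")

// Assemble the indexed column and "a2" into a single feature vector.
val featureAssembler = new VectorAssembler().setInputCols(Array("a1Indexed", "a2")).setOutputCol("features")

// Index the label column "a16"; fitted up front so its labels can be reused below.
val labelIndexer = new StringIndexer().setInputCol("a16").setOutputCol("labelIndexed").fit(zeroFilledData)

// Treat features with at most 15 distinct values as categorical.
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("featuresIndexed").setMaxCategories(15)

val classifier = new RandomForestClassifier().setLabelCol("labelIndexed").setFeaturesCol("featuresIndexed").setImpurity("gini").setPredictionCol("predictionIndexed")

// Map predicted indices back to the original label strings.
val labelConverter = new IndexToString().setInputCol("predictionIndexed").setOutputCol("prediction").setLabels(labelIndexer.labels)

val pipeline = new Pipeline().setStages(Array(a1Idx, labelIndexer, featureAssembler, featureIndexer, classifier, labelConverter))

val model = pipeline.fit(zeroFilledData)

// The conversion to PMML is where the IllegalArgumentException is thrown.
MetroJAXBUtil.marshalPMML(ConverterUtil.toPMML(df.schema, model), new FileOutputStream("/tmp/out.pmml"))
```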