Open malathit opened 6 years ago
I get the above exception when the column has null values. Any ideas on how to resolve this?
Apply org.apache.spark.ml.feature.Imputer to this column first?
What is your Apache Spark version? How does Apache Spark handle columns with missing values - AFAIK it should also crash sooner or later.
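For reference, the Imputer suggestion above could look roughly like the sketch below. The column names and sample data are hypothetical, and the thread does not confirm that the PMML converter accepts an Imputer stage; this only illustrates the Spark side for a numeric column.

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("imputer-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: "a2" is a Double column with one missing value.
val numeric = Seq[(Int, Option[Double])]((1, Some(1.0)), (2, None), (3, Some(3.0))).toDF("id", "a2")

// Replace missing values in "a2" with the mean of the non-missing values.
val imputer = new Imputer()
  .setInputCols(Array("a2"))
  .setOutputCols(Array("a2Imputed"))
  .setStrategy("mean")

val imputed = imputer.fit(numeric).transform(numeric)
imputed.show()
```

Imputer only accepts float or double input columns, so this does not apply directly to a String column.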
Hi,
Thanks for the quick reply. AFAIK the org.apache.spark.ml.feature.Imputer class can be used only on float or double data types. The column that gives me the error is of String type.
I am using Apache Spark 2.2.0.
> How does Apache Spark handle columns with missing values - AFAIK it should also crash sooner or later.
In Apache Spark, null values are handled by calling the StringIndexer setHandleInvalid method with the value "keep". Let me put together simplified code that reproduces the issue and share it.
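One possible workaround for a String column, sketched here under assumptions, is to replace the nulls with an explicit placeholder category before indexing, so that no null reaches the pipeline at all. The sample data and the "__missing__" label are hypothetical, and the thread does not confirm that this resolves the conversion error.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("null-string-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: "a1" is a String column containing one null.
val raw = Seq[(Option[String], Double)]((Some("x"), 1.0), (None, 2.0), (Some("y"), 3.0)).toDF("a1", "a2")

// Fill nulls in the String column with a placeholder label.
// "__missing__" is an arbitrary value chosen for this sketch.
val filled = raw.na.fill("__missing__", Seq("a1"))

// With no nulls left, the indexer treats the placeholder as an ordinary category.
val a1Idx = new StringIndexer().setInputCol("a1").setOutputCol("a1Indexed")
val indexed = a1Idx.fit(filled).transform(filled)
indexed.show()
```

The trade-off is that missing values become a regular category in the fitted model, instead of relying on the indexer's invalid-value handling.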
@vruusmann This is the code and it gives the issue
@malathit90 Sorry, I don't have time to debug images.
Here is the snippet giving the error @vruusmann
```scala
import java.io.FileOutputStream
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.jpmml.model.MetroJAXBUtil
import org.jpmml.sparkml.ConverterUtil

// "keep" assigns null/unseen values of a1 to an extra index instead of failing
val a1Idx = new StringIndexer().setInputCol("a1").setOutputCol("a1Indexed").setHandleInvalid("keep")
val featureAssembler = new VectorAssembler().setInputCols(Array("a1Indexed", "a2")).setOutputCol("features")
val labelIndexer = new StringIndexer().setInputCol("a16").setOutputCol("labelIndexed").fit(zeroFilledData)
val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("featuresIndexed").setMaxCategories(15)
val classifier = new RandomForestClassifier().setLabelCol("labelIndexed").setFeaturesCol("featuresIndexed").setImpurity("gini").setPredictionCol("predictionIndexed")
val labelConverter = new IndexToString().setInputCol("predictionIndexed").setOutputCol("prediction").setLabels(labelIndexer.labels)
val pipeline = new Pipeline().setStages(Array(a1Idx, labelIndexer, featureAssembler, featureIndexer, classifier, labelConverter))
val model = pipeline.fit(zeroFilledData)
// `df` is not defined in this snippet; presumably the DataFrame the model was fitted on
MetroJAXBUtil.marshalPMML(ConverterUtil.toPMML(df.schema, model), new FileOutputStream("/tmp/out.pmml"))
```