Closed vikatskhay closed 7 years ago
The StringIndexer transformation translates string labels to indexes. The indexes are assigned by "popularity", so the most frequent string label will be mapped to 0, the second most frequent string label will be mapped to 1, and so on.
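To make the "popularity" ordering concrete, here is a minimal plain-Java sketch of the idea. It does not use Spark; the class and method names (FrequencyIndexSketch, fitIndex) are made up for illustration, and breaking ties alphabetically is an assumption of the sketch rather than a documented guarantee.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Spark or JPMML-SparkML.
class FrequencyIndexSketch {

    // Maps each distinct label to an index, most frequent label first.
    // Ties are broken alphabetically (an assumption of this sketch).
    static Map<String, Integer> fitIndex(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }

        List<String> ordered = new ArrayList<>(counts.keySet());
        ordered.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });

        // Insertion order of the map follows the assigned indexes.
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String label : ordered) {
            index.put(label, index.size());
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> countries = Arrays.asList("FR", "DE", "FR", "US", "FR", "DE");
        System.out.println(fitIndex(countries)); // prints {FR=0, DE=1, US=2}
    }
}
```

The index only reflects how often each label occurred in the training data, so a different dataset can produce a different mapping for the same labels.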
The output of a StringIndexer transformation is a numeric column, but it's devoid of any meaning (e.g. if your example has the mapping FR -> 0 and DE -> 1, then what does it mean - France is "less than" Germany?).
You can give this numeric column meaning by binarizing it in a "one-vs-rest" fashion using the OneHotEncoder transformation:
StringIndexer countryIndexer = new StringIndexer()
    .setInputCol("country")
    .setOutputCol("country_index");

// THIS!
OneHotEncoder countryBinarizer = new OneHotEncoder()
    .setInputCol("country_index")
    .setOutputCol("country_bitvector");

VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[]{"country_bitvector", "a", "b"})
    .setOutputCol("features");
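To show what "one-vs-rest" binarization does to an index, here is a small plain-Java sketch. It is only an illustration of the idea, not Spark's implementation, and the names OneHotSketch and oneHot are hypothetical. As far as I know, Spark's OneHotEncoder also drops the last category by default (dropLast), so its vectors are one element shorter than the ones below.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark or JPMML-SparkML.
class OneHotSketch {

    // Turns a category index into a 0/1 vector with a single 1 at that position.
    static double[] oneHot(int index, int numCategories) {
        if (index < 0 || index >= numCategories) {
            throw new IllegalArgumentException("Index out of range: " + index);
        }
        double[] bits = new double[numCategories]; // all zeros initially
        bits[index] = 1.0;
        return bits;
    }

    public static void main(String[] args) {
        // With FR -> 0 and DE -> 1, each country gets its own indicator slot.
        System.out.println(Arrays.toString(oneHot(0, 2))); // prints [1.0, 0.0]
        System.out.println(Arrays.toString(oneHot(1, 2))); // prints [0.0, 1.0]
    }
}
```

Each slot then gets its own coefficient in a linear model, so no ordering between FR and DE is implied.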
Exception in thread "main" java.lang.UnsupportedOperationException at org.jpmml.converter.CategoricalFeature.toContinuousFeature(CategoricalFeature.java:63)
Basically, the JPMML-SparkML library has "detected" that you're trying to use a categorical feature in a context that requires a continuous feature.
It's a valid exception, because you should never pass a "raw" StringIndexer output column to any ML algorithm (not just LogisticRegression). Sure, in order to avoid confusion, the type of this exception needs to be something other than java.lang.UnsupportedOperationException, and there needs to be a proper message (e.g. java.lang.IllegalArgumentException("Cannot cast a feature from categorical operational type to continuous operational type")).
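A rough sketch of what such a check could look like is below. The types here (OperationalTypeSketch, OpType, Feature, toContinuous) are hypothetical and are not the actual JPMML converter classes; only the exception type and message come from the suggestion above.

```java
// Hypothetical illustration, not the JPMML-SparkML source code.
class OperationalTypeSketch {

    enum OpType { CATEGORICAL, CONTINUOUS }

    static final class Feature {
        final String name;
        final OpType opType;

        Feature(String name, OpType opType) {
            this.name = name;
            this.opType = opType;
        }

        // Returns this feature if it is already continuous,
        // otherwise fails with a descriptive message.
        Feature toContinuous() {
            if (opType != OpType.CONTINUOUS) {
                throw new IllegalArgumentException(
                    "Cannot cast a feature from categorical operational type "
                    + "to continuous operational type");
            }
            return this;
        }
    }

    public static void main(String[] args) {
        Feature country = new Feature("country_index", OpType.CATEGORICAL);
        try {
            country.toContinuous();
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```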
By the way, I also tried the same code with the library version 1.0.9 and Spark 1.6, it did get exported.
Apache Spark 1.6.X and JPMML-SparkML 1.0.X are no longer supported.
The export operation succeeds, but the resulting PMML document is nonsensical - it contains an instruction to multiply the name of the country by 1.2203484517215881.
Thanks a lot @vruusmann! That's really helpful.
Hi,
I'm testing a very simple case just to evaluate the library and ran into an issue. Here's the code:
Here's a piece of relevant output (predictions.show() and the exception):
The training data:
The exception is thrown when the country feature is handled in RegressionModelUtil.createRegressionTable().
Am I doing something wrong? Or is using StringIndexer with LogisticRegression simply not working right?
By the way, I also tried the same code with the library version 1.0.9 and Spark 1.6, and it did get exported:
However, evaluating this PMML didn't work:
Thank you very much beforehand!