Closed ohmystack closed 4 years ago
@vruusmann Would you mind reviewing this?
No objections - this is a valid code change.
I'm more interested in how you found it. Do you have a reproducible use case where JPMML-SparkML fails because of this sanity check?
Env: Spark 2.2.1. Pipeline stages: VectorAssembler, StringIndexer, XGBoostEstimator (classification).
In the 2nd stage, `StringIndexerConverter` uses `transformer.labels()` to generate categories for the label column (code). The array returned by `StringIndexerModel.labels()` is not sorted: it can be `[0, 1]` or `[1, 0]`. Then, the field is encoded using this array.
In the 3rd stage, when encoding the schema, my code runs into the `ContinuousFeature` condition and the issue occurs. Maybe my code has some problem: in the 3rd stage, the label should already have been converted to a `CategoricalFeature`, instead of being converted from a `ContinuousFeature` again. But when that happens, the code breaks there.
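To make the failure mode concrete, here is a plain-Scala sketch of the dispatch that would avoid it. The case classes below are illustrative stand-ins, not the actual JPMML-SparkML feature types: a converter that matches on the feature type can reuse the categories recorded by `StringIndexer` instead of re-deriving them from a continuous feature.

```scala
// Illustrative sketch only -- these case classes stand in for the real
// JPMML converter feature types; the actual API differs.
sealed trait Feature
case class ContinuousFeature(name: String) extends Feature
case class CategoricalFeature(name: String, values: Seq[String]) extends Feature

// Dispatch on the feature type: reuse StringIndexer's categories when
// the label was already converted, instead of assuming a continuous label.
def labelCategories(feature: Feature): Seq[String] = feature match {
  case CategoricalFeature(_, values) => values        // keep StringIndexer's order
  case ContinuousFeature(_)          => Seq("0", "1") // the assumption that breaks
}
```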
Here is an example where `transformer.labels()` returns `[1, 0]`.
train.csv
LABEL,col1,col2
1,628,787
1,75,1794
1,444,14899
1,53,500
1,997,286
1,613,729
1,031,2245
1,02,5081
1,850,404
1,647,560
1,78,1517
1,75,4977
0,977,311
1,812,532
1,472,10822
0,570,2304
1,519,690
1,834,843
1,054,30
spark-shell
scala> val data = spark.sqlContext.read.format("csv").
| options(Map(
| "header" -> "true",
| "ignoreLeadingWhiteSpace" -> "true",
| "ignoreTrailingWhiteSpace" -> "true",
| "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
| "inferSchema" -> "true",
| "mode" -> "FAILFAST")).
| load("train.csv")
scala> val labelColName = "LABEL"
scala> import org.apache.spark.ml.feature.StringIndexer
scala> val labelIndexer = new StringIndexer().setInputCol(labelColName).setOutputCol("label")
scala> val labelIndexerModel = labelIndexer.fit(data)
labelIndexerModel: org.apache.spark.ml.feature.StringIndexerModel = strIdx_e56af3540e9d
scala> labelIndexerModel.labels
res0: Array[String] = Array(1, 0) // <- Here's the [1, 0] values
The problem is that `StringIndexerModel#labels()` returns labels in order of "popularity". In most datasets the `0` label is more popular than the `1` label (e.g. there are many more non-fraud cases than fraud cases), so the labels are ordered `[0, 1]`. However, your dataset appears to have many more `1` labels than `0` labels, so the labels are ordered `[1, 0]`, and this sanity check fails.
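The popularity ordering can be reproduced without Spark. This sketch mimics `StringIndexer`'s default `frequencyDesc` behavior (labels sorted by descending occurrence count, ties broken by the string value) on a column like the `LABEL` column above:

```scala
// Plain-Scala illustration of frequency-descending label ordering:
// "1" occurs more often than "0", so it comes first.
val column = Seq("1", "1", "1", "0", "1", "0")
val labels = column
  .groupBy(identity)
  .map { case (label, occurrences) => (label, occurrences.size) }
  .toSeq
  .sortBy { case (label, count) => (-count, label) } // ties broken by string order
  .map(_._1)
// labels == Seq("1", "0"), matching the spark-shell output above
```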
This issue needs more investigation. The order of target labels is critical, because it determines the encoding of model objects. For example, `LogisticRegressionModel` performs its computation relative to the second target category, so if we pass an "unexpected" ordering of labels, we will get a misbehaving logistic regression model (it will predict the probability of the `0` class where the probability of the `1` class is expected, and vice versa).
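A small sketch of that misread: binary logistic regression reports the probability of the *second* label, so the same score means opposite things under the two label orderings.

```scala
// The same sigmoid output is P(class "1") under labels = ["0", "1"]
// but P(class "0") under labels = ["1", "0"].
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

val z = 1.2
val pSecondLabel = sigmoid(z)
// A consumer expecting P(class 1) under a flipped label array
// would actually need the complement:
val pClass1IfFlipped = 1.0 - pSecondLabel
```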
Agreed, this needs more investigation.
Related to https://github.com/jpmml/jpmml-sparkml/issues/14
There should be `ContinuousDomain` and `CategoricalDomain` meta-transformer classes that collect canonical label information.
According to the PMML docs, a "categorical" field does not require a defined order. Users should use an "ordinal" field instead of "categorical" if they want ordering. So there is no need to check a categorical field's order.
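If "categorical" implies no ordering, the sanity check could compare label *sets* rather than label *arrays*. A hypothetical sketch (not JPMML-SparkML code):

```scala
// Compare categorical domains as sets, ignoring order.
val hardcodedLabels = Array("0", "1")
val sparkLabels = Array("1", "0") // what StringIndexerModel.labels returned
val sameDomain = hardcodedLabels.toSet == sparkLabels.toSet
// sameDomain is true: both arrays describe the same categorical domain
```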
For example, in jpmml-spark here, the values' order is hardcoded. The hardcoded order `[0, 1]` may differ from the order `[1, 0]` given by Spark ML. For a categorical field, they should be treated as the same thing.