jpmml / jpmml-converter

Java library for authoring PMML
GNU Affero General Public License v3.0
15 stars 4 forks source link

Remove the check for categorical values order #9

Closed ohmystack closed 4 years ago

ohmystack commented 6 years ago

According to the PMML docs ,

Categorical field values can only be tested for equality, while ordinal field values in addition have an order defined.

If the application algorithm for some class of models does not require any special treatment of ordinal fields, then they can be interpreted as categorical fields.

"categorical" doesn't require an order defined. Users should use "ordinal" field instead of "categorical" if they want the order. So, no need to check categorical field's order.

For example, in jpmml-spark here , the values' order is hardcoded. The hardcoded order [0, 1] may be different from the order given by spark ML [1, 0]. For categorical field, they should be the same thing.

ohmystack commented 6 years ago

@vruusmann Would you mind reviewing this?

vruusmann commented 6 years ago

No objections - this is a valid code change.

I'm more interested about how did you find it? Do you have a (reproducible-) use case, where JPMML-SparkML fails because of this sanity check?

ohmystack commented 6 years ago

env: spark 2.2.1 pipelineStages: VectorAssembler, StringIndexer, XGBoostEstimator (classification)

In the 2nd stage, StringIndexerConverter uses transformer.labels() to generate categories for the label column. code The array returned byStringIndexerModel.labels() is not sorted, can be [0, 1] or [1, 0]. Then, the field is encoded using this array.

In the 3rd stage, when encoding the schema, my code runs into the ContinuousFeature condition and the issue occurs.

Maybe my code has some problem. In the 3rd stage, the label should have been converted to CategoricalFeature already, instead of converting it from ContinuousFeature again. But if it happens, the code will break there.


Here is an example of returning [1, 0] in transformer.labels().

train.csv

LABEL,col1,col2
1,628,787
1,75,1794
1,444,14899
1,53,500
1,997,286
1,613,729
1,031,2245
1,02,5081
1,850,404
1,647,560
1,78,1517
1,75,4977
0,977,311
1,812,532
1,472,10822
0,570,2304
1,519,690
1,834,843
1,054,30
spark-shell

scala> val data = spark.sqlContext.read.format("csv").
     |   options(Map(
     |     "header" -> "true",
     |     "ignoreLeadingWhiteSpace" -> "true",
     |     "ignoreTrailingWhiteSpace" -> "true",
     |     "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
     |     "inferSchema" -> "true",
     |     "mode" -> "FAILFAST")).
     |   load("train.csv")

scala> val labelColName = "LABEL"

scala> import org.apache.spark.ml.feature.StringIndexer

scala> val labelIndexer = new StringIndexer().setInputCol(labelColName).setOutputCol("label")

scala> val labelIndexerModel = labelIndexer.fit(data)
labelIndexerModel: org.apache.spark.ml.feature.StringIndexerModel = strIdx_e56af3540e9d

scala> labelIndexerModel.labels
res0: Array[String] = Array(1, 0)       // <- Here's the [1, 0] values
vruusmann commented 6 years ago

The problem is that StringIndexerModel#labels() returns labels in the order of "popularity". In most datasets the 0 label is more popular than the 1 label (eg. there are many more non-fraud cases than there are fraud cases), so the labels are ordered [0, 1]. However, your dataset appears to have many more 1 labels than 0 labels, so the labels are ordered [1, 0], and this sanity check fails.

This issue needs more investigation. The order of target labels is critical, because it determines the encoding of model objects. For example, LogisticRegressionModel performs the computation relative to the second target category, so if we pass "unexpected" ordering of labels, then we will get a misbehaving logistic regression model (will predict the probability 0 class where the probability of 1 class is expected, and vice versa).

ohmystack commented 6 years ago

Agree to have more investigation.

vruusmann commented 6 years ago

Related to https://github.com/jpmml/jpmml-sparkml/issues/14

There should be ContinuousDomain and CategoricalDomain meta-transforrmer classes that collect canonical label information.