jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Category Encoders handle_unknown="value" #168

Closed nhawrylyshyn closed 1 year ago

nhawrylyshyn commented 2 years ago

Hi,

I've followed the examples for the category_encoders package at the following link (https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/extensions/category_encoders.py). If I take those examples and change handle_unknown = "error" to handle_unknown = "value", as in the code below, then sklearn2pmml fails with the error shown at the bottom. Is it intended that only handle_unknown = "error" is supported for the category_encoders package? If so, are there workarounds for passing unseen values to the pipeline that won't result in an error when using category_encoders?

classifier = XGBClassifier(**params_to_fit)

cont_encoder = "passthrough"

cat_encoder = CatBoostEncoder(handle_missing = "value",
                              handle_unknown = "value",
                              random_state = 0,
                              a = 1)

steps = [
    ("mapper", ColumnTransformer([ ("cat", cat_encoder, cat_cols), ("cont", cont_encoder, cont_cols) ])),
    ("classifier", classifier)
]

pipeline = PMMLPipeline(steps)
pipeline.fit(X_train, y_train)

sklearn2pmml(pipeline, "out.pmml", with_repr = True, debug=True)

python: 3.8.10
sklearn: 1.0
sklearn2pmml: 0.74.4
joblib: 1.0.1
sklearn_pandas: 2.2.0
pandas: 1.2.5
numpy: 1.19.5
openjdk: 1.8.0_292

Standard output is empty
Standard error:
Sep 29, 2021 8:47:56 PM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Sep 29, 2021 8:47:56 PM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 174 ms.
Sep 29, 2021 8:47:56 PM org.jpmml.sklearn.Main run
INFO: Converting PKL to PMML..
Sep 29, 2021 8:47:56 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert PKL to PMML
java.lang.IllegalArgumentException: value
    at category_encoders.MeanEncoder.encodeFeatures(MeanEncoder.java:80)
    at sklearn.Transformer.encode(Transformer.java:70)
    at sklearn.compose.ColumnTransformer.encodeFeatures(ColumnTransformer.java:63)
    at sklearn.Transformer.encode(Transformer.java:70)
    at sklearn.Composite.encodeFeatures(Composite.java:119)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:211)
    at org.jpmml.sklearn.Main.run(Main.java:217)
    at org.jpmml.sklearn.Main.main(Main.java:143)

Exception in thread "main" java.lang.IllegalArgumentException: value
    at category_encoders.MeanEncoder.encodeFeatures(MeanEncoder.java:80)
    at sklearn.Transformer.encode(Transformer.java:70)
    at sklearn.compose.ColumnTransformer.encodeFeatures(ColumnTransformer.java:63)
    at sklearn.Transformer.encode(Transformer.java:70)
    at sklearn.Composite.encodeFeatures(Composite.java:119)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:211)
    at org.jpmml.sklearn.Main.run(Main.java:217)
    at org.jpmml.sklearn.Main.main(Main.java:143)

vruusmann commented 2 years ago

> Is this behavior intended to only allow handle_unknown='error' for category encoders package.

It's intended to be this way, because it's not worth the time/effort to do it any other way.

The general idea is that "why would a transformer somewhere in the middle of your pipeline need to worry about invalid (aka unknown) values? How can invalid values even get that far, why weren't they caught and handled much-much earlier?"

> If so, are there work arounds for passing unseen values to the pipeline which won't result in an error when using category_encoders.

The SkLearn2PMML package provides so-called "domain decorator" classes, which let you make assertions about data as it enters your pipeline: https://github.com/jpmml/sklearn2pmml/blob/0.74.4/sklearn2pmml/decoration/__init__.py

Code related to invalid value handling: https://github.com/jpmml/sklearn2pmml/blob/0.74.4/sklearn2pmml/decoration/__init__.py#L52-L61 https://github.com/jpmml/sklearn2pmml/blob/0.74.4/sklearn2pmml/decoration/__init__.py#L102-L112

More background here: Extending Scikit-Learn with feature specifications

Even more background here (search for "missing" and "invalid" keywords): http://dmg.org/pmml/v4-4-1/MiningSchema.html
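For readers unfamiliar with the PMML terminology, the four invalid value treatments that a domain decorator can request can be sketched in plain Python. This is only an illustration of the semantics described in the MiningSchema specification, not the SkLearn2PMML implementation: the function name treat_invalid and its signature are made up for this sketch, and the treatment names follow the snake_case spelling used in this thread ("as_value" etc.).

```python
import math


def treat_invalid(value, valid_values, treatment="return_invalid", replacement=None):
    """Hypothetical helper: apply a PMML-style invalid value treatment to one value.

    A value is "invalid" when it is non-missing and was not among the
    values seen during training (valid_values).
    """
    # Missing values are governed by a separate (missing value) treatment.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return value
    if value in valid_values:
        return value
    if treatment == "return_invalid":
        # Refuse to score the record at all.
        raise ValueError(f"Invalid value: {value!r}")
    if treatment == "as_is":
        # Let the value through unchanged; downstream steps must cope with it.
        return value
    if treatment == "as_missing":
        # Downgrade the invalid value to a missing value.
        return None
    if treatment == "as_value":
        # Substitute a fixed, known-good replacement value.
        return replacement
    raise ValueError(f"Unknown treatment: {treatment!r}")


seen = {"one", "two"}
print(treat_invalid("two", seen))                       # valid values pass through
print(treat_invalid("three", seen, "as_value", "one"))  # replaced before the encoder sees it
```

With "as_value", an encoder placed after the decorator only ever receives categories it was fitted on, which is why it can keep handle_unknown = "error".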

My code examples typically use domain decorators such as CategoricalDomain, ContinuousDomain, etc. (OK, they're omitted in the 'category_examples' case, because otherwise missing and invalid values would be unable to enter the pipeline for integration testing purposes). Your code example doesn't, but should.

TLDR: The "PMML way" of handling invalid (aka unknown) values:

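from category_encoders import CatBoostEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

# Replace invalid (unseen) values with a known value before they reach the encoder.
# Note that make_pipeline takes its steps as positional arguments, not as a list.
cat_encoder = make_pipeline(CategoricalDomain(invalid_value_treatment = "as_value", invalid_value_replacement = "one"), CatBoostEncoder())
cont_encoder = make_pipeline(ContinuousDomain(invalid_value_treatment = "as_value", invalid_value_replacement = 1))

mapper = ColumnTransformer([
  ("cat", cat_encoder, cat_cols),
  ("cont", cont_encoder, cont_cols)
])

vruusmann commented 2 years ago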

Let's keep this issue open for some time.

If nothing else, then the converter code for category_encoders should raise a more informative error stating that "please handle invalid values before this step, for example, using domain decorators".
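Until the converter reports this more clearly, a user-side check along these lines could catch the problem before sklearn2pmml is called. This is a minimal sketch, not part of SkLearn2PMML or JPMML-SkLearn: the function check_handle_unknown and the exception class are hypothetical, and the assumption that only handle_unknown = "error" converts cleanly reflects the versions discussed in this thread.

```python
from types import SimpleNamespace


class UnsupportedEncoderConfigError(ValueError):
    """Hypothetical error type for encoder settings the converter rejects."""


def check_handle_unknown(step_name, encoder):
    """Raise an informative error if an encoder is not set to handle_unknown="error".

    Objects without a handle_unknown attribute are ignored.
    """
    handle_unknown = getattr(encoder, "handle_unknown", None)
    if handle_unknown not in (None, "error"):
        raise UnsupportedEncoderConfigError(
            f"Step {step_name!r} uses handle_unknown={handle_unknown!r}; "
            "please handle invalid values before this step, "
            "for example, using domain decorators"
        )


# Stand-in for a fitted category_encoders transformer (assumption: the real
# encoders expose the constructor argument as a handle_unknown attribute).
encoder = SimpleNamespace(handle_unknown="value")

try:
    check_handle_unknown("cat", encoder)
except UnsupportedEncoderConfigError as error:
    print(error)
```

The check only inspects one attribute, so it is cheap to run over every step of a pipeline before starting the Java converter.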

vruusmann commented 1 year ago

This issue was addressed between JPMML-SkLearn versions 1.6.33 and 1.6.34 (December 2021).