jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Support for `OrdinalEncoder.handle_unknown` attribute #162

Closed: kornilcdima closed this issue 3 years ago

kornilcdima commented 3 years ago

Hello @vruusmann,

I have run into a problem. When I convert a sklearn pipeline to a PMML pipeline, the handling of unknown values for categorical columns disappears. Do you know how to deal with this? I would appreciate any help.

sklearn-pipeline:

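from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy="constant", missing_values="-1")),
    ('oe', OrdinalEncoder(handle_unknown='use_encoded_value',
                          unknown_value=-1))])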

When I look at the PMML file, I don't see any handling of unknown values:

<DerivedField name="encoder(Netspeed)" optype="categorical" dataType="double">
    <MapValues outputColumn="data:output">
        <FieldColumnPair field="Netspeed" column="data:input"/>
        <InlineTable>
            <row>
                <data:input>CABLE_DSL</data:input>
                <data:output>0.0</data:output>
            </row>
            <row>
                <data:input>CELLULAR</data:input>
                <data:output>1.0</data:output>
            </row>
        </InlineTable>
    </MapValues>
</DerivedField>

Everything works as expected in Python, but not in Java (jpmml-pipeline). On Java it throws an error when a new value appears: org.jpmml.evaluator.InvalidResultException: Field "Netspeed" cannot accept user input value "new_value"
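
For reference, the Python-side behaviour described above can be reproduced with a minimal sketch. It assumes scikit-learn 0.24 or newer (where `handle_unknown='use_encoded_value'` is available) and reuses the two `Netspeed` categories from the PMML excerpt; the toy arrays are illustrative, not the reporter's data.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Fit on the two categories that appear in the PMML InlineTable above.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(np.array([["CABLE_DSL"], ["CELLULAR"]], dtype=object))

# A category never seen during fit is encoded as unknown_value, not rejected.
encoded = encoder.transform(np.array([["CELLULAR"], ["new_value"]], dtype=object))
print(encoded.ravel().tolist())
```

The second row is where the two runtimes diverge: scikit-learn maps `new_value` to `-1`, while the generated PMML has no mapping for it.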

vruusmann commented 3 years ago

> On Java it throws an error when a new value appears: org.jpmml.evaluator.InvalidResultException: Field "Netspeed" cannot accept user input value "new_value"

This invalid value check is performed in relation to DataField elements (primary input). You are generating DerivedField elements (secondary input).

It is assumed that only the primary input data can be invalid (e.g. the CSV cell contains a misspelled value). It does not make sense to generate invalid values intentionally during data pre-processing.

vruusmann commented 3 years ago

Once again, what kind of OrdinalEncoder class are you using here? Is it sklearn.preprocessing.OrdinalEncoder or category_encoders.OrdinalEncoder?

kornilcdima commented 3 years ago

> Once again, what kind of OrdinalEncoder class are you using here? Is it sklearn.preprocessing.OrdinalEncoder or category_encoders.OrdinalEncoder?

sklearn.preprocessing.OrdinalEncoder

kornilcdima commented 3 years ago

> On Java it throws an error when a new value appears: org.jpmml.evaluator.InvalidResultException: Field "Netspeed" cannot accept user input value "new_value"
>
> This invalid value check is performed in relation to DataField elements (primary input). You are generating DerivedField elements (secondary input).
>
> It is assumed that only the primary input data can be invalid (e.g. the CSV cell contains a misspelled value). It does not make sense to generate invalid values intentionally during data pre-processing.

Sorry, I don't follow. In my case the PMML pipeline receives a column that contains new values (values the model has never seen before), and the PMML pipeline breaks. As I understand it, it shouldn't. The expected behavior is that PMML replaces unknown values with -1. When I do the same thing in Python, everything works, but that is not the case with PMML.

Other fields of the PMML file:

<MiningField name="Netspeed" importance="52.0" missingValueReplacement="missing_value" missingValueTreatment="asValue"/>

<DataField name="Netspeed" optype="categorical" dataType="string">
    <Value value="-1" property="missing"/>
    <Value value="CABLE_DSL"/>
    <Value value="CELLULAR"/>
</DataField>

kornilcdima commented 3 years ago

I guess I should get something like this?

<MiningField name="Netspeed" importance="59.0" missingValueReplacement="Unknown" missingValueTreatment="asIs" invalidValueTreatment="asMissing"/>

I can only get this with the following preprocessing:

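import lightgbm as lgb
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import PMMLLabelEncoder

# cat_cols, num_cols, m, X and y are defined elsewhere in my code.

# Categorical columns: treat invalid (unseen) values as missing, then encode.
cat_transformer = PMMLPipeline(steps=[
    ('cd', CategoricalDomain(invalid_value_treatment="as_missing",
                             missing_value_replacement="Unknown")),
    ('oe', PMMLLabelEncoder(missing_values=-1))
])

# Numeric columns: impute missing values with the median.
num_transformer = PMMLPipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
])

transformers = [(f'{i}_tr', cat_transformer, [i]) for i in cat_cols] \
    + [(f'{i}_tr', num_transformer, [i]) for i in num_cols]

model = lgb.LGBMRegressor(**m.model_hyperprams)
preprocessor = ColumnTransformer(transformers=transformers, remainder='passthrough')

pipeline = PMMLPipeline([
    ('preprocessor', preprocessor),
    ('model', model)
])
pipeline.fit(X, y)

vruusmann commented 3 years ago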

> Sorry, I don't follow.

TLDR: You should perform the validity check just once, when the data enters the pipeline. There is no point in accepting invalid data, performing a transformation on it (e.g. ordinal encoding) and then checking whether the transformation result is valid or not.

> And the PMML-pipeline breaks. However, it shouldn't as I understand.

The JPMML-SkLearn library is not taking the OrdinalEncoder.handle_unknown attribute into consideration right now.

At minimum, it should throw an IllegalArgumentException just to inform you that this piece of Scikit-Learn functionality cannot be used.

> The expected behavior is that PMML replaces unknown values with -1.

Feel free to submit a PR.

> I guess I should get something like this?

You clearly do not understand the difference between missing values and invalid values. Please consult the basics here: http://dmg.org/pmml/v4-3/MiningSchema.html
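
The distinction between missing and invalid values can be illustrated with a small, self-contained sketch. This is not JPMML code: `prepare_input` is a hypothetical helper that loosely models how a PMML consumer might apply `invalidValueTreatment` and `missingValueReplacement` to a primary input field, using the `Netspeed` value space and the `"-1"` missing marker from the DataField quoted earlier in this thread.

```python
def prepare_input(value, valid_values, missing_markers=(None, "-1"),
                  invalid_value_treatment="returnInvalid",
                  missing_value_replacement=None):
    """Hypothetical model of PMML input preparation for one field."""
    # A value is *missing* if it is absent or equals a declared missing marker.
    is_missing = value in missing_markers
    # A value is *invalid* if it is present but outside the declared value space.
    if not is_missing and value not in valid_values:
        if invalid_value_treatment == "returnInvalid":
            raise ValueError(f"Field cannot accept user input value {value!r}")
        if invalid_value_treatment == "asMissing":
            is_missing = True  # from here on, handled like a missing value
        elif invalid_value_treatment != "asIs":
            raise ValueError(f"Unsupported treatment: {invalid_value_treatment}")
        # "asIs": the invalid value is passed through unchanged
    if is_missing:
        return missing_value_replacement
    return value


valid = {"CABLE_DSL", "CELLULAR"}

known = prepare_input("CELLULAR", valid)
missing = prepare_input("-1", valid, missing_value_replacement="Unknown")
unseen = prepare_input("new_value", valid,
                       invalid_value_treatment="asMissing",
                       missing_value_replacement="Unknown")
```

With the default `returnInvalid` treatment, the same call for `"new_value"` raises an error, which mirrors the exception reported at the top of this thread. The check runs once on the primary input, before any encoding step sees the value.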

vruusmann commented 3 years ago

This issue looks like a duplicate of https://github.com/jpmml/sklearn2pmml/issues/289

@kornilcdima If you want me to pay attention to something, then you can't just open/close/open issues randomly. This is your last warning.

kornilcdima commented 3 years ago

> This issue looks like a duplicate of jpmml/sklearn2pmml#289
>
> @kornilcdima If you want me to pay attention to something, then you can't just open/close/open issues randomly. This is your last warning.

Thank you for your time and quick answers. This issue is not a duplicate. Maybe it looks like one because it concerns the same part of the code as my earlier question. But my latest question relates to a different problem, which I unfortunately failed to solve. In the previous issue I was asking about handling missing values in the input, and it was not directly connected with the encoder. Of course you are the moderator here, but I wouldn't give these two topics the same title.