jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Define valid categorical values #300

Closed wf-r closed 8 months ago

wf-r commented 2 years ago

Hi,

in PMML 4.4, one can define categorical values as "valid" by adding the corresponding property to the value element: http://dmg.org/pmml/v4-4/DataDictionary.html#xsdElement_Value

For categorical values, this results (if any value has this property) in all values not mentioned in the data field being regarded as invalid. Is there a possibility to do this in sklearn2pmml?

Currently (to my knowledge), a CategoricalDomain exports its values without the "valid" property. Switching to always exporting values with the "valid" property might break current implementations, as values not mentioned will then be invalid. Could one maybe add a boolean switch to CategoricalDomain, defaulting to the current behaviour?

If this is out of your current scope or time budget, I could try a merge request (although a few hints on where to look would make this easier).

Best, Wolfgang

vruusmann commented 2 years ago

> Currently (to my knowledge), a CategoricalDomain exports its values without the "valid" property.

The Value@property attribute is optional. If omitted (ie. not explicitly declared), it defaults to "valid" (see the ref that you linked).

So, the following two PMML fragments are functionally identical: <Value value="yes"/> and <Value value="yes" property="valid"/>. By default, all JPMML-family conversion libraries omit redundant attribute values. That's why you're not seeing Value@property="valid" generated.
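
As a rough illustration of that default, the sketch below parses both fragments with Python's standard xml.etree.ElementTree module and resolves the property attribute, falling back to "valid" when it is absent. The helper name effective_property is made up for this example; it is not part of any JPMML library.

```python
import xml.etree.ElementTree as ET

def effective_property(value_xml):
    """Return the Value@property of a PMML <Value> fragment, applying the schema default."""
    element = ET.fromstring(value_xml)
    # The attribute is optional; an omitted attribute is read as "valid"
    return element.get("property", "valid")

implicit = effective_property('<Value value="yes"/>')
explicit = effective_property('<Value value="yes" property="valid"/>')
print(implicit == explicit)
```

Both calls resolve to the same property, which is why a converter can drop the explicit attribute without changing the meaning of the document.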

> For categorical values, this results (if any value has this property) in all values not mentioned in the data field being regarded as invalid. Is there a possibility to do this in sklearn2pmml?

See the Domain.with_data attribute. If it is set to True, the Domain.fit(X) method collects all unique values in the training dataset and records them as Value elements; everything that is not listed is treated as an invalid value during Domain.transform(X) and PMML scoring.

If the Domain.with_data attribute is set to False, then nothing is collected and recorded, and all category values are accepted everywhere.
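
The difference between the two settings can be pictured with a small stand-in class. This is only a sketch of the semantics described above, not the sklearn2pmml implementation: ToyCategoricalDomain and its data_values_ attribute are hypothetical names, and the toy always raises an error on an unlisted value, which is a simplifying assumption.

```python
class ToyCategoricalDomain:
    """Hypothetical stand-in for illustration; NOT the sklearn2pmml class."""

    def __init__(self, with_data=True):
        self.with_data = with_data
        self.data_values_ = None

    def fit(self, X):
        if self.with_data:
            # Record every unique value seen in the training data
            self.data_values_ = set(X)
        return self

    def transform(self, X):
        # Only check membership if values were recorded during fit
        if self.data_values_ is not None:
            invalid = [x for x in X if x not in self.data_values_]
            if invalid:
                raise ValueError("Invalid values: {}".format(invalid))
        return list(X)

strict = ToyCategoricalDomain(with_data=True).fit(["red", "green"])
lenient = ToyCategoricalDomain(with_data=False).fit(["red", "green"])

print(lenient.transform(["blue"]))  # accepted, because nothing was recorded
try:
    strict.transform(["blue"])
except ValueError as error:
    print(error)  # "blue" did not appear in the training data
```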

vruusmann commented 2 years ago

The crux of this issue - should the CategoricalDomain constructor accept a data parameter, which specifies the complete valid value space?

Probably yes, because this would make it possible to "enable" category levels that might not appear in the training dataset.

vruusmann commented 2 years ago

@wf-r FYI, there are sklearn2pmml.decoration.ContinuousDomainEraser and sklearn2pmml.decoration.DiscreteDomainEraser decorator classes available, which allow you to delete the complete collection of <Value property="valid"/> elements from any categorical field at any point in your Scikit-Learn pipeline.

For example, some categorical column encoders (maybe it was OneHotEncoder?) populate DataField/Value elements automatically. You can get rid of those like this:

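from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain, DiscreteDomainEraser

# The eraser comes last, so it removes the Value elements that the encoder added
mapper = DataFrameMapper([
  ("color", [CategoricalDomain(with_data = False), OneHotEncoder(), DiscreteDomainEraser()])
])

vruusmann commented 2 years ago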

@wf-r If you have general questions about feature decorators (what's the point of X? How to best do Y?), then we may also discuss them in the Openscoring.io Blog comments here: https://openscoring.io/blog/2020/02/23/sklearn_feature_specification_pmml/

wf-r commented 2 years ago

@vruusmann Thank you for your detailed reply. As it turns out, I had misunderstood the documentation and thought that at least one value had to be marked as valid explicitly (and I thought I remembered my tests indicating the same).

It turns out that for my current usage, the behaviour as it is does suffice.

> The crux of this issue - should the CategoricalDomain constructor accept a data parameter, which specifies the complete valid value space?
>
> Probably yes, because this would make it possible to "enable" category levels that might not appear in the training dataset.

I agree that this would be reasonable. Currently I manually adjust the PMML file after its export (which works for me), however your suggestion would be great.
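
A post-export adjustment of this kind might look roughly like the sketch below, which uses only Python's standard xml.etree.ElementTree module. The function name add_category_levels, the sample document, and the assumption that the file uses the PMML 4.4 namespace are all illustrative; none of this is sklearn2pmml functionality.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"  # assumed: the exported file is PMML 4.4
ET.register_namespace("", PMML_NS)  # keep the default namespace when serializing

def add_category_levels(pmml_text, field_name, levels):
    """Append <Value> elements for levels missing from the named DataField."""
    root = ET.fromstring(pmml_text)
    for field in root.iter("{%s}DataField" % PMML_NS):
        if field.get("name") != field_name:
            continue
        existing = {value.get("value") for value in field.findall("{%s}Value" % PMML_NS)}
        for level in levels:
            if level not in existing:
                # No property attribute is written, so the level counts as valid
                ET.SubElement(field, "{%s}Value" % PMML_NS, {"value": level})
    return ET.tostring(root, encoding="unicode")

example = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary>
    <DataField name="color" optype="categorical" dataType="string">
      <Value value="red"/>
    </DataField>
  </DataDictionary>
</PMML>"""

patched = add_category_levels(example, "color", ["red", "blue"])
print(patched)
```

Levels that are already listed are left alone, so running the function twice on the same file does not duplicate Value elements.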

I will close this for now, and leave it up to your priorities whether a data parameter for the constructor is relevant.

vruusmann commented 2 years ago

Reopening. I'll close it with an on-topic commit (adding support for the Domain.data attribute) some nice day. This change would affect the Python side only, so it won't be much work.