jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

`ExpressionTransformer` should try to rectify feature type information #397

Open woodly0 opened 9 months ago

woodly0 commented 9 months ago

Hello Villu,

It's been a while and I hope you're fine. I've come back with more questions. Let's start with some code:

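import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import ExpressionTransformer, MatchesTransformer

# create some data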
X = pd.DataFrame(
    {
        "numbers": [1, 2, 3, 40, 5],
        "colors": ["yellow ", "blue", "BLACK", "green", "red"],
    }
)

# create a simple mapper
mapper = DataFrameMapper(
    [
        (
            ["colors"],
            [
                # CategoricalDomain(dtype=str),
                ExpressionTransformer("X[0].lower()"),
                MatchesTransformer("green"),
            ],
            {"alias": "color_green"},
        )
    ],
    df_out=True,
    default=False,
)

The following pipeline doesn't make much sense from a machine learning point of view, but it shows the issue very well:

pmml_pipe = PMMLPipeline(
    [
        ("mapper", mapper)
    ]
)
# fit and transform
pmml_pipe.fit_transform(X)

# export as PMML
sklearn2pmml(pmml_pipe, "output.pmml", with_repr=True)

In Python, everything works as expected. Now the issue is within the generated output.pmml file, where you can find the following:

<DataDictionary>
    <DataField name="colors" optype="continuous" dataType="double"/>
</DataDictionary>

Knowing that the input has an infinite number of possible values, how can I set this data type to "string"?
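
The field types in a generated file can also be checked programmatically instead of by opening it by hand. The following is a minimal sketch using only the Python standard library; `data_fields` is a hypothetical helper name, not part of SkLearn2PMML, and it matches elements by local name on the assumption that the file may or may not carry a PMML namespace.

```python
import xml.etree.ElementTree as ET


def data_fields(pmml_text):
    """Return {field name: (optype, dataType)} for every DataField element."""
    root = ET.fromstring(pmml_text)
    fields = {}
    for elem in root.iter():
        # Compare local names only, so any PMML namespace version is accepted.
        if elem.tag.rsplit("}", 1)[-1] == "DataField":
            fields[elem.get("name")] = (elem.get("optype"), elem.get("dataType"))
    return fields


# The DataDictionary fragment from the report, used here as inline sample input.
sample = """
<DataDictionary>
    <DataField name="colors" optype="continuous" dataType="double"/>
</DataDictionary>
"""

print(data_fields(sample))  # {'colors': ('continuous', 'double')}
```

For a real export, the same function could be given the contents of output.pmml, for example read in binary mode with `open("output.pmml", "rb").read()`.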

vruusmann commented 9 months ago

It's pretty unusual to start the pipeline with an ExpressionTransformer object. There should be some "clarifications" in front of it.

The simplest way to make a clarification is to provide a feature specification using one of the SkLearn2PMML decorators (e.g. sklearn2pmml.decoration.CategoricalDomain, OrdinalDomain or ContinuousDomain).

You already have CategoricalDomain in place, but have it commented out. You probably didn't like that it captured the valid value space of your X dataset:

<DataDictionary>
    <DataField name="colors" optype="categorical" dataType="string">
        <Value value="BLACK"/>
        <Value value="blue"/>
        <Value value="green"/>
        <Value value="red"/>
        <Value value="yellow "/>
    </DataField>
</DataDictionary>

Well, if you don't like the valid value space information, then simply disable it using the with_data = False flag:

color_transformers = [
    # THIS!
    CategoricalDomain(dtype = str, with_data = False),
    ExpressionTransformer("X[0].lower()"),
    MatchesTransformer("green"),
]

vruusmann commented 9 months ago

> It's pretty unusual to start the pipeline with an ExpressionTransformer object. There should be some "clarifications" in front of it.

The ExpressionTransformer can triangulate its position in the pipeline by observing if there are any wildcard features (ie. org.jpmml.converter.WildcardFeature objects) among the arguments.

If there are, then it should make an effort to rectify their types. For example, if there are string methods being called on a wildcard feature, then it's reasonable to assume that the type of this feature should be categorical+string (instead of continuous+double).

It is likely that such type rectification should happen during the expression parsing phase, which means that the code change should land in the JPMML-Python library instead.
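
The heuristic described above can be illustrated on the Python side with the standard `ast` module. This is only a sketch of the idea, not the JPMML-Python implementation (which is written in Java): `string_columns` and the `STRING_METHODS` set are hypothetical names chosen for this example, and the method list is an assumption rather than the set of methods the converter actually supports.

```python
import ast

# Assumption: a small illustrative set of str methods, not the converter's real list.
STRING_METHODS = {"lower", "upper", "strip", "startswith", "endswith"}


def string_columns(expression):
    """Return the sorted indices i for which the expression calls a str method on X[i]."""
    found = set()
    for node in ast.walk(ast.parse(expression, mode="eval")):
        # Only calls of the form <something>.<method>(...) are of interest.
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr not in STRING_METHODS:
            continue
        target = node.func.value
        # The receiver must be a subscript of the name X, i.e. X[...].
        if not (isinstance(target, ast.Subscript)
                and isinstance(target.value, ast.Name)
                and target.value.id == "X"):
            continue
        index = target.slice  # Python 3.9+: the subscript is stored directly
        if isinstance(index, ast.Constant) and isinstance(index.value, int):
            found.add(index.value)
    return sorted(found)


print(string_columns("X[0].lower()"))  # [0]
print(string_columns("X[0] + X[1]"))   # []
```

A column index reported by such a check is a candidate for the categorical+string type; columns that never appear as the receiver of a string method would keep their default type.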

woodly0 commented 9 months ago

> Well, if you don't like the valid value space information, then simply disable it using the with_data = False flag

This was exactly what I was looking for. Thank you!

> The ExpressionTransformer can triangulate its position in the pipeline by observing if there are any wildcard features (ie. org.jpmml.converter.WildcardFeature objects) among the arguments.

So you are saying that it is still OK to start the pipeline with an ExpressionTransformer or should it be generally avoided?

vruusmann commented 9 months ago

> So you are saying that it is still OK to start the pipeline with an ExpressionTransformer or should it be generally avoided?

The ExpressionTransformer works best with non-wildcard features.

The easiest way to convert a wildcard feature (which has the continuous+double type) to a non-wildcard feature is to use SkLearn2PMML decorators.

vruusmann commented 9 months ago

Alternatively, the ExpressionTransformer should simply raise a value error when there are wildcard features among the arguments.

IMO, it's better to have the conversion fail, rather than to have it produce an invalid/incomplete PMML document.
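
Until the converter behaves that way, a similar fail-fast check can be approximated on the user side before exporting. The sketch below is not a SkLearn2PMML feature: `check_domain_first` is a hypothetical helper, it identifies decorators purely by a class name ending in "Domain" (an assumption made so that the example runs without the library installed), and the two classes defined here are empty stand-ins for the real ones.

```python
def check_domain_first(features):
    """Raise ValueError if a column's transformer list does not start with a *Domain step.

    `features` mirrors the DataFrameMapper layout: (columns, transformers, ...) tuples.
    """
    for columns, transformers, *_ in features:
        # DataFrameMapper also accepts a single transformer instead of a list.
        if not isinstance(transformers, (list, tuple)):
            transformers = [transformers]
        first = transformers[0] if transformers else None
        # Duck-typed check by class name (assumption, see the note above).
        if not type(first).__name__.endswith("Domain"):
            raise ValueError(
                f"Columns {columns!r}: expected a domain decorator as the first step, "
                f"got {type(first).__name__}"
            )


# Empty stand-ins for the real classes, for demonstration only.
class CategoricalDomain:
    pass


class ExpressionTransformer:
    pass


good = [(["colors"], [CategoricalDomain(), ExpressionTransformer()], {"alias": "color_green"})]
bad = [(["colors"], [ExpressionTransformer()], {"alias": "color_green"})]

check_domain_first(good)  # passes silently
try:
    check_domain_first(bad)
except ValueError as err:
    print(err)
```

The design choice is the same as the one argued for above: stopping early with a clear message is preferable to exporting a document whose field types are silently wrong.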