jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

MultiDomain Expression Transformer default value #189

Closed nhawrylyshyn closed 10 months ago

nhawrylyshyn commented 10 months ago

Hi, I followed the examples here: https://openscoring.io/blog/2020/02/23/sklearn_feature_specification_pmml to create four continuous domain features, plus a fifth feature defined by an arbitrarily chosen expression: the ratio of the first and second columns. Things work when all values are well defined. However, when I modify the dataset to contain None or undefined values, the ExpressionTransformer fails with "ValueError: Input contains NaN, infinity or a value too large for dtype('float32')." (example below).

How can I control missing or erroneous values in the ExpressionTransformer block? That is, I would like either the missing value replacement from the numeric domain mapper to be applied, or to be able to set missing_value_replacement on the transformer itself. Is this possible?

Thank you for your help.

-NH

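from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import Alias, ContinuousDomain, MultiDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import ExpressionTransformer
import pandas as pd

iris = load_iris()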
X = pd.DataFrame(iris.data, columns=iris.feature_names)
X.iloc[0, 0] = None
numeric_columns = X.columns
y = iris.target

numeric_mapper_domain = [
    (
        [numeric_column],
        ContinuousDomain(missing_value_treatment="as_value", invalid_value_treatment="as_missing", missing_value_replacement=-999)
    )
    for numeric_column in numeric_columns
]

# https://openscoring.io/blog/2020/02/23/sklearn_feature_specification_pmml/
numeric_mapper_domain.append(
    (
        ['sepal length (cm)', 'sepal width (cm)'],
        [
            MultiDomain([None, None]),
            Alias(ExpressionTransformer('24 * X[0]/(X[1]+0.0000001)'), 'R0_1')
        ]
    )
)

# Create a PMMLPipeline
pmml_pipeline = PMMLPipeline(
    [
        ("mapper", DataFrameMapper(numeric_mapper_domain)),
        ("classifier", DecisionTreeClassifier()) # as an example
    ]
)

pmml_pipeline.target_fields = ["target"]
pmml_pipeline.fit(X, y)

vruusmann commented 10 months ago

> I followed the examples here: https://openscoring.io/blog/2020/02/23/sklearn_feature_specification_pmml

If you have a GitHub account, you could also ask your question(s) in the blog's "feedback" section.

This particular issue would be a very good fit there - adding more explanations/code examples about a specific functionality.

Anyway, the primary intent of the MultiDomain decorator is to allow you to perform decoration on a mixed list of categorical and continuous features. If you have only continuous features, then you can use good old ContinuousDomain as-is.

Please note that ContinuousDomain has multi-column support, whereas CategoricalDomain does not. If you need to feed multiple categorical features to an ExpressionTransformer, then you can bind/reorder elementary categorical decorators together using MultiDomain.

> How can I control missing value / erroneous values in the ExpressionTransformer block

Domain decorator classes are about capturing the domain of input features. They are not intended for performing additional transformations (such as missing or invalid value replacement) on already transformed features.

You should check out ExpressionTransformer.map_missing_to and ExpressionTransformer.default_value attributes, which correspond to Apply@mapMissingTo and Apply@defaultValue attributes, respectively: https://dmg.org/pmml/v4-4-1/Functions.html#xsdElement_Apply

See the "Output table for Apply" sub-section on the referenced page.
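
One way to read that output table is the plain-Python sketch below. It does not use sklearn2pmml or any PMML engine: `apply_with_fallbacks`, `is_missing` and `ratio` are hypothetical helper names, and treating a division-by-zero failure as a missing result is a simplifying assumption based on the description in this thread. The referenced specification remains the authority on exact behavior.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption made for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def apply_with_fallbacks(func, inputs, map_missing_to=None, default_value=None):
    """Hypothetical model of the two fallbacks; not the sklearn2pmml API."""
    has_missing_input = any(is_missing(v) for v in inputs)

    # Fallback 1: a missing input short-circuits to map_missing_to, if set.
    if has_missing_input and map_missing_to is not None:
        return map_missing_to

    # Otherwise evaluate. Assumption: a missing input or a failed
    # evaluation (here, division by zero) yields a missing result.
    if has_missing_input:
        result = None
    else:
        try:
            result = func(*inputs)
        except ZeroDivisionError:
            result = None

    # Fallback 2: a missing result is replaced by default_value, if set.
    if is_missing(result) and default_value is not None:
        return default_value
    return result


def ratio(x0, x1):
    return x0 / x1


print(apply_with_fallbacks(ratio, [6.0, 3.0], map_missing_to=-1, default_value=-2))   # 2.0
print(apply_with_fallbacks(ratio, [None, 3.0], map_missing_to=-1, default_value=-2))  # -1
print(apply_with_fallbacks(ratio, [6.0, 0.0], map_missing_to=-1, default_value=-2))   # -2
```

Using distinct codes for the two fallbacks, as in the -1 and -2 values above, keeps "an input was missing" distinguishable from "the expression could not be evaluated" in the downstream model.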

> I would like to be able to set missing_value_replacement on it

transformer = ExpressionTransformer('X[0] / (X[1] + 0.0000001)', map_missing_to = -1)

Your current expression "defends" against division-by-zero errors by adding a small constant (0.0000001) to the denominator.

You can get rid of it, and map all division-by-zero errors to a specific error code:

transformer = ExpressionTransformer('X[0] / X[1]', default_value = -2)
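
To illustrate why an explicit error code can be preferable to the small-constant guard, here is a plain-Python comparison. It is only a sketch of the arithmetic, not of how sklearn2pmml evaluates expressions: `guarded_ratio` and `sentinel_ratio` are hypothetical helpers, and the sentinel -2 is simply the value used in the reply above.

```python
def guarded_ratio(x0, x1, eps=0.0000001):
    """The original approach: nudge the denominator away from zero."""
    return x0 / (x1 + eps)


def sentinel_ratio(x0, x1, default_value=-2):
    """Plain division, with failures mapped to an explicit error code."""
    try:
        return x0 / x1
    except ZeroDivisionError:
        return default_value


# A zero denominator passes through the guard silently, producing a very
# large number that looks like a legitimate feature value.
print(guarded_ratio(5.1, 0.0))

# The guard also fails outright when the denominator equals -eps.
try:
    guarded_ratio(5.1, -0.0000001)
except ZeroDivisionError:
    print("guard defeated: denominator cancelled to zero")

# The sentinel version reports the failure as a recognizable code.
print(sentinel_ratio(5.1, 0.0))  # -2
```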