jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0
686 stars · 113 forks

How to reference categorical input feature(s) during the post-processing in PMML pipeline #425

Closed andru98 closed 4 months ago

andru98 commented 4 months ago

I am able to access the attributes (post-processing), but I am getting wrong predictions when I load the model from the PMML/XML file. Here is the code:

```python
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.cross_reference import Memory, make_memorizer_union, make_recaller_union
from sklearn2pmml.decoration import ContinuousDomain, CategoricalDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import ExpressionTransformer

df = sns.load_dataset('tips')

X = df.drop('tip', axis=1)
y = df['tip']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

mapper = DataFrameMapper([
    (['total_bill'], ContinuousDomain()),
    (['sex'], [CategoricalDomain(), LabelEncoder()]),
    (['smoker'], [CategoricalDomain(), LabelEncoder()]),
    (['day'], [CategoricalDomain(), OrdinalEncoder()]),
    (['time'], [CategoricalDomain(), OrdinalEncoder()]),
    (['size'], [CategoricalDomain(), OrdinalEncoder()]),
], df_out=True)

memory = Memory()
pipeline = PMMLPipeline([
    ('mapper', mapper),
    ('memorizer', make_memorizer_union(memory=memory, names=X.columns.values.tolist())),
    ('regressor', LinearRegression())
], predict_transformer=Pipeline([
    ('recaller1', make_recaller_union(memory=memory, names=X.columns.values.tolist(), position="first")),
    # if the day is Sunday (ordinal-encoded as 2), override the predicted tip with 5
    ('day_filter', ExpressionTransformer("5 if X[3]==2 else X[-1]"))
]))
pipeline.fit(X_train, y_train)
```

```python
# yt has two prediction columns; the second column applies the Sunday
# override and matches what we expected
yt = pipeline.predict_transform(X_test)

yt[:5]
```

```
[[4.1326351851513845, 5],
 [3.705352817178983, 3.705352817178983],
 [2.9473916039026093, 2.9473916039026093],
 [4.2381344500335825, 5],
 [2.8030321914896192, 5]]
```

```python
# Save the model
sklearn2pmml(pipeline, "tips_model.pmml")

# Load the model
from jpmml_evaluator import make_evaluator
evaluator = make_evaluator("tips_model.pmml").verify()

# The issue arises here - when loading the model and making predictions,
# I don't get the expected result
result = evaluator.evaluateAll(X_test)

result[:5]
```

```
          tip  predict(tip)  eval(5 if X[3]==2 else X[-1])
173  4.132635      4.132635                       4.132635
240  3.705353      3.705353                       3.705353
243  2.947392      2.947392                       2.947392
175  4.238134      4.238134                       4.238134
162  2.803032      2.803032                       2.803032
```

Any suggestions on fixing this issue would be appreciated! Thanks in advance

vruusmann commented 4 months ago

Please let's:

Systematic violation of these elementary rules will get you banned from posting anything under the JPMML software project ever again.

vruusmann commented 4 months ago

Don't use excessive text styles in issue body

This was caused by not escaping the Python code block properly (using three backticks) - all Python code comments got promoted to heading styles.

vruusmann commented 4 months ago

```python
# changed the tip to 5 when sunday then tip should be five
ExpressionTransformer("5 if X[3]==2 else X[-1]")
```

If you open the PMML file in a text editor, you'll see that the following PMML markup has been generated:

```xml
<OutputField name="eval(5 if X[3]==2 else X[-1])" optype="continuous" dataType="double" feature="transformedValue">
    <Apply function="if">
        <Apply function="equal">
            <FieldRef field="day"/>
            <Constant dataType="integer">2</Constant>
        </Apply>
        <Constant dataType="integer">5</Constant>
        <FieldRef field="predict(tip)"/>
    </Apply>
</OutputField>
```

Pay attention - you are comparing a categorical string field "day" (one of Fri, Sat, Sun, Thur) to an integer literal 2. This comparison is always bound to evaluate to False.
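A quick pure-Python sketch of the mismatch (the encoded value 2 for "Sun" is an assumption carried over from the fitted OrdinalEncoder above; PMML's FieldRef resolves "day" to the raw string):

```python
day_raw = "Sun"    # what <FieldRef field="day"/> yields in PMML
day_encoded = 2.0  # what OrdinalEncoder produced inside the Python pipeline

# The exported expression compares the raw string field to an integer literal:
print(day_raw == 2)      # False -> the "if" branch is never taken in PMML
# In the in-memory Python run, the same index pointed at the encoded column:
print(day_encoded == 2)  # True -> which is why predict_transform looked correct
```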

When using categorical fields, you can use two different approaches: reference them by name and compare against values in the original (string) representation, or reference them by position and compare against values in the encoded (numeric) representation.

Right now, you're mixing the two styles (string values in the original representation, but comparisons against encoded numeric literals).
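As an illustration (a sketch, not fitted code; the column index and the encoded literal are assumptions carried over from the pipeline above), each style is self-consistent on its own:

```python
# Style 1: original representation -- reference the field by name,
# compare against the raw string value
expr_original = "5 if X['day'] == 'Sun' else X[-1]"

# Style 2: encoded representation -- only valid if the referenced column
# actually holds the OrdinalEncoder output, not the raw strings
expr_encoded = "5 if X[3] == 2 else X[-1]"

print(expr_original)
print(expr_encoded)
```

The broken version mixed the two: a raw string field compared against an encoded numeric literal.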

This would work:

```python
ExpressionTransformer("5 if X['day']=='Sun' else X[-1]")
```

The PMML markup now makes sense:

```xml
<OutputField name="eval(5 if X['day']=='Sun' else X[-1])" optype="continuous" dataType="double" feature="transformedValue">
    <Apply function="if">
        <Apply function="equal">
            <FieldRef field="day"/>
            <Constant dataType="string">Sun</Constant>
        </Apply>
        <Constant dataType="integer">5</Constant>
        <FieldRef field="predict(tip)"/>
    </Apply>
</OutputField>
```
vruusmann commented 4 months ago

```python
predict_transformer = Pipeline([
    ('recaller1', make_recaller_union(memory=memory, names=X.columns.values.tolist(), position="first")),
    # changed the tip to 5 when sunday then tip should be five
    ('day_filter', ExpressionTransformer("5 if X[3]==2 else X[-1]"))
])
```

The source of the confusion is that you're memorizing and recalling using the original input feature names (i.e. X.columns.values.tolist()). In PMML, these will map directly to /PMML/DataDictionary/DataField elements.

In reality, your pipeline works with transformed features. So you should use names (both in memorization and recall steps!) that do not conflict with original feature names. For example, "x1", "x2", "x3", etc.
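For instance, a collision-free naming scheme could be generated mechanically (a pure-Python sketch; the column list is assumed from the "tips" dataset used above):

```python
columns = ["total_bill", "sex", "smoker", "day", "time", "size"]

# Derive memorization names that cannot collide with the DataDictionary
# field names ("total_bill", "sex", ...)
aliases = ["x" + str(i + 1) for i in range(len(columns))]
print(aliases)  # ['x1', 'x2', 'x3', 'x4', 'x5', 'x6']
```

These aliases would then be passed as the `names` argument to both `make_memorizer_union` and `make_recaller_union`.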

vruusmann commented 4 months ago

One more important note: if the post-processor requires only one input feature, then you should memorize this one specific column, not all columns. Memorization consumes memory, so it's advisable not to store anything extra there.

A refactored workflow:

```python
mapper = DataFrameMapper([
    # The post-processor needs the "day" input feature, so memorize it.
    # Place the memorizer meta-transformer after the decorator step,
    # but before any Scikit-Learn encoder steps
    (['day'], [CategoricalDomain(), make_memorizer_union(memory, names=["day-str"]), OrdinalEncoder()]),
], df_out=True)

predict_transformer = Pipeline([
    ('recaller1', make_recaller_union(memory=memory, names=["day-str"], position="first")),
    # if the recalled day is Sunday, override the predicted tip with 5
    ('day_filter', ExpressionTransformer("5 if X[0] == 'Sun' else X[-1]"))
])

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("regressor", regressor)
], predict_transformer=predict_transformer)
```

I'm also mapping the memorized input feature to a variable named "day-str" in order to avoid any confusion/conflict with the original feature name "day".
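The memorize/recall contract sketched above can be summarized conceptually (a simplified pure-Python stand-in, not the real sklearn2pmml Memory class):

```python
# Simplified stand-in: a keyed store shared between the fitting pipeline
# and the predict_transformer
memory = {}

def memorize(column, name):
    # The memorizer step stores a copy under the given name and passes
    # the data through unchanged
    memory[name] = list(column)
    return column

def recall(name):
    # The recaller step retrieves the stored values by name
    return memory[name]

memorize(["Sun", "Sat", "Thur"], "day-str")
print(recall("day-str"))  # ['Sun', 'Sat', 'Thur']
```

Because the store is keyed by name, memorizing under "day-str" keeps the recalled raw strings cleanly separated from the encoded "day" field that the regressor consumes.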