andru98 closed this issue 4 months ago
Please let's:
Systematic violation of these elementary rules will get you banned from posting anything under the JPMML software project ever again.
Don't use excessive text styles in issue body
This was caused by not fencing the Python code block properly with three backticks, so every Python code comment got promoted to a heading style.
# changed the tip to 5 when sunday then tip should be five
ExpressionTransformer("5 if X[3]==2 else X[-1]")
If you open the PMML file in a text editor, you'll see that the following PMML markup has been generated:
<OutputField name="eval(5 if X[3]==2 else X[-1])" optype="continuous" dataType="double" feature="transformedValue">
<Apply function="if">
<Apply function="equal">
<FieldRef field="day"/>
<Constant dataType="integer">2</Constant>
</Apply>
<Constant dataType="integer">5</Constant>
<FieldRef field="predict(tip)"/>
</Apply>
</OutputField>
Pay attention - you are comparing a categorical string field "day" (one of Fri, Sat, Sun, Thur) to an integer literal 2. This comparison is always bound to evaluate to False.
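A minimal plain-Python sketch of why the original expression never fires. The row layout (a string day label at index 3, the predicted tip in the last position) and the sample values are assumptions made for illustration; this is not sklearn2pmml code.

```python
# Hypothetical rows: [total_bill, sex, smoker, day, predicted_tip].
# The "day" value is still the original string label, not an encoded integer.
rows = [
    [16.99, "Female", "No", "Sun", 1.01],
    [20.65, "Male", "No", "Sat", 3.35],
]


def original_rule(X):
    # Same logic as the expression "5 if X[3]==2 else X[-1]".
    return 5 if X[3] == 2 else X[-1]


# A str never equals the int 2, so the predicted tip always passes through,
# even for the "Sun" row.
results = [original_rule(row) for row in rows]
print(results)
```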
When using categorical fields, you can use two different approaches:
Right now, you're mixing the two styles (strings in original representation, comparisons against encoded numeric literals).
This would work:
ExpressionTransformer("5 if X['day']=='Sun' else X[-1]")
The PMML markup now makes sense:
<OutputField name="eval(5 if X['day']=='Sun' else X[-1])" optype="continuous" dataType="double" feature="transformedValue">
<Apply function="if">
<Apply function="equal">
<FieldRef field="day"/>
<Constant dataType="string">Sun</Constant>
</Apply>
<Constant dataType="integer">5</Constant>
<FieldRef field="predict(tip)"/>
</Apply>
</OutputField>
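The difference between consistent and mixed comparison styles can be sketched in plain Python. This assumes the encoder assigns integer codes in sorted label order, which would map Sun to 2; the `codes` dict is a hypothetical stand-in for the encoder, not its actual implementation.

```python
labels = ["Fri", "Sat", "Sun", "Thur"]

# Assumed encoding: integer codes assigned in sorted label order.
codes = {label: index for index, label in enumerate(sorted(labels))}

day_original = "Sun"
day_encoded = codes[day_original]

# Style 1: original string value compared against a string literal.
style_original = day_original == "Sun"

# Style 2: encoded value compared against the matching integer literal.
style_encoded = day_encoded == 2

# Mixed styles: original string value compared against an encoded literal.
style_mixed = day_original == 2

print(style_original, style_encoded, style_mixed)
```

Either consistent style matches the Sunday row; only the mixed comparison fails.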
predict_transformer = Pipeline([
    ('recaller1', make_recaller_union(memory = memory, names = X.columns.values.tolist(), position = "first")),
    # changed the tip to 5 when sunday then tip should be five
    ('day_filter', ExpressionTransformer("5 if X[3]==2 else X[-1]"))
])
The source of the confusion is that you're memorizing and recalling using original input feature names (i.e. X.columns.values.tolist()). In PMML, they will map directly to /PMML/DataDictionary/DataField elements.
In reality, your pipeline works with transformed features. So you should use names (both in memorization and recall steps!) that do not conflict with original feature names. For example, "x1", "x2", "x3", etc.
One more important note - If the post-processor requires only one input feature, then you should memorize this one specific column, not all columns. Memorization consumes memory, so it's advisable not to store anything extra there.
A refactored workflow:
mapper = DataFrameMapper([
    # The post-processor needs "day" input feature, so memorize it.
    # Place the memorizer meta-transformer after the decorator step, but before any Scikit-Learn encoder steps
    (['day'], [CategoricalDomain(), make_memorizer_union(memory, names = ["day-str"]), OrdinalEncoder()]),
], df_out = True)

predict_transformer = Pipeline([
    ('recaller1', make_recaller_union(memory = memory, names = ["day-str"], position = "first")),
    # changed the tip to 5 when sunday then tip should be five
    ('day_filter', ExpressionTransformer("5 if X[0] == 'Sun' else X[-1]"))
])

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("regressor", regressor)
], predict_transformer = predict_transformer)
I'm also mapping the memorized input feature to a variable named "day-str" in order to avoid any confusion/conflict with the original feature name "day".
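The data flow of this workflow can be pictured with a toy sketch. `ToyMemory`, `memorize` and `recall_first` are hypothetical helpers written only for illustration, and the prediction values are made up; none of this reflects how sklearn2pmml implements its memory, memorizer or recaller.

```python
class ToyMemory:
    """Hypothetical stand-in for a shared memory: a dict of named columns."""

    def __init__(self):
        self.columns = {}


def memorize(memory, name, values):
    # Store one column under a name that does not clash with "day".
    memory.columns[name] = list(values)
    # Pass the data through unchanged, so later steps can still encode it.
    return values


def recall_first(memory, name, predictions):
    # Put the recalled value in front of each prediction (position "first").
    recalled = memory.columns[name]
    return [[value, prediction] for value, prediction in zip(recalled, predictions)]


memory = ToyMemory()
memorize(memory, "day-str", ["Sun", "Sat", "Sun"])

predictions = [1.01, 3.35, 2.5]  # made-up model outputs

rows = recall_first(memory, "day-str", predictions)

# Same logic as the expression "5 if X[0] == 'Sun' else X[-1]".
adjusted = [5 if X[0] == "Sun" else X[-1] for X in rows]
print(adjusted)
```

Only the single column the post-processor needs is stored, and the recalled string sits at index 0, which is why the expression compares X[0] against 'Sun'.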
I am able to access the attributes (post-processing), but I am getting wrong predictions when I load the model from the PMML/XML file. Here is the code.
Any suggestions on fixing this issue would be appreciated! Thanks in advance.