Hello again!

Jumping right to the code: I am using the `some_ts` input twice, which is certainly the source of the problem. The above throws the error: `Field "some_ts" is frozen for type information updates`.

How can I re-use the same column without creating this error? I have seen #193 and tried to use `MultiDomain([None, DateDomain()])`, but that doesn't seem to work either.
What you need here is to "define a custom feature once, then refer to it by name many times".

This is implemented by the `sklearn2pmml.cross_reference` module: https://github.com/jpmml/sklearn2pmml/tree/master/sklearn2pmml/cross_reference
Brief overview: https://openscoring.io/blog/2023/11/25/sklearn_feature_cross_references/
Your pipeline would thus become:

Feature definition:

```python
from sklearn2pmml.cross_reference import Memory, make_memorizer_union

# Shared communication channel between different pipeline sections
memory = Memory()

definer = [DateDomain(), DateTimeFormatter("%a"), make_memorizer_union(memory, names = ["memorized_ts"])]
```
Then, whenever you want to use the feature again:

```python
from sklearn2pmml.cross_reference import make_recaller_union

reuser = [make_recaller_union(memory, names = ["memorized_ts"]), OneHotEncoder(sparse_output = False)]
```
Thank you! I don't know if I understood correctly, but the following works:
```python
memory = Memory()

mapper = DataFrameMapper(
    [
        (
            ["some_ts"],
            [
                DateDomain(),
                make_memorizer_union(memory, names=["memorized_ts"]),
                DateTimeFormatter("%a"),
                OneHotEncoder(sparse_output=False),
            ],
        ),
        (
            ["some_dt"],
            [
                DateDomain(),
                make_recaller_union(memory, names=["memorized_ts"]),
                DaysSinceYearTransformer(1990),
                ExpressionTransformer("numpy.floor((X[0] - X[1]) / 365)"),
            ],
            {"alias": "age"},
        ),
    ]
)
```
Guess I was lucky with the second transformation ^^ How do we know what `X[0]` and `X[1]` are?

Does what I did make any sense? I don't feel very comfortable about it.
Couldn't we use something simpler, e.g.:

```python
mapper = DataFrameMapper(
    [
        (
            ["some_ts"],
            [DateDomain(), DateTimeFormatter("%a"), OneHotEncoder(sparse_output=False)],
        ),
        (
            ["some_ts", "some_dt"],
            [
                MultiDomain([None, DateDomain()]),  # avoid redefining the domain
                DaysSinceYearTransformer(1990),
                ExpressionTransformer("numpy.floor((X[0] - X[1]) / 365)"),
            ],
        ),
    ]
)
```
but that throws an error on fit:

```
UFuncTypeError: ufunc 'subtract' cannot use operands with types dtype('<M8[ns]') and dtype('O')
```
> How do we know what `X[0]` and `X[1]` are?
The `make_recaller_union` utility function takes an optional `position` param, which lets you configure whether the recalled column(s) should be prepended (`position = "first"`) or appended (`position = "last"`) to the input data matrix.

Your code doesn't set this param, so you get the default behaviour, which is `position = "first"`. It means that `X[0]` corresponds to `some_ts` (ie. the recalled feature), and `X[1]` corresponds to `some_dt` (ie. the column that was selected by the DataFrame mapper).
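For illustration, a minimal sketch (reusing the `memory` object from above) that appends the recalled column instead of prepending it:

```python
from sklearn2pmml.cross_reference import make_recaller_union

# With position = "last", the recalled column is appended to the input data
# matrix, so X[0] would correspond to "some_dt" (the mapper-selected column)
# and X[1] to the recalled "memorized_ts" feature
recaller = make_recaller_union(memory, names = ["memorized_ts"], position = "last")
```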
> Does what I did make any sense?
It totally makes sense!
The idea is to eliminate duplicate computations from the pipeline. You do the computation once - when its value is first needed - and then memorize it using a good mnemonic name (this name is only used as a Python dict key; it does not get transferred over to the resulting PMML document). Then, the next time the same value is needed, you recall it from the memory (whereas previously you'd be performing exactly the same computation again).
Collecting the feature domain information (eg. using `DateDomain`) is also a "computation". However, it differs from other expression-based computations in that it is allowed to take place only once (ie. it cannot be overridden) - this is what the original exception message ("Field some_ts is frozen for type information updates") is trying to tell you.
> Couldn't we use something simpler?
That's also a valid approach - you've correctly identified that you cannot apply `(Date)Domain` to the same input column more than once.

This approach would work fine with simple column types such as floats, integers or strings. However, it doesn't work with complex column types such as anything date/datetime related, because in this case the `(Date)Domain` is actually performing a computation - for example, converting a datetime string to a Python `datetime` object.
The conclusion would be that with complex columns you'd still need to do "memorization". The memorized column already contains Python `datetime` values, so by doing a recall you can avoid performing exactly the same string-to-datetime parsing operation again.
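To make the "domain as computation" point concrete, here is a small hedged sketch (the column name and values are made up):

```python
import pandas as pd

from sklearn2pmml.decoration import DateDomain

X = pd.DataFrame({"some_dt": ["1990-06-15", "2001-01-01"]})

# Fitting the domain decorator is not a pass-through operation here -
# it parses the datetime strings into date values
Xt = DateDomain().fit_transform(X)
```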
> but that throws an error on fit: `UFuncTypeError: ufunc 'subtract' cannot use operands with types dtype('<M8[ns]') and dtype('O')`
This error means that you're trying to subtract a datetime string from a Python `datetime` object.

Just remembered that it's also possible to perform data type conversions using the `sklearn2pmml.preprocessing.CastTransformer` class. So, if you do `MultiDomain([None, DateDomain()])`, then you must pass the first column through `CastTransformer(dtype = "datetime64[D]")` before attempting any arithmetic with Python datetime objects.
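A hedged sketch of this variant (assumption: the cast is applied to the whole two-column matrix, on the premise that re-casting the already-parsed `some_dt` column is a no-op):

```python
from sklearn2pmml.decoration import DateDomain, MultiDomain
from sklearn2pmml.preprocessing import CastTransformer, DaysSinceYearTransformer, ExpressionTransformer
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper(
    [
        (
            ["some_ts", "some_dt"],
            [
                MultiDomain([None, DateDomain()]),  # the domain of "some_ts" was asserted earlier
                CastTransformer(dtype="datetime64[D]"),  # parse the raw "some_ts" strings into dates
                DaysSinceYearTransformer(1990),
                ExpressionTransformer("numpy.floor((X[0] - X[1]) / 365)"),
            ],
        ),
    ]
)
```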
This issue reflects big gaps in SkLearn2PMML package documentation, rather than its actual Python/Java code.
Closing it as "fixed". However, feel free to extend this thread with more relevant comments/questions if need be.
Thank you for your explanations. I understand the subject much better now.
Hi Villu, I have the same problem, and I am using a much older version of sklearn2pmml (0.51.1.post1; an upgrade is not an option). Is there a way to achieve the same result without using sklearn2pmml.cross_reference? Open to all sorts of hacks.
> I have the same problem ...
@04pallav What exactly is the problem? Can you provide a Python code example demonstrating it? If you actually get that far, then you should open a new issue about it.
This issue covers a multitude of topics. The OP was asking one thing, and then I was explaining several other (but seemingly related) things as well. So, from my perspective, the problem could be anything.
> ... and I am using a much older version of sklearn2pmml (0.51.1.post1; an upgrade is not an option)
The inability to upgrade from a legacy SkLearn2PMML package version seems like a major issue in itself.
Why can't you move forward from the 0.51.1 version? It seems like a completely arbitrary version; there don't seem to be any breaking changes introduced in subsequent version(s).

The underlying JPMML-SkLearn dependency saw a major upgrade (ie. 1.5.X -> 1.6.X) in the 0.57.0 version. For example, what stops you from upgrading to the 0.56.2 version?
I want to produce PMML 4.3, because of which we are constrained to a lower version of sklearn2pmml. I can probably upgrade to the latest version which supports 4.3, which is "PMML 4.3: Last compatible release SkLearn2PMML 0.56.2 which is based on JPMML-SkLearn 1.5.38." But it still won't have sklearn2pmml.cross_reference, correct?
My issue is exactly the same as @woodly0's: I have a column `feature1` which I want to use twice, and apply `ContinuousDomain` to twice, because I want to use the column as a standalone feature and then again in an `ExpressionTransformer` in combination with `feature2`. I get the error `Field "feature1" is frozen for type information updates` when I try to do this. If I do not apply `ContinuousDomain` in the `ExpressionTransformer` block, raw values of `feature1` are passed to the `ExpressionTransformer`, which I do not want.
Does this help?

We have tested PMML 4.4, which is somehow slower during serving.
I probably know which company you're working with. I thought your MLOps team had this "appears slower" thing figured out, but apparently not. They should re-contact me; this isn't normal or expected behaviour.
> I can probably upgrade to the latest version which supports 4.3 ... but it still won't have sklearn2pmml.cross_reference, correct?
The `sklearn2pmml.cross_reference` module was introduced in SkLearn2PMML 0.99.0 (September 2023).
> I have a column feature1 which I want to use twice, and apply ContinuousDomain to twice, because I want to use the column as a standalone feature and then again in an ExpressionTransformer in combination with feature2.
In your pipeline, does the `ContinuousDomain` decorator do any useful work (such as casting the data type of a feature, or doing invalid/missing value replacement)? Because if it's being used in its default configuration (eg. `ContinuousDomain()`), then the second usage can be omitted without any changes to the resulting PMML document.
If the pipeline should include a feature first in its raw form, and then in its transformed form (together with some other feature(s)), then I'd suggest the following layout:
```python
mapper = DataFrameMapper([
    # The domain of a feature should be asserted when it is first used
    (["feature1", "feature2"], [ContinuousDomain(...), ExpressionTransformer(...)]),
    # The domain of "feature1" has already been asserted. No need to re-assert it.
    # If the feature does not need any processing, then simply map it to a `None` transformer
    (["feature1"], None)
])
```
Alternatively, it's possible to fork the data flow using the `FeatureUnion` meta-transformer. One branch does the extra pre-processing, whereas the other branch is simply terminated by an identity transformer:
```python
featureUnion = FeatureUnion([
    # First branch - extra feature engineering on two input columns
    ("transformed", ExpressionTransformer(...)),
    # Second branch - capture the first column, drop all remainder columns
    ("raw", ExpressionTransformer("X[0]"))
])

mapper = DataFrameMapper([
    # The domain is asserted before forking the data flow with FeatureUnion
    (["feature1", "feature2"], [ContinuousDomain(), featureUnion])
])
```
> I get the error `Field "feature1" is frozen for type information updates` when I try to do this.
A domain decorator can be applied to a column exactly once. During this one application, the data scientist should assert all information that it has to offer (eg. what is the data type, how should missing and invalid values be handled etc.).
Logically, it does not make sense to re-apply a domain decorator a second time, because the new assertion(s) could be in conflict with earlier assertion(s). For example, the data type of a column was first said to be `int`, but now it wants to be `float` or `double`.

Thinking about it, Scikit-Learn does allow columns to have such a "polymorphic" nature (eg. an integer in one sub-pipeline, and a floating-point number in another), because Scikit-Learn allows the data to enter the pipeline many times. However, in PMML, columns enter the pipeline only once (through the `<Model>/MiningSchema` element), so they need to have a frozen-like nature at that point. Of course, after a column gets past the initial entry point, its data type can be cast in any way possible.

One way to remedy the "${column} is frozen for type information updates" error would be to allow the domain decorator to be applied many times, given that all these instances have exactly the same configuration (ie. all specify the same data type, the same missing/invalid value treatment, etc.). But then again, the generated PMML document would look exactly the same as if all those secondary, tertiary etc. assertions were simply omitted from the Scikit-Learn pipeline.
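For concreteness, a hedged sketch (column names made up) of the kind of layout that triggers the error:

```python
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.preprocessing import ExpressionTransformer
from sklearn_pandas import DataFrameMapper

# "feature1" gets a domain decorator in two separate mappings; converting
# this mapper to PMML raises:
#     Field "feature1" is frozen for type information updates
mapper = DataFrameMapper([
    (["feature1"], ContinuousDomain()),
    (["feature1", "feature2"], [ContinuousDomain(), ExpressionTransformer("X[0] / X[1]")]),
])
```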
Thanks for the inputs @vruusmann. Very informative; I will keep this in mind when fighting the fight for an upgrade.
```python
FeatureUnion([
    ("og1", ExpressionTransformer("X[0]")),
    ("og2", ExpressionTransformer("X[1]")),
    ("ratio_f2_f1", Alias(ExpressionTransformer("X[0]/X[1] if X[0]!=0 else -123456"), "ratio_f2_f1")),
])
```
This worked for me, with a minor caveat: I had to patch the `get_feature_names` method onto `FeatureUnion` (`FeatureUnion.get_feature_names = custom_get_feature_names`), because the transformer doesn't have `get_feature_names` defined. I don't want to bother you further on this thread, but maybe there is a more elegant solution:
```python
def custom_get_feature_names(self):
    """Get feature names from all transformers.

    Returns
    -------
    feature_names : list of strings
        Names of the features produced by transform.
    """
    feature_names = []
    for name, trans, weight in self._iter():
        if not hasattr(trans, 'get_feature_names'):
            raise AttributeError("Transformer %s (type %s) does not "
                                 "provide get_feature_names."
                                 % (str(name), type(trans).__name__))
        feature_names.extend([name + "__" + f for f in
                              trans.get_feature_names()])
    return feature_names

# Monkey-patch the method back onto FeatureUnion
FeatureUnion.get_feature_names = custom_get_feature_names
```
> Very informative; I will keep this in mind when fighting the fight for an upgrade.
Today, I've been thinking a little about adding a `version` parameter to the `sklearn2pmml.sklearn2pmml(..)` utility function, so that it would be possible to use the latest package version and still get 4.3 schema documents.

It pains me conceptually, but technically it shouldn't be too difficult, because the 4.3 and 4.4 schemas are quite close (a downgrade to 4.2 would be slightly more difficult).

Not promising anything, but I might work on it already this month.
> I had to patch the `get_feature_names` method onto `FeatureUnion`, because the transformer doesn't have `get_feature_names` defined.
The `Domain.feature_names_in` attribute was added in SkLearn2PMML 0.105.0 (https://github.com/jpmml/sklearn2pmml/commit/1232f05c9fd42bf95a2ad37e55592417719d0aaa). The `Alias.feature_names_in` attribute has been around much longer, since SkLearn2PMML 0.86.2 or so (https://github.com/search?q=repo%3Ajpmml%2Fsklearn2pmml+2b641eed428ae8&type=commits).

This attribute hasn't been added to `ExpressionTransformer` yet.
ExpressionTransformer("X[0]/X[1] if X[0]!=0 else -123456")
This construct can be rewritten as ExpressionTransformer("X[0] / X[1]", invalid_value_treatment = "as_missing", defaultValue = -123456)
.
Right now, the rewritten construct appears much longer than the original. But as the complexity of the Python expression grows, the balance should shift.
The idea is that "division by zero" is detected automatically. You can catch such evaluation errors at the `Apply` element level (using the `Apply@invalidValueTreatment` attribute), and then steer the evaluation down a new path (here, re-classify the interim invalid value as a missing value, and then replace it with the predefined default value).
Unfortunately, these `ExpressionTransformer` attributes are absent in your SkLearn2PMML package version.
> Today, I've been thinking a little about adding a `version` parameter to the `sklearn2pmml.sklearn2pmml(..)` utility function, so that it would be possible to use the latest package version and still get 4.3 schema documents.
@vruusmann I know there were no promises, but is this still on your list? :)
@04pallav The support for earlier PMML schema versions in the form of a `pmml_schema` parameter (eg. `sklearn2pmml(.., pmml_schema = "4.3")`) is close to the top of my TODO list.
Been working with R converters over the past month. As soon as I get a break from it, I'll get back to Scikit-Learn work.
The first "version downgrade" implementation doesn't need to be super sophisticated. I'll simply generate in-memory PMML object, and then run some basic 4.3 version compatibility checks on it. If all checks clear, I'll write the in-memory PMML object out to a file. If some checks fail (meaning that there is some backwards-incompatible markup in use), then I'll print out the the appropriate info messages and fail with an error.
A very simple strategy, but should address >80% workflows, because PMML 4.3 and 4.4 are really-really close.
@04pallav SkLearn2PMML version 0.110.0 is freshly available on PyPI; it supports "soft" PMML schema version downgrades (from the default 4.4 schema version) with the help of a `pmml_schema` parameter (I decided to reserve the `version` parameter for the package's own upgrade/downgrade needs).

So, you should be able to use the `sklearn2pmml.cross_reference` module in your workflows now!
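For example, a minimal sketch of the new export path (assuming `pipeline` is an already-fitted `PMMLPipeline`; the file name is made up):

```python
from sklearn2pmml import sklearn2pmml

# Request a PMML 4.3 schema document instead of the default 4.4 one
sklearn2pmml(pipeline, "pipeline.pmml", pmml_schema = "4.3")
```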
This is awesome!! Thanks @vruusmann
So if we use sklearn2pmml 0.110.0 on the training side to generate 4.3 documents using `pmml_schema`, and use jpmml-evaluator 1.4.11 (which supports nothing higher than 4.3) on the serving side, do you see any potential reason for conflict?
> if we use sklearn2pmml 0.110.0 on the training side .. and use jpmml-evaluator 1.4.11 on the serving side, do you see any potential reason for conflict?
@04pallav The generated PMML document does not have any Java library dependencies, so there should not be any low-level technical conflicts.
However, there might be some high-level logical conflicts. For example, JPMML-Evaluator 1.4.11 is a very old version, so it contains a few known bugs. If your newly generated PMML markup happens to trigger one of those (you couldn't generate such markup with SkLearn2PMML 0.51.1 due to its limited nature, but now you can), you'll get an evaluation error.
Anyway, my recommendation is that you should always embed a small verification dataset into each PMML document using the `PMMLPipeline.verify(X)` method.

The JPMML-Evaluator library (even in its 1.4.11 version) has an `Evaluator#verify()` method, which should be invoked every time a PMML document is loaded. If this verification passes cleanly, then you know you have everything working correctly (ie. the predictions are 100% reproducible between Python and (J)PMML environments). If the verification fails, then it means that you've hit some internal JPMML-Evaluator limitation (a bug, or an unsupported PMML markup) and you should really try upgrading it.

TLDR: Always do `PMMLPipeline.verify(X)` on the converter side, and `Evaluator#verify()` on the consumer side. If the model verification succeeds, you're guaranteed good.
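To illustrate the converter side, a hedged sketch (assuming `pipeline` is a fitted `PMMLPipeline` and `X` is the training dataset as a pandas DataFrame):

```python
from sklearn2pmml import sklearn2pmml

# Embed a small verification dataset (here, ten sampled rows) into the
# PMML document, then convert
pipeline.verify(X.sample(n = 10, random_state = 13))
sklearn2pmml(pipeline, "pipeline.pmml", pmml_schema = "4.3")
```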