jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

How to properly re-use raw features #423

Closed woodly0 closed 5 months ago

woodly0 commented 5 months ago

Hello again!

Jumping right to the code:

import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper

from sklearn2pmml.decoration import DateDomain
from sklearn2pmml.preprocessing import DateTimeFormatter, DaysSinceYearTransformer, ExpressionTransformer

second_series = pd.Series(pd.date_range("2024-01-01 00:00:03", periods=4, freq="d"), name="some_ts")
year_series = pd.Series(pd.date_range("2000-01-01", periods=4, freq="YE"), name="some_dt")

X = pd.concat([second_series, year_series], axis=1)
y = pd.Series([0, 0, 1, 0])

mapper = DataFrameMapper(
    [
        (
            ["some_ts"],
            [
                DateDomain(),
                DateTimeFormatter("%a"),
                OneHotEncoder(sparse_output=False),
            ],
        ),
        (
            ["some_ts", "some_dt"],
            [
                DateDomain(),  
                DaysSinceYearTransformer(1999),
                ExpressionTransformer("numpy.floor((X[0] - X[1]) / 365)"),
            ],
            {"alias": "age"},
        ),
    ]
)

I am using the some_ts input twice, which is certainly the source of the problem.

from sklearn.linear_model import LogisticRegression

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pmml_pipe = PMMLPipeline(
    [
        ("mapper", mapper),
        ("classifier", LogisticRegression()),
    ]
)
pmml_pipe.fit(X, y)
sklearn2pmml(pmml_pipe, "out.pmml")

The above throws the following error:

Standard output is empty
Standard error:
Exception in thread "main" java.lang.IllegalArgumentException: Field some_ts is frozen for type information updates
    at sklearn2pmml.decoration.Domain.updateDataField(Domain.java:124)
    at sklearn.Transformer.refineWildcardFeature(Transformer.java:123)
    at sklearn.Transformer.updateFeatures(Transformer.java:105)
    at sklearn.Transformer.encode(Transformer.java:74)
    at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:67)
    at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:45)
    at sklearn.Initializer.encode(Initializer.java:59)
    at sklearn.Composite.encodeFeatures(Composite.java:111)
    at sklearn.Composite.initFeatures(Composite.java:254)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:112)
    at com.sklearn2pmml.Main.run(Main.java:80)
    at com.sklearn2pmml.Main.main(Main.java:65)

How can I re-use the same column without causing this error? I have seen #193 and tried MultiDomain([None, DateDomain()]), but that doesn't seem to work either.

vruusmann commented 5 months ago

What you need here is "define a custom feature once, then refer to it by name many times".

This is implemented in the sklearn2pmml.cross_reference module: https://github.com/jpmml/sklearn2pmml/tree/master/sklearn2pmml/cross_reference

Brief overview: https://openscoring.io/blog/2023/11/25/sklearn_feature_cross_references/

vruusmann commented 5 months ago

Your pipeline would thus become:

Feature definition:

from sklearn2pmml.cross_reference import Memory, make_memorizer_union

# Shared communication channel between different pipeline sections
memory = Memory()

definer = [DateDomain(), DateTimeFormatter("%a"), make_memorizer_union(memory, names = ["memorized_ts"])]

Then, whenever you want to use the feature again:

from sklearn2pmml.cross_reference import make_recaller_union

reuser = [make_recaller_union(memory, names = ["memorized_ts"]), OneHotEncoder(sparse_output=False)]

woodly0 commented 5 months ago

Thank you! I don't know if I understood correctly, but the following works:

memory = Memory()

mapper = DataFrameMapper(
    [
        (
            ["some_ts"],
            [
                DateDomain(),
                make_memorizer_union(memory, names=["memorized_ts"]),
                DateTimeFormatter("%a"),
                OneHotEncoder(sparse_output=False),
            ],
        ),
        (
            ["some_dt"],
            [
                DateDomain(),
                make_recaller_union(memory, names=["memorized_ts"]),
                DaysSinceYearTransformer(1990),
                ExpressionTransformer("numpy.floor((X[0] - X[1]) / 365)"),
            ],
            {"alias": "age"},
        ),
    ]
)

Guess I was lucky with the second transformation ^^ How do we know what X[0] and X[1] are?

woodly0 commented 5 months ago

Does what I did make any sense? I don't feel very comfortable about it.

woodly0 commented 5 months ago

Couldn't we use something simpler, e.g.:

mapper = DataFrameMapper(
    [
        (
            ["some_ts"],
            [DateDomain(), DateTimeFormatter("%a"), OneHotEncoder(sparse_output=False)],
        ),
        (
            ["some_ts", "some_dt"],
            [
                MultiDomain([None, DateDomain()]),  # avoid redefining the domain
                DaysSinceYearTransformer(1990),
                ExpressionTransformer("numpy.floor((X[0] - X[1]) / 365)"),
            ],
        ),
    ]
)

but that throws an error on fit:

UFuncTypeError: ufunc 'subtract' cannot use operands with types dtype('<M8[ns]') and dtype('O')

vruusmann commented 5 months ago

How do we know what X[0] and X[1] are?

The make_recaller_union utility function takes an optional position param, which lets you configure whether the recalled column(s) should be prepended (position = "first") or appended (position = "last") to the input data matrix.

Your code doesn't set this param, so you get the default behaviour, which is position = "first". This means that X[0] corresponds to some_ts (ie. the recalled feature), and X[1] corresponds to some_dt (ie. the column that was selected by the DataFrameMapper).
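
For example, appending instead of prepending would flip the indices (a sketch based on the position param described above):

# With position = "last", the recalled memorized_ts column is appended,
# so X[0] would be some_dt and X[1] the recalled some_ts.
make_recaller_union(memory, names=["memorized_ts"], position="last")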

vruusmann commented 5 months ago

Does what I did make any sense?

It totally makes sense!

The idea is to eliminate duplicate computations from the pipeline. You do the computation once - when its value is first needed - and then memorize it using a good mnemonic name (this name is only used as a Python dict key; it does not get transferred over to the resulting PMML document). Then, the next time the same value is needed, you recall it from memory (whereas previously you'd have been performing exactly the same computation again).

Collecting the feature domain information (eg. using DateDomain) is also a "computation". However, it is different from other expression-based computations in that it is allowed to take place only once (ie. it cannot be overridden) - this is what the original exception message ("Field some_ts is frozen for type information updates") is trying to tell you.

vruusmann commented 5 months ago

Couldn't we use something simpler

That's also a valid approach - you've correctly identified that you cannot apply (Date)Domain to the same input column more than once.

This approach would work fine with simple column types such as floats, integers or strings.

However, it doesn't work with complex column types such as anything date/datetime related, because in this case the (Date)Domain is actually performing a computation - for example, converting a datetime string to a Python datetime object.

The conclusion would be that with complex columns you'd still need to do "memorization". The memorized column now already contains Python datetime values, so by doing recall you can avoid doing exactly the same string-to-datetime parsing operation again.

vruusmann commented 5 months ago

but that throws an error on fit:

UFuncTypeError: ufunc 'subtract' cannot use operands with types dtype('<M8[ns]') and dtype('O')

This error means that you're trying to subtract a datetime string from a Python datetime object.

Just remembered that it's also possible to perform data type conversions using the sklearn2pmml.preprocessing.CastTransformer class. So, if you do MultiDomain([None, DateDomain()]), then you must pass the first column through CastTransformer(dtype = "datetime64[D]") before attempting any arithmetic with Python datetime objects.

vruusmann commented 5 months ago

This issue reflects big gaps in the SkLearn2PMML package documentation, rather than in its actual Python/Java code.

Closing it as "fixed". However, feel free to extend this thread with more relevant comments/questions if need be.

woodly0 commented 5 months ago

Thank you for your explanations. I understand the subject much better now.

04pallav commented 4 months ago

Hi Villu, I have the same problem, and I am using a much older version of sklearn2pmml (0.51.1.post1; upgrading is not an option). Is there a way to achieve the same result without using sklearn2pmml.cross_reference? Open to all sorts of hacks.

vruusmann commented 4 months ago

I have the same problem ...

@04pallav What exactly is the problem? Can you provide a Python code example demonstrating it? If you actually get this far, then you should open a new issue about it.

This issue covers a multitude of topics. The OP was asking one thing, and then I was explaining several other (but seemingly related) things as well. So, from my perspective, the problem could be anything.

... and I am using a much older version of sklearn2pmml (0.51.1.post1; upgrading is not an option)

The inability to upgrade from a legacy SkLearn2PMML package version seems like a major issue in itself.

Why can't you move forward from the 0.51.1 version? This seems like a completely arbitrary version; there don't seem to be any breaking changes introduced in the subsequent version(s).

The underlying JPMML-SkLearn dependency saw a major upgrade (ie. 1.5.X -> 1.6.X) in the 0.57.0 version. For example, what stops you from upgrading to the 0.56.2 version?

04pallav commented 4 months ago

I want to produce PMML 4.3, which is why we are constrained to a lower version of sklearn2pmml. I can probably upgrade to the latest version which supports 4.3, which is "PMML 4.3: Last compatible release SkLearn2PMML 0.56.2 which is based on JPMML-SkLearn 1.5.38.", but it still won't have sklearn2pmml.cross_reference, correct?

My issue is exactly the same as @woodly0's: I have a column feature1 which I want to use twice and apply ContinuousDomain to twice, because I want to use the column as a standalone feature and then again in an ExpressionTransformer in combination with feature2. I get the error Field "feature1" is frozen for type information updates when I try to do this. If I do not apply ContinuousDomain in the ExpressionTransformer block, the raw values of feature1 are passed to the ExpressionTransformer, which I do not want.

Does this help?

vruusmann commented 4 months ago

We have tested pmml 4.4 which is somehow slower during serving

I probably know what company you're working with. I thought your MLOps team had this "appears slower" thing figured out, but apparently not. They should re-contact me; this isn't normal and expected behaviour.

I can probably upgrade to the latest version which supports 4.3 ... but it still won't have sklearn2pmml.cross_reference, correct?

The sklearn2pmml.cross_reference module was introduced in SkLearn2PMML 0.99.0 (September 2023).

I have a column feature1 which I want to use twice and apply ContinuousDomain to twice, because I want to use the column as a standalone feature and then again in an ExpressionTransformer in combination with feature2.

In your pipeline, does the ContinuousDomain decorator do any useful work (such as casting the data type of a feature, or doing invalid/missing value replacement)? Because if it's being used in its default configuration (eg. ContinuousDomain()), then the second usage can be omitted without any changes to the resulting PMML document.
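
For illustration, a non-default configuration that would count as "useful work" could look roughly like this (a sketch; the decoration parameter names shown here are assumptions about ContinuousDomain's options, not taken from this thread):

# Cast the column to float and replace missing values with a constant,
# so that the decorator actually contributes markup to the PMML document.
ContinuousDomain(dtype = "float64", missing_value_treatment = "as_value", missing_value_replacement = 0.0)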

If the pipeline should include the feature first in its raw form, and then in its transformed form (together with some other feature(s)), then I'd suggest the following layout:

mapper = DataFrameMapper([
  # The domain of a feature should be asserted when it is first used.
  (["feature1", "feature2"], [ContinuousDomain(...), ExpressionTransformer(...)]),
  # The domain of feature1 has already been asserted. No need to re-assert it again.
  # If the feature does not need any processing, then map it simply to a `None` transformer
  (["feature1"], None)
])

Alternatively, it's possible to fork the data flow using the FeatureUnion meta-transformer. One branch does the extra pre-processing, whereas the other branch is simply terminated by an identity transformer:

featureUnion = FeatureUnion([
  # First branch - extra feature engineering on two input columns
  ("transformed", ExpressionTransformer(...)),
  # Second branch - capture the first column, drop all remaining columns
  ("raw", ExpressionTransformer("X[0]"))
])

mapper = DataFrameMapper([
  # The domain is asserted before forking the data flow with FeatureUnion
  (["feature1", "feature2"], [ContinuousDomain(), featureUnion])
])

vruusmann commented 4 months ago

I get the error Field "feature1" is frozen for type information updates when I try to do this.

A domain decorator can be applied to a column exactly once. During this one application, the data scientist should assert all the information they have about it (eg. what the data type is, how missing and invalid values should be handled, etc.).

Logically, it does not make sense to re-apply a domain decorator a second time, because the new assertion(s) could be in conflict with earlier assertion(s). For example, the data type of a column was first asserted to be int, but now it wants to be float or double.

Thinking about it, Scikit-Learn does allow columns to have such a "polymorphic" nature (eg. an integer in one sub-pipeline, and a floating-point number in another), because Scikit-Learn allows the data to enter the pipeline many times. However, in PMML, columns enter the pipeline only once (through the <Model>/MiningSchema element), so they need to have a frozen-like nature at that point. Of course, after a column gets past the initial entry point, its data type can be cast in any way possible.

One way to remedy the "${column} is frozen for type information updates" error would be to allow the domain decorator to be applied many times, provided that all these instances have exactly the same configuration (ie. all specify the same data type, the same missing/invalid value treatment, etc). But then again, the generated PMML document would look exactly the same as if all those secondary, tertiary etc. assertions were simply omitted from the Scikit-Learn pipeline.

04pallav commented 4 months ago

Thanks for the inputs @vruusmann. Very informative, and I will keep it in mind when fighting the fight for the upgrade.

            FeatureUnion([
                ("og1", ExpressionTransformer("X[0]")),
                ("og2", ExpressionTransformer("X[1]")),
                ("ratio_f2_f1", Alias(ExpressionTransformer("X[0]/X[1] if X[0]!=0  else -123456"), "ratio_f2_f1")),
                ]),

This worked for me, with a minor caveat: I had to override the get_feature_names method of FeatureUnion with a patch (FeatureUnion.get_feature_names = custom_get_feature_names), because the transformer doesn't have get_feature_names defined. I don't want to bother you further on this thread, but maybe there is a more elegant solution:

    def get_feature_names(self):
        """Get feature names from all transformers.

        Returns
        -------
        feature_names : list of strings
            Names of the features produced by transform.
        """
        feature_names = []
        for name, trans, weight in self._iter():
            if not hasattr(trans, 'get_feature_names'):
                raise AttributeError("Transformer %s (type %s) does not "
                                     "provide get_feature_names."
                                     % (str(name), type(trans).__name__))
            feature_names.extend([name + "__" + f for f in
                                  trans.get_feature_names()])
        return feature_names

vruusmann commented 4 months ago

Very informative, and I will keep it in mind when fighting the fight for the upgrade.

Today, I've been thinking a little about adding a version parameter to the sklearn2pmml.sklearn2pmml(..) utility function, so that it would be possible to use the latest package version and still get 4.3 schema documents.

It pains me conceptually, but technically it shouldn't be too difficult, because the 4.3 and 4.4 schemas are quite close (downgrade to 4.2 would be slightly more difficult).

Not promising anything, but I might work on it already this month.

I had to override the get_feature_names method of FeatureUnion with a patch because the transformer doesn't have get_feature_names defined.

The Domain.feature_names_in attribute was added in SkLearn2PMML 0.105.0 (https://github.com/jpmml/sklearn2pmml/commit/1232f05c9fd42bf95a2ad37e55592417719d0aaa). The Alias.feature_names_in attribute has been around much longer, since SkLearn2PMML 0.86.2 or so (https://github.com/search?q=repo%3Ajpmml%2Fsklearn2pmml+2b641eed428ae8&type=commits).

This attribute hasn't been added to ExpressionTransformer yet.

ExpressionTransformer("X[0]/X[1] if X[0]!=0 else -123456")

This construct can be rewritten as ExpressionTransformer("X[0] / X[1]", invalid_value_treatment = "as_missing", default_value = -123456).

Right now, the rewritten construct appears much longer than the original. But as the complexity of the Python expression grows, the balance should shift.

The idea is that "division by zero" is detected automatically. You can catch such evaluation errors at the Apply element level (using the Apply@invalidValueTreatment attribute), and then steer the evaluation onto a new path (here, re-classifying the interim invalid value as a missing value, and then replacing it with the predefined default value).
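
Dropped back into the FeatureUnion from the earlier comment, the ratio branch might then look like this (a sketch; as noted below, these parameters only exist in newer SkLearn2PMML versions, and the snake_case spelling assumes the current Python API):

("ratio_f2_f1", Alias(ExpressionTransformer("X[0] / X[1]",
    invalid_value_treatment = "as_missing",  # a division-by-zero result is re-classified as missing
    default_value = -123456), "ratio_f2_f1")),  # the missing result is then replaced by the sentinel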

Unfortunately, these ExpressionTransformer attributes are absent in your SkLearn2PMML package version.

04pallav commented 3 months ago

Today, I've been thinking a little about adding a version parameter to the sklearn2pmml.sklearn2pmml(..) utility function, so that it would be possible to use the latest package version and still get 4.3 schema documents.

@vruusmann I know there were no promises, but is this still on your list? :)

vruusmann commented 3 months ago

@04pallav The support for earlier PMML schema versions in the form of a pmml_schema parameter (eg. sklearn2pmml(.., pmml_schema = "4.3")) is close to the top of my TODO list.

Been working with R converters over the past month. As soon as I get a break from it, I'll get back to Scikit-Learn work.

The first "version downgrade" implementation doesn't need to be super sophisticated. I'll simply generate in-memory PMML object, and then run some basic 4.3 version compatibility checks on it. If all checks clear, I'll write the in-memory PMML object out to a file. If some checks fail (meaning that there is some backwards-incompatible markup in use), then I'll print out the the appropriate info messages and fail with an error.

A very simple strategy, but it should address >80% of workflows, because PMML 4.3 and 4.4 are really, really close.

vruusmann commented 3 months ago

@04pallav There is SkLearn2PMML version 0.110.0 freshly available on PyPI, which supports "soft" PMML schema version downgrades (from the default 4.4 schema version) with the help of the pmml_schema parameter (I decided to reserve the version parameter for the package's own upgrade/downgrade needs).

So, you should be able to use the sklearn2pmml.cross_reference module in your workflows now!
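
Usage sketch, following the pmml_schema example from the earlier comment:

# Convert as usual, but downgrade the generated document to the PMML 4.3 schema
sklearn2pmml(pmml_pipe, "out.pmml", pmml_schema = "4.3")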

04pallav commented 3 months ago

This is awesome!! Thanks @vruusmann
So if we use sklearn2pmml 0.110.0 on the training side to generate 4.3 using pmml_schema, and use jpmml-evaluator 1.4.11 (which supports nothing higher than 4.3) on the serving side, do you see any potential reason for conflict?

vruusmann commented 3 months ago

if we use sklearn2pmml 0.110.0 on the training side .. and use jpmml-evaluator 1.4.11 on the serving side, do you see any potential reason for conflict?

@04pallav The generated PMML document does not have any Java library dependencies, so there should not be any low-level technical conflicts.

However, there might be some high-level logical conflicts. For example, JPMML-Evaluator 1.4.11 is such an old version that it contains a few known bugs. If your newly generated PMML markup happens to trigger one of those (you couldn't generate such markup with SkLearn2PMML 0.51.1 due to its limited nature, but now you can), you'll get a simple evaluation error.

Anyway, my recommendation is that you should always embed a small verification dataset into each PMML document using the PMMLPipeline.verify(X) method.

The JPMML-Evaluator library (even in its 1.4.11 version) has an Evaluator#verify() method which should be invoked every time a PMML document is loaded. If this verification passes cleanly, then you know you have everything working correctly (ie. the predictions are 100% reproducible between the Python and (J)PMML environments). If the verification fails, then it means that you've hit some internal JPMML-Evaluator limitation (a bug, or unsupported PMML markup) and you should really try upgrading it.

TLDR: Always do PMMLPipeline.verify(X) on the converter side, and Evaluator#verify() on the consumer side. If the model verification succeeds, you're guaranteed good.
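
A converter-side sketch, reusing the toy pmml_pipe, X and y from the top of this thread:

# Embed a small verification dataset (inputs plus the model's own predictions)
# into the PMML document; Evaluator#verify() replays it on the consumer side.
pmml_pipe.fit(X, y)
pmml_pipe.verify(X)
sklearn2pmml(pmml_pipe, "out.pmml")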