jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Post-processing predicted probabilities using a helper (ie. exogenous) feature #369

Closed liuhuanshuo closed 1 year ago

liuhuanshuo commented 1 year ago

I'm building a machine learning pipeline using PMMLPipeline, and I need to correct the predicted probability according to the value of a certain column after the classifier has made its prediction.

For example, I have 12 columns of data. I need to use the first 11 columns for prediction, and then correct the prediction result according to the value of the 12th column (if the 12th column is greater than 0, correct the prediction result to 0).

So my code is written as follows:

def modify_proba(X):
    X[:, 11][X[:, 11] > 0] = 0
    return X

model_1 = PMMLPipeline(
    steps=[
        ("mapper", mapper),
        ("classifier", clf_1),
        ('modify_proba', FunctionTransformer(modify_proba, validate=False))])

But it raises the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-193-399ece4a3431> in <module>
      3         ("mapper", mapper),
      4         ("classifier", clf_1),
----> 5         ('modify_proba', FunctionTransformer(modify_proba, validate=False))])

~/.local/lib/python3.7/site-packages/sklearn2pmml/pipeline/__init__.py in __init__(self, steps, header, predict_transformer, predict_proba_transformer, apply_transformer, memory, verbose)
     54                 self.apply_transformer = apply_transformer
     55                 # SkLearn 0.24+
---> 56                 super(PMMLPipeline, self).__init__(steps = steps, memory = memory, verbose = verbose)
     57 
     58         def __repr__(self):

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, steps, memory, verbose)
    132         self.memory = memory
    133         self.verbose = verbose
--> 134         self._validate_steps()
    135 
    136     def get_params(self, deep=True):

~/.local/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_steps(self)
    180                                 "transformers and implement fit and transform "
    181                                 "or be the string 'passthrough' "
--> 182                                 "'%s' (type %s) doesn't" % (t, type(t)))
    183 
    184         # We allow last estimator to be None as an identity transformation

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=4000,
                   multi_class='ovr', n_jobs=16, penalty='l2', random_state=0,
                   solver='lbfgs', tol=0.0001, verbose=3, warm_start=False)' (type <class 'sklearn.linear_model._logistic.LogisticRegression'>) doesn't

It seems that adding an extra step after the classifier is not allowed?

How can I implement adding extra rules after the classifier? I noticed that the output of the classifier seems to be a plain array.

I expect to use the values of certain columns to modify the predicted probabilities of classes 0 and 1. Is it impossible to achieve?

vruusmann commented 1 year ago

It seems that adding an extra step after the classifier is not allowed?

Yes, this is a SCIKIT-LEARN LIMITATION (not a SkLearn2PMML limitation) - "a pipeline can contain at most one estimator (aka model) object, and if present, it must be located in the final position".
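A minimal sketch of that limitation, using only plain Scikit-Learn. The helper name `try_build_and_fit` and the toy data are made up for illustration. Depending on the Scikit-Learn version, the step validation runs either in the constructor (as in the traceback above) or in `fit()`, so the sketch wraps both.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def try_build_and_fit(steps, X, y):
    """Return the TypeError message if Scikit-Learn rejects the steps, else None."""
    try:
        # Older releases validate the steps in __init__, newer ones in fit()
        Pipeline(steps).fit(X, y)
    except TypeError as exc:
        return str(exc)
    return None


# Toy data, for illustration only
X = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.8, 0.1]]
y = [0, 1, 0, 1]

# Estimator in the final position: accepted
ok = try_build_and_fit(
    [("scale", StandardScaler()), ("clf", LogisticRegression())], X, y)

# Estimator followed by another step: rejected, because LogisticRegression
# does not implement transform()
bad = try_build_and_fit(
    [("clf", LogisticRegression()), ("post", FunctionTransformer())], X, y)
```

Here `ok` stays `None`, while `bad` holds the "All intermediate steps should be transformers ..." message.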

How can I implement adding extra rules after the classifier?

You can use PMMLPipeline.predict_transform(X), PMMLPipeline.predict_proba_transform(X), etc post-processing methods. IIRC, you've done it before.

Alternatively, if you wrap your estimator object into sklego.meta.EstimatorTransformer object, then it becomes a transformer (implements fit_transform(X) instead of fit_predict(X, y)), and the above limitation is effectively cancelled. That is, you can have multiple estimator-disguised-as-transformer objects in your pipeline, plus they can appear in positions other than the final position.

Is it impossible to achieve?

It is impossible using canonical Scikit-Learn pipelines.

But if you're open to using 3rd-party extension packages, and doing some thinking for yourself, it's easy-peasy.

liuhuanshuo commented 1 year ago

Thanks for the idea, I will look into it further

You can use PMMLPipeline.predict_transform(X), PMMLPipeline.predict_proba_transform(X), etc post-processing methods. IIRC, you've done it before.

It's actually different from what I've done before.

I used to add additional columns based on the predicted probability value.

For example, add a column score. If the probability value is 0.1, set the value of score to 100.

This is very simple and can be achieved by setting predict_proba_transformer

But what we want to achieve now is to modify the probability value here according to a certain column of the original data

The difficulty is that after the classifier, the original columns have been replaced by two columns of probability values, and they cannot be retrieved!

liuhuanshuo commented 1 year ago

Of course, I realize now that this has nothing to do with sklearn2pmml, and I probably shouldn't be asking here.

vruusmann commented 1 year ago

It's actually different from what I've done before.

... meaning you've already done things that you thought were impossible to do!

But what we want to achieve now is to modify the probability value here according to a certain column of the original data

I understand your problem very well.

By default, Scikit-Learn pipelines are linear, which means that the output of one pipeline step is passed to the next pipeline step, and there is no way for this "next step" to reference initial data (you want to do this!), or the output of some arbitrary earlier step (I typically want to do this).

The solution should be obvious - make your Scikit-Learn pipeline non-linear. For example, by utilizing sklearn.pipeline.FeatureUnion. There are 3rd party extension packages that give you much more powerful pipeline abstraction tools than FeatureUnion.

Or you can develop a custom (meta-)transformer class to achieve your goal.
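A rough sketch of what such a non-linear pipeline could look like with plain Scikit-Learn. Everything named here is an assumption made for illustration: `ProbaTransformer` is a hypothetical stand-in for an estimator-disguised-as-transformer wrapper (it is not the sklego implementation), `apply_rule` is a made-up helper, and LogisticRegression is just a placeholder classifier. The sketch uses ColumnTransformer rather than FeatureUnion because it can route column subsets directly. It only shows the Python side; whether such custom classes can be converted to PMML is a separate question.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


class ProbaTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: transform(X) returns the classifier's predict_proba(X)."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        return self.estimator_.predict_proba(X)


def apply_rule(Z):
    # Z columns: [P(class 0), P(class 1), helper column]
    Z = np.array(Z, dtype=float)  # work on a copy
    override = Z[:, 2] > 0
    Z[override, 0] = 1.0  # force the prediction to class 0
    Z[override, 1] = 0.0
    return Z[:, :2]


# Two parallel branches: probabilities from columns 0..10, and column 11 as-is
branches = ColumnTransformer([
    ("proba", ProbaTransformer(LogisticRegression()), list(range(11))),
    ("helper", "passthrough", [11]),
])

pipeline = Pipeline([
    ("branches", branches),
    ("rule", FunctionTransformer(apply_rule)),
])
```

After `pipeline.fit(X, y)`, calling `pipeline.transform(X)` returns the corrected two-column probabilities, because the helper column travels next to the probabilities instead of being dropped by the classifier.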

I realize now that this has nothing to do with sklearn2pmml, and I probably shouldn't be asking here.

SkLearn2PMML is designed around the PMML abstraction of data flows - lazily evaluatable graph of computations (on scalar values). This is much richer than the Scikit-Learn abstraction of a linear sequence of computations.

If you ask here, you may get new ideas. It's up to you what happens after that.

liuhuanshuo commented 1 year ago

Hi Villu @vruusmann,

After a day of continuous exploration, I finally found a way to achieve what I need. Although it may not be exactly what was originally expected, it is good enough for me.

... meaning you've already done things that you thought were impossible to do!

This sentence gave me and my partner great encouragement

But now I have to ask you for help, because this largely needs to be solved by sklearn2pmml!

I created several custom classifiers and then used VotingClassifier so that the classification process is no longer serial.

But I encountered a similar problem as before. It works fine in the pipeline, but it cannot be saved as a PMML file!

When I saved my pipeline, the following error occurred:

Exception in thread "main" java.lang.IllegalArgumentException: List attribute 'sklearn.ensemble._voting.VotingClassifier.estimators_' contains an unsupported value (Python class __main__.CustomClassifier)
    at org.jpmml.python.CastFunction.apply(CastFunction.java:47)
    at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:651)
    at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
    at sklearn.ensemble.voting.VotingClassifier.encodeModel(VotingClassifier.java:59)
    at sklearn.Estimator.encode(Estimator.java:118)
    at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:187)
    at com.sklearn2pmml.Main.run(Main.java:91)
    at com.sklearn2pmml.Main.main(Main.java:66)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Classifier
    at java.lang.Class.cast(Class.java:3369)
    at org.jpmml.python.CastFunction.apply(CastFunction.java:45)
    ... 7 more

Because I have a lot of code, it is not convenient to show all of it, but I believe that the error is caused by my custom class below. Is a custom classifier written this way impossible to save? Can you see the problem?

import pandas as pd  # needed for pd.isna below
from sklearn.base import BaseEstimator, ClassifierMixin

class CustomClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def predict(self, X):

        X = X[['feature1','feature2','feature3']]
        pred = [1 if not any(pd.isna(row)) else -99 for row in X.values]
        return pred

    def predict_proba(self, X):
        X = X[['feature1','feature2','feature3']]
        pred = [1 if not any(pd.isna(row)) else -99 for row in X.values]
        return [[1-p, p] for p in pred]

If you need more information to determine the problem, you can let me know

Thanks again

vruusmann commented 1 year ago

You have defined a custom class __main__.CustomClassifier, which is not registered with the JPMML-SkLearn backend, hence it is not recognized by the converter.

When I look at the business logic of this class, then it seems to be equivalent to the standard sklearn.dummy.DummyClassifier estimator:

dummy_clf = DummyClassifier(strategy = ...)

Reading the opening comment of this issue, it seems to me that your general workflow would be something like this:

if X[11] > 0:
  return dummy_clf.predict_proba(X)
else:
  return my_normal_pipeline.predict_proba(X)
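For reference, the same workflow written as runnable, row-wise Python. This is only a sketch under assumptions: the function name `routed_predict_proba`, the toy data, and LogisticRegression standing in for `my_normal_pipeline` are all made up for illustration, and the dummy classifier uses the constant strategy with class 0.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

HELPER = 11  # index of the helper column (the 12th column of the opening comment)


def routed_predict_proba(X, dummy_clf, normal_clf):
    """Rows with helper > 0 go to dummy_clf, all other rows go to normal_clf."""
    X = np.asarray(X, dtype=float)
    use_dummy = X[:, HELPER] > 0
    proba = np.empty((X.shape[0], 2))
    if use_dummy.any():
        proba[use_dummy] = dummy_clf.predict_proba(X[use_dummy, :HELPER])
    if (~use_dummy).any():
        proba[~use_dummy] = normal_clf.predict_proba(X[~use_dummy, :HELPER])
    return proba


# Toy data, for illustration only
rng = np.random.RandomState(0)
X = rng.rand(200, 12)
X[:, HELPER] = rng.choice([-1.0, 1.0], size=200)
y = (X[:, 0] > 0.5).astype(int)

# Both classifiers are fitted on the first 11 columns only
dummy_clf = DummyClassifier(strategy="constant", constant=0).fit(X[:, :HELPER], y)
normal_clf = LogisticRegression().fit(X[:, :HELPER], y)

proba = routed_predict_proba(X, dummy_clf, normal_clf)
```

Every row whose helper column is greater than 0 ends up with probabilities [1, 0], and all other rows keep the output of the normal classifier.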

It is possible to construct "conditionally evaluated" estimator ensembles using the sklearn2pmml.ensemble.EstimatorChain meta-estimator class:

from sklearn2pmml.ensemble import EstimatorChain

classifier = EstimatorChain([
  ("dummy", dummy_clf, "X[11] > 0"),
  ("normal_pipeline", my_normal_pipeline, "X[11] <= 0")
], multioutput = False)

Alternatively, you could use the sklearn2pmml.ensemble.SelectFirstClassifier meta-estimator class.

I'm not sure if you find this comment helpful or not, just trying to connect relevant pieces without needing to define any custom estimator types (eg. __main__.CustomClassifier).

liuhuanshuo commented 1 year ago

Yes, what I need to achieve is as you said.

In essence, I just need two classifiers, one of which is used to judge rows that meet certain conditions as 0

I followed your tip and used DummyClassifier(strategy = 'constant', constant = 0) and it seemed to work fine, but when I combined it with EstimatorChain I got the following error:

AttributeError: 'EstimatorChain' object has no attribute 'predict_proba'

You have defined a custom class __main__.CustomClassifier, which is not registered with the JPMML-SkLearn backend, hence it is not recognized by the converter.

I would still like to ask whether it is possible to save the CustomClassifier that I defined above, since saving is the only part I'm having trouble with.

Is it possible to register it as something sklearn2pmml recognizes?