Closed: liuhuanshuo closed this issue 1 year ago.
> It seems that adding an extra step after the classifier is not allowed?
Yes, this is a SCIKIT-LEARN LIMITATION (not a SkLearn2PMML limitation) - "a pipeline can contain at most one estimator (aka model) object, and if present, it must be located in the final position".
> How can I implement adding extra rules after the classifier?
You can use the PMMLPipeline.predict_transform(X), PMMLPipeline.predict_proba_transform(X), etc. post-processing methods. IIRC, you've done it before.
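For illustration, a minimal sketch of the predict_transform(X) route, using a synthetic dataset and a placeholder classifier (Alias and ExpressionTransformer are sklearn2pmml utilities for naming and computing derived columns; the "flag" column is made up):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.decoration import Alias
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import ExpressionTransformer

X, y = make_classification(n_samples = 100, n_features = 4, random_state = 13)

# The predict_transformer post-processes the output of predict(X);
# here it derives a boolean "flag" column from the predicted label
pipeline = PMMLPipeline([
    ("classifier", LogisticRegression())
], predict_transformer = Alias(ExpressionTransformer("X[0] == 1"), "flag", prefit = True))
pipeline.fit(X, y)

# Returns the predict(X) column plus the derived "flag" column
yt = pipeline.predict_transform(X)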
Alternatively, if you wrap your estimator object into a sklego.meta.EstimatorTransformer object, then it becomes a transformer (implements fit_transform(X) instead of fit_predict(X, y)), and the above limitation is effectively cancelled. That is, you can have multiple estimator-disguised-as-transformer objects in your pipeline, plus they can appear in positions other than the final position.
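A minimal sketch of that wrapping, assuming the scikit-lego package is installed (the estimators and dataset are placeholders):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklego.meta import EstimatorTransformer

X, y = make_classification(n_samples = 100, n_features = 4, random_state = 13)

# The wrapped logistic regression behaves like a transformer: its
# transform(X) output (the predictions) becomes the input of the next step
pipeline = Pipeline([
    ("lr_as_transformer", EstimatorTransformer(LogisticRegression())),
    ("final_classifier", DecisionTreeClassifier())
])
pipeline.fit(X, y)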
> Is it impossible to achieve?
It is impossible using canonical Scikit-Learn pipelines.
But if you're open to using 3rd-party extension packages, and doing some thinking for yourself, it's easy-peasy.
Thanks for the idea, I will look into it further.
> You can use the PMMLPipeline.predict_transform(X), PMMLPipeline.predict_proba_transform(X), etc. post-processing methods. IIRC, you've done it before.
It's actually different from what I've done before.
I used to add additional columns based on the predicted probability value.
For example, add a column score: if the probability value is 0.1, set the value of score to 100. This is very simple and can be achieved by setting predict_proba_transformer.
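As a concrete reference, that earlier approach might look roughly like this (the 1000x scaling factor is an assumption, chosen so that a probability of 0.1 maps to a score of 100; dataset and classifier are placeholders):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.decoration import Alias
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import ExpressionTransformer

X, y = make_classification(n_samples = 100, n_features = 4, random_state = 13)

# The "score" column is derived from the event probability
# (predict_proba column at index 1)
pipeline = PMMLPipeline([
    ("classifier", LogisticRegression())
], predict_proba_transformer = Alias(ExpressionTransformer("X[1] * 1000"), "score", prefit = True))
pipeline.fit(X, y)
yt = pipeline.predict_proba_transform(X)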
But what we want to achieve now is to modify the probability value according to a certain column of the original data.
The difficulty is that after the classifier, the original columns have been replaced by two columns of probability values, and the original data can no longer be retrieved!
Of course, I realize now that this has nothing to do with sklearn2pmml, and I probably shouldn't be asking here.
> It's actually different from what I've done before.
... meaning you've already done things that you thought were impossible to do!
> But what we want to achieve now is to modify the probability value according to a certain column of the original data.
I understand your problem very well.
By default, Scikit-Learn pipelines are linear, which means that the output of one pipeline step is passed to the next pipeline step, and there is no way for this "next step" to reference initial data (you want to do this!), or the output of some arbitrary earlier step (I typically want to do this).
The solution should be obvious: make your Scikit-Learn pipeline non-linear, for example by utilizing sklearn.pipeline.FeatureUnion. There are 3rd-party extension packages that give you much more powerful pipeline abstraction tools than FeatureUnion.
Or you can develop a custom (meta-)transformer class to achieve your goal.
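As a hedged sketch of the FeatureUnion route, combined with the EstimatorTransformer idea from earlier (the column index 11 and the correction rule are hypothetical, and the lambdas make this an in-memory illustration only, since they would not survive PMML conversion):

import numpy

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklego.meta import EstimatorTransformer

X, y = make_classification(n_samples = 100, n_features = 12, random_state = 13)

# The union emits two columns side by side: the classifier's prediction,
# and the untouched 12th input column
union = FeatureUnion([
    ("prediction", EstimatorTransformer(LogisticRegression())),
    ("col12", FunctionTransformer(lambda X: numpy.asarray(X)[:, [11]]))
])

# The post-processing step sees both columns: it forces the prediction
# to 0 whenever the 12th column is greater than 0
pipeline = Pipeline([
    ("union", union),
    ("postprocess", FunctionTransformer(lambda Xt: numpy.where(Xt[:, 1] > 0, 0, Xt[:, 0])))
])
pipeline.fit(X, y)
corrected = pipeline.transform(X)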
> I realize now that this has nothing to do with sklearn2pmml, and I probably shouldn't be asking here.
SkLearn2PMML is designed around the PMML abstraction of data flows - lazily evaluatable graph of computations (on scalar values). This is much richer than the Scikit-Learn abstraction of a linear sequence of computations.
If you ask here, you may get new ideas. It's up to you what happens after that.
Hi Villu @vruusmann,
After a day of continuous exploration, I finally found a way to achieve my needs. Although it may not be consistent with the expected approach, it already satisfies me.
> ... meaning you've already done things that you thought were impossible to do!
This sentence gave me and my partner great encouragement.
But now I have to ask you for help, because this largely needs to be solved by sklearn2pmml!
I created several custom classifiers and then used VotingClassifier to make the classification process non-serial.
But I encountered a similar problem as before: it works fine in the pipeline, but it cannot be saved as a PMML file!
When I saved my pipeline, the following error occurred:
Exception in thread "main" java.lang.IllegalArgumentException: List attribute 'sklearn.ensemble._voting.VotingClassifier.estimators_' contains an unsupported value (Python class __main__.CustomClassifier)
at org.jpmml.python.CastFunction.apply(CastFunction.java:47)
at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:651)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
at sklearn.ensemble.voting.VotingClassifier.encodeModel(VotingClassifier.java:59)
at sklearn.Estimator.encode(Estimator.java:118)
at sklearn2pmml.pipeline.PMMLPipeline.encodePMML(PMMLPipeline.java:187)
at com.sklearn2pmml.Main.run(Main.java:91)
at com.sklearn2pmml.Main.main(Main.java:66)
Caused by: java.lang.ClassCastException: Cannot cast net.razorvine.pickle.objects.ClassDict to sklearn.Classifier
at java.lang.Class.cast(Class.java:3369)
at org.jpmml.python.CastFunction.apply(CastFunction.java:45)
... 7 more
Because I have a lot of code, it is not convenient for me to show it completely, but I believe the error is caused by my custom class below. Is a custom classifier written this way unable to be saved? Can you see the problem?
import pandas as pd

from sklearn.base import BaseEstimator, ClassifierMixin

class CustomClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self):
        pass

    def fit(self, X, y = None):
        return self

    def predict(self, X):
        X = X[['feature1', 'feature2', 'feature3']]
        # Predict 1 if the row contains no missing values, -99 otherwise
        pred = [1 if not any(pd.isna(row)) else -99 for row in X.values]
        return pred

    def predict_proba(self, X):
        X = X[['feature1', 'feature2', 'feature3']]
        pred = [1 if not any(pd.isna(row)) else -99 for row in X.values]
        # Expand the prediction into a two-column probability layout
        return [[1 - p, p] for p in pred]
If you need more information to determine the problem, please let me know.
Thanks again!
You have defined a custom class __main__.CustomClassifier, which is not registered with the JPMML-SkLearn backend, hence it is not recognized by the converter.
When I look at the business logic of this class, it seems to be equivalent to the standard sklearn.dummy.DummyClassifier estimator:
dummy_clf = DummyClassifier(strategy = ...)
Reading the opening comment of this issue, it seems to me that your general workflow would be something like this:
if X[11] > 0:
    return dummy_clf.predict_proba(X)
else:
    return my_normal_pipeline.predict_proba(X)
It is possible to construct "conditionally evaluated" estimator ensembles using the sklearn2pmml.ensemble.EstimatorChain meta-estimator class:
from sklearn2pmml.ensemble import EstimatorChain

classifier = EstimatorChain([
    ("dummy", dummy_clf, "X[11] > 0"),
    ("normal_pipeline", my_normal_pipeline, "X[11] <= 0")
], multioutput = False)
Alternatively, you could use the sklearn2pmml.ensemble.SelectFirstClassifier meta-estimator class.
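A sketch of that alternative, mirroring the EstimatorChain example above (dummy_clf is spelled out here; my_normal_pipeline again stands for the real model pipeline):

from sklearn.dummy import DummyClassifier
from sklearn2pmml.ensemble import SelectFirstClassifier

dummy_clf = DummyClassifier(strategy = "constant", constant = 0)

# Each row is routed to the first classifier whose predicate evaluates to True
classifier = SelectFirstClassifier([
    ("dummy", dummy_clf, "X[11] > 0"),
    ("normal_pipeline", my_normal_pipeline, "X[11] <= 0")
])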
I'm not sure if you find this comment helpful or not, just trying to connect relevant pieces without needing to define any custom estimator types (eg. __main__.CustomClassifier).
Yes, what I need to achieve is as you said.
In essence, I just need two classifiers, one of which is used to classify rows that meet certain conditions as 0.
I followed your tip and used DummyClassifier(strategy = 'constant', constant = 0) and it seemed to work fine, but when I combined it with EstimatorChain I got a disastrous result:
AttributeError: 'EstimatorChain' object has no attribute 'predict_proba'
> You have defined a custom class __main__.CustomClassifier, which is not registered with the JPMML-SkLearn backend, hence it is not recognized by the converter.
I would still like to ask if it is possible to save the CustomClassifier that I defined above, since saving it is the only trouble I'm having.
Is it possible to register it as something sklearn2pmml recognizes?
Here's the thing: I'm building a machine learning pipeline using PMMLPipeline, because I need to correct the prediction probability according to the value of a certain column after the classifier predicts.
For example, I have 12 columns of data. I need to use the first 11 columns for prediction, and then correct the prediction result according to the value of the 12th column (if the twelfth column is greater than 0, correct the prediction result to 0).
So my code is written as follows.
But it prompts the following error:
It seems that adding an extra step after the classifier is not allowed?
How can I implement adding extra rules after the classifier? Because I noticed that the result after the classifier seems to be an array.
I expect to use the values of certain columns to modify the probability that the prediction is 0 or 1. Is it impossible to achieve?