jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

PMML supports for choosing model when the condition is satisfied #110

Closed liupei101 closed 5 years ago

liupei101 commented 5 years ago

Hi, Contributors! I have a workflow involving sklearn2pmml, which is listed below:

# Example
pipeline = PMMLPipeline([
    ("classifier", DecisionTreeClassifier())
])
pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])
sklearn2pmml(pipeline, "DecisionTreeIris.pmml", with_repr = True)

# My workflow 
pipeline = PMMLPipeline([
       {
           "X['Widths'] > 20": ("classifier", DecisionTreeClassifier()),
           "X['Widths'] < 20": ("classifier", XGBClassifier())
       }
])
pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])
sklearn2pmml(pipeline, "DecisionTreeIris.pmml", with_repr = True)

I looked through the basic usage of sklearn2pmml and see that it can convert a trained model to PMML, but I don't know how to implement my workflow!

Does sklearn2pmml support choosing a model depending on which condition is satisfied?

thx!

vruusmann commented 5 years ago

Does sklearn2pmml support choosing a model depending on which condition is satisfied?

Is your workflow valid Python/Scikit-Learn syntax in the first place?

PMML can represent it using the model segmentation approach: http://dmg.org/pmml/v4-3/MultipleModels.html

In brief, there would be a top-level MiningModel element that contains a TreeModel and a MiningModel (the latter for XGBoost) as child elements. Each of the two segments is associated with a predicate that determines whether it should be selected.
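To make the segment layout concrete, here is a rough sketch that builds only the skeleton of such a document with Python's standard xml.etree.ElementTree. The element and attribute names (Segmentation, Segment, SimplePredicate, multipleModelMethod="selectFirst") come from the PMML specification linked above. The helper name build_segmented_model is made up for illustration, and the field and threshold are borrowed from the example in this thread. MiningSchema and the actual model content are omitted, so the output is not a valid PMML file by itself.

```python
import xml.etree.ElementTree as ET


def build_segmented_model(field, threshold):
    """Hypothetical helper: skeleton MiningModel with two predicate-guarded segments."""
    mining_model = ET.Element("MiningModel", functionName="classification")
    # "selectFirst" means: use the first segment whose predicate evaluates to true
    segmentation = ET.SubElement(
        mining_model, "Segmentation", multipleModelMethod="selectFirst"
    )

    # Segment 1: the decision tree, selected when field >= threshold
    seg1 = ET.SubElement(segmentation, "Segment", id="1")
    ET.SubElement(
        seg1, "SimplePredicate",
        field=field, operator="greaterOrEqual", value=str(threshold),
    )
    ET.SubElement(seg1, "TreeModel", functionName="classification")

    # Segment 2: a nested MiningModel (the XGBoost ensemble), selected when field < threshold
    seg2 = ET.SubElement(segmentation, "Segment", id="2")
    ET.SubElement(
        seg2, "SimplePredicate",
        field=field, operator="lessThan", value=str(threshold),
    )
    ET.SubElement(seg2, "MiningModel", functionName="classification")
    return mining_model


root = build_segmented_model("Widths", 20)
print(ET.tostring(root, encoding="unicode"))
```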

In JPMML-SkLearn/SkLearn2PMML this can be implemented by introducing a custom estimator class.

vruusmann commented 5 years ago

Pseudo-code showing how this custom estimator class would be used:

pipeline = PMMLPipeline([
  ("classifier", ModelSelector([
    ("X['Widths'] >= 20", DecisionTreeClassifier()),
    ("X['Widths'] < 20", XGBClassifier()),
  ]))
])

I wonder how you would fit such a workflow? Is the goal to split the training dataset between two child models already during the training?

liupei101 commented 5 years ago

First of all, thank you very much! I am sorry for not explaining my problem clearly.

In fact, I want to build a web application for predicting patient risk. The application should serve two independent populations (for example, people with or without an X-ray inspection) using two corresponding predictive models.

So I should follow the logic below (pseudo-code):

if patient_has_xray:
    # Training set: (train_X_with_xray, train_y_with_xray)
    # Base estimator: XGBoost classifier
    # Fitted on training data that includes the variables derived from the X-ray inspection.
    model1 = model(...)
    # Predict
    risk = model1.predict(...)
else:
    # Training set: (train_X_without_xray, train_y_without_xray)
    # Base estimator: XGBoost classifier
    # Fitted on training data that excludes the variables derived from the X-ray inspection.
    model2 = model(...)
    # Predict
    risk = model2.predict(...)

The constraint I face is that I must use a single PMML file that returns the result once the patient's information is passed in. I do not want to use two PMML files (one for patients with an X-ray inspection, the other for patients without) combined with if-else logic in JavaScript on the web front end.

@vruusmann Thanks for the pseudo-code showing how this custom estimator class would be used. I will look further into ModelSelector. If it is convenient for you, could you give some suggestions about the problem I am facing?

Thank you very much!

vruusmann commented 5 years ago

This custom class should actually be named ModelChoice, because the suffix "Selector" has special meaning in Scikit-Learn already (feature selectors).

So, class ModelChoice should implement both fit() and predict() functionality:

This solution wouldn't be too difficult to implement, because there is a reusable predicate translator component already available: https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/javacc/predicate.jj

@liupei101 My schedule is pretty tight during the next week. If you want to speed things up, then you could prototype the Python side of ModelChoice class yourself.
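For anyone who wants to try prototyping the Python side, a minimal sketch could look like the following. It is only an illustration and not an existing sklearn2pmml class: the name ModelChoice and the (predicate, estimator) step format follow the pseudo-code earlier in this thread, predicates are evaluated with plain eval() against a pandas DataFrame called X (an assumption made for brevity, unsafe for untrusted strings), and DecisionTreeClassifier stands in for XGBClassifier to keep the example dependency-free. Each row goes to the first child whose predicate is true, both in fit() and in predict(). Rows that match no predicate are left as None, and fit() will fail if a predicate matches no training rows. Converting such a class to PMML would still need a matching converter on the JPMML-SkLearn side.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.tree import DecisionTreeClassifier


class ModelChoice(ClassifierMixin, BaseEstimator):
    """Hypothetical sketch: route each row to the first child whose predicate holds."""

    def __init__(self, steps):
        # steps: list of (predicate_string, estimator); predicates refer to the data as X
        self.steps = steps

    def _mask(self, predicate, X):
        # eval() keeps the sketch short; never use it on untrusted predicate strings
        return np.asarray(eval(predicate, {"__builtins__": {}}, {"X": X}), dtype=bool)

    def fit(self, X, y):
        y = np.asarray(y)
        unassigned = np.ones(len(X), dtype=bool)
        self.estimators_ = []
        for predicate, estimator in self.steps:
            # Each child is trained only on the rows that reach it
            mask = self._mask(predicate, X) & unassigned
            self.estimators_.append(clone(estimator).fit(X[mask], y[mask]))
            unassigned &= ~mask
        return self

    def predict(self, X):
        result = np.empty(len(X), dtype=object)  # unmatched rows stay None
        unassigned = np.ones(len(X), dtype=bool)
        for (predicate, _), estimator in zip(self.steps, self.estimators_):
            mask = self._mask(predicate, X) & unassigned
            if mask.any():
                result[mask] = estimator.predict(X[mask])
            unassigned &= ~mask
        return result


# Toy data, made up for this example
X = pd.DataFrame({"Widths": [5, 10, 15, 25, 30, 35], "Lengths": [1, 2, 3, 4, 5, 6]})
y = np.array(["a", "b", "a", "c", "d", "c"])

model = ModelChoice([
    ("X['Widths'] >= 20", DecisionTreeClassifier(random_state=0)),
    ("X['Widths'] < 20", DecisionTreeClassifier(random_state=0)),
]).fit(X, y)
print(model.predict(X))
```

Splitting the data inside fit() answers the earlier question about training in one way: every child sees only its own sub-population, which matches the two-population use case described above.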

vruusmann commented 5 years ago

Reopening, because this is an interesting functionality that should be implemented.

guleatoma commented 5 years ago

Hey! I have the exact same issue. I tried to handle it through preprocessing and Ruleset, but couldn't make it work. Any update on this?

Thanks a lot.

avogels commented 5 years ago

Hello, I would be very interested in this feature as well! Thanks and regards.