PMML supports for choosing model when the condition is satisfied

liupei101 commented 6 years ago

Hi, Contributors! I have a workflow involving sklearn2pmml, which is listed below:

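# Example
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# iris_df is assumed to be a pandas DataFrame holding the Iris dataset
pipeline = PMMLPipeline([
    ("classifier", DecisionTreeClassifier())
])
pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])
sklearn2pmml(pipeline, "DecisionTreeIris.pmml", with_repr = True)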

# My workflow 
pipeline = PMMLPipeline([
       {
           "X['Widths'] > 20": ("classifier", DecisionTreeClassifier()),
           "X['Widths'] < 20": ("classifier", XGBClassifier())
       }
])
pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])
sklearn2pmml(pipeline, "DecisionTreeIris.pmml", with_repr = True)

I looked at the basic usage of sklearn2pmml and see that it can convert a trained model to PMML, but I don't know how to implement my workflow!

Does sklearn2pmml support choosing a model when a condition is satisfied?

thx!

vruusmann commented 6 years ago

Does sklearn2pmml support choosing a model when a condition is satisfied?

Is your workflow valid Python/Scikit-Learn syntax in the first place?

PMML can represent it using the model segmentation approach: http://dmg.org/pmml/v4-3/MultipleModels.html

In brief, there would be a top-level MiningModel element, which contains a TreeModel child element and a MiningModel child element (the latter is for XGBoost). Each of these two segments is associated with a predicate that determines whether it should be selected or not.
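
As a rough illustration of that structure, the following Python sketch builds only the skeleton of such a segmented document with the standard library. The element and attribute names (MiningModel, Segmentation, Segment, SimplePredicate, multipleModelMethod="selectFirst") come from the PMML specification linked above. The field name Widths and the threshold 20 are borrowed from the example in this thread, and the helper function `segment` is hypothetical. The model elements are left as empty stubs, so the output is not a complete, valid PMML file.

```python
import xml.etree.ElementTree as ET


def segment(parent, seg_id, field, operator, value, model_tag):
    """Append one <Segment> holding a predicate and an empty model stub."""
    seg = ET.SubElement(parent, "Segment", id=seg_id)
    # The predicate decides whether this segment applies to a given record.
    ET.SubElement(seg, "SimplePredicate", field=field, operator=operator, value=value)
    # Stub only: a real model element also needs a MiningSchema and its content.
    ET.SubElement(seg, model_tag, functionName="classification")
    return seg


# Top-level MiningModel whose Segmentation picks the first matching segment.
top = ET.Element("MiningModel", functionName="classification")
segmentation = ET.SubElement(top, "Segmentation", multipleModelMethod="selectFirst")
segment(segmentation, "1", "Widths", "greaterOrEqual", "20", "TreeModel")
segment(segmentation, "2", "Widths", "lessThan", "20", "MiningModel")  # XGBoost

ET.indent(top)  # pretty-printing, available in Python 3.9+
pmml_skeleton = ET.tostring(top, encoding="unicode")
print(pmml_skeleton)
```

With selectFirst, segments are tried in document order, so the order of the two `segment` calls matters whenever predicates overlap.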

In JPMML-SkLearn/SkLearn2PMML this can be implemented by introducing a custom estimator class.

vruusmann commented 6 years ago

Pseudo-code showing the usage of this custom estimator class:

pipeline = PMMLPipeline([
  ("classifier", ModelSelector([
    ("X['Widths'] >= 20", DecisionTreeClassifier()),
    ("X['Widths'] < 20", XGBClassifier()),
  ]))
])

I wonder how you would fit such a workflow? Is the goal to split the training dataset between two child models already during the training?

liupei101 commented 6 years ago

First of all, thank you very much! I am sorry for not explaining my problem clearly.

I want to build a web application that predicts risk for patients. The application should serve two independent populations (for example, people with and without an X-ray inspection), using a separate predictive model for each.

So I need to follow the logic below (pseudo-code):

if the patient has an X-ray inspection:
    # trainset: (train_X_with_xray, train_y_with_xray)
    # base estimator: XGBoost Classifier
    # fitted by training data involving variables related to the result of X-ray inspection.
    Model1 = model(...)
    # predict
    risk = Model1.predict()
else if the patient has no X-ray inspection:
    # trainset: (train_X_without_xray, train_y_without_xray)
    # base estimator: XGBoost Classifier
    # fitted by training data not involving variables related to the result of X-ray inspection.
    Model2 = model(...)
    # predict
    risk = Model2.predict()

The problem is that I need a single PMML file that returns the result once a patient's information is passed to it, rather than two PMML files (one for patients with an X-ray inspection, the other for patients without) combined with if-else logic in front-end JavaScript.

@vruusmann Thanks for your pseudo-code showing how this custom estimator class would be used. I will look further into ModelSelector. Could you also give some suggestions about my problem when it is convenient for you?

Thank you very much!

vruusmann commented 6 years ago

This custom class should actually be named ModelChoice, because the suffix "Selector" already has a special meaning in Scikit-Learn (feature selectors).

So, class ModelChoice should implement both fit() and predict() functionality:

During fit(), every member model is trained on the subset of the training dataset for which its predicate evaluates to True.
During predict(), each prediction is made by the first model whose predicate evaluates to True.
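
A minimal Python-side sketch of the fit() and predict() behaviour described above might look like this. It is a prototype under stated assumptions, not the sklearn2pmml implementation: the class body, the `steps` parameter, the `fitted_` attribute, and the use of eval() on predicate strings are all hypothetical. It only assumes that X is a pandas DataFrame and that each member model exposes fit() and predict(). A real version would presumably subclass Scikit-Learn's BaseEstimator and would need matching converter support on the JPMML-SkLearn side.

```python
import copy

import numpy as np
import pandas as pd


class ModelChoice:
    """Hypothetical sketch: route rows to the first model whose predicate holds."""

    def __init__(self, steps):
        # steps: list of (predicate_string, estimator) pairs, for example
        # ("X['Widths'] >= 20", DecisionTreeClassifier())
        self.steps = steps

    def _mask(self, predicate, X):
        # Evaluate the predicate with the DataFrame bound to the name X.
        # eval() is only acceptable for trusted, developer-written predicates.
        mask = eval(predicate, {"__builtins__": {}}, {"X": X})
        return np.asarray(mask, dtype=bool)

    def fit(self, X, y):
        y = pd.Series(np.asarray(y), index=X.index)
        self.fitted_ = []
        for predicate, estimator in self.steps:
            mask = self._mask(predicate, X)
            # Train a copy of each member model on its own subset of rows.
            model = copy.deepcopy(estimator)
            model.fit(X[mask], y[mask])
            self.fitted_.append((predicate, model))
        return self

    def predict(self, X):
        result = np.empty(len(X), dtype=object)
        unassigned = np.ones(len(X), dtype=bool)
        for predicate, model in self.fitted_:
            # Only rows not yet claimed by an earlier predicate are considered.
            mask = self._mask(predicate, X) & unassigned
            if mask.any():
                result[mask] = model.predict(X[mask])
                unassigned &= ~mask
        if unassigned.any():
            raise ValueError("No predicate matched some of the rows")
        return result
```

In this sketch, fit() trains each model on every row that satisfies its predicate, even if predicates overlap, while predict() gives priority to the earlier entry in `steps`. Rows that match no predicate raise an error rather than being silently skipped.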

This solution wouldn't be too difficult to implement, because there is a reusable predicate translator component already available: https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/javacc/predicate.jj

@liupei101 My schedule is pretty tight during the next week. If you want to speed things up, then you could prototype the Python side of the ModelChoice class yourself.

vruusmann commented 6 years ago

Reopening, because this is interesting functionality that should be implemented.

guleatoma commented 5 years ago

Hey! I have the exact same issue. I tried to handle it through preprocessing and Ruleset, but couldn't make it work. Any update on this?

Thanks a lot.

avogels commented 5 years ago

Hello, I would be very interested in this feature as well! Thanks and regards.

jpmml / sklearn2pmml

PMML supports for choosing model when the condition is satisfied #110