jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Support of trees without using noTrueChildStrategy=returnLastPrediction #143

Closed r1551z closed 4 years ago

r1551z commented 4 years ago

I noticed that in the output XGBoost tree PMML file, the code does not list all nodes. Below is an example tree. There is no node for 'A < 1.5' or 'A < 2.5'.

As I understand it, the missing nodes do not cause an issue when noTrueChildStrategy="returnLastPrediction" is supported. However, when noTrueChildStrategy="returnLastPrediction" is not supported (we are using software that imports PMML files, but does not support returnLastPrediction), the model will not generate a valid prediction.

I'm wondering if there is a way to output all nodes, so that even without noTrueChildStrategy="returnLastPrediction" the model will still be good to use.

<TreeModel functionName="regression" noTrueChildStrategy="returnLastPrediction" x-mathContext="float">
  <MiningSchema>
    <MiningField name="A"/>
    <MiningField name="B"/>
  </MiningSchema>
  <Node score="0.051232874">
    <True/>
    <Node score="0.023004694">
      <SimplePredicate field="A" operator="greaterOrEqual" value="1.5"/>
      <Node score="0.05652174">
        <SimplePredicate field="A" operator="greaterOrEqual" value="2.5"/>
      </Node>
    </Node>
    <Node score="0.0030303032">
      <SimplePredicate field="B" operator="greaterOrEqual" value="4.5"/>
    </Node>
  </Node>
</TreeModel>
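For illustration, here is the decision logic that the compact tree above encodes under returnLastPrediction semantics, written out as a minimal Python sketch (the scores are taken directly from the tree; when no child predicate matches, the score of the last matched node is returned):

# Decision logic of the compact tree above, assuming
# noTrueChildStrategy="returnLastPrediction" semantics.
def predict(A, B):
    if A >= 1.5:
        if A >= 2.5:
            return 0.05652174
        return 0.023004694  # no true child of the "A >= 1.5" node
    if B >= 4.5:
        return 0.0030303032
    return 0.051232874  # no true child of the root node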
vruusmann commented 4 years ago

I noticed that in the output XGBoost tree PMML file, the code does not list all nodes.

Most application scenarios expect PMML documents to be as compact and concise as possible. In the current case, switching from the binary tree representation to a linearized and flattened tree representation saves 50% of the storage and evaluation cost.

I'm wondering if there is a way to output all nodes,

The decision tree representation (default binary split vs. optimized multi-way split) is controlled by conversion options.

The SkLearn2PMML package lets you configure this using the PMMLPipeline.configure(**pmml_options) method:

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
from xgboost import XGBClassifier

pipeline = PMMLPipeline([
  ("estimator", XGBClassifier())
])
pipeline.fit(X, y)

# XGBoost native binary split representation
pipeline.configure(compact = False)
sklearn2pmml(pipeline, "default_model.pmml")

# JPMML optimized representation
pipeline.configure(compact = True)
sklearn2pmml(pipeline, "optimized_model.pmml")
r1551z commented 4 years ago

Thank you very much. I also noticed that when I directly put the XGBClassifier/LGBMClassifier in the pipeline, using compact = False returns the binary tree representation. However, if I use a more complicated structure and put them into a StackingClassifier, the trees seem to be compact again. Please see the Python code below.

from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn_pandas import DataFrameMapper

# Pass through all columns except index 46
allButOne = ColumnTransformer(
    [(str(cont_index), "passthrough", [cont_index]) for cont_index in range(46)] +
    [(str(cont_index), "passthrough", [cont_index]) for cont_index in range(47, 57)])

# Pass through only column index 46
onlyOne = ColumnTransformer([(str(cont_index), "passthrough", [cont_index]) for cont_index in [46]])

estimator1 = Pipeline(steps = [
    ("Process", allButOne),
    ("Estimator", LGBMClassifier())
])

estimator2 = Pipeline(steps = [
    ("Process", onlyOne),
    ("Estimator", LogisticRegression(multi_class = "multinomial"))
])

estimator = StackingClassifier([
    ("first", estimator1),
    ("second", estimator2)
], final_estimator = LogisticRegression(multi_class = "multinomial"))

pipeline = PMMLPipeline([
    ("domain", DataFrameMapper([
        (list(X.columns), ContinuousDomain(invalid_value_treatment = "as_is"))
    ])),
    ("ensemble", estimator)
])
pipeline.fit(X_tv.iloc[:, :], y_tv.iloc[:])
pipeline.configure(compact = False, flat = False, winner_id = True)
sklearn2pmml(pipeline, "pipeline.pmml")
vruusmann commented 4 years ago

Please see the python code below

This Python code is closely related to that of https://github.com/jpmml/jpmml-sklearn/issues/141.

The PMMLPipeline.configure(**pmml_options) method modifies the final estimator of the pipeline (by setting its pmml_options_ attribute to the **pmml_options dict).

The JPMML-SkLearn library respects the pmml_options_ attribute on all estimators in the pipeline. You can set it manually anytime, anywhere:

classifier = XGBClassifier()
# Conversion options attached directly to the estimator object
classifier.pmml_options_ = dict(compact = False)
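
Applied to the StackingClassifier pipeline above, one possible approach (a sketch only; the exact attribute path into the fitted estimators is my assumption, not verified output) would be to attach the options to the nested LGBMClassifier before calling sklearn2pmml:

# Sketch: reach the fitted LGBMClassifier nested inside the StackingClassifier
# (the "ensemble" step of the fitted PMMLPipeline) and attach conversion options to it,
# because PMMLPipeline.configure(..) only touches the final estimator of the pipeline.
stacking = pipeline.named_steps["ensemble"]
lgbm = stacking.estimators_[0].named_steps["Estimator"]
lgbm.pmml_options_ = dict(compact = False)

sklearn2pmml(pipeline, "pipeline.pmml")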