Closed r1551z closed 4 years ago
I noticed in the output XGBoost Tree pmml file, the code does not list all nodes.
Most application scenarios expect PMML documents to be as compact and concise as possible. In the current case, switching from binary tree representation to linearized & flattened tree representation saves 50% of storage and evaluation cost.
I'm wondering if there is a way which would output all nodes,
The decision tree representation (default binary splits vs. optimized multi-way splits) is controlled by conversion options. The SkLearn2PMML package lets you configure this using the PMMLPipeline.configure(**pmml_options) method:
pipeline = PMMLPipeline([
    ("estimator", XGBClassifier())
])
pipeline.fit(X, y)

# XGBoost native binary split representation
pipeline.configure(compact = False)
sklearn2pmml(pipeline, "default_model.pmml")

# JPMML optimized representation
pipeline.configure(compact = True)
sklearn2pmml(pipeline, "optimized_model.pmml")
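One way to see the difference between the two representations (a sketch, assuming the two files above were written successfully; the helper name count_nodes is made up for illustration) is to count the Node elements in each PMML file:

```python
import xml.etree.ElementTree as ET

def count_nodes(pmml_path):
    # PMML files carry a version-specific XML namespace (e.g.
    # "{http://www.dmg.org/PMML-4_4}Node"), so compare the local
    # tag name only.
    tree = ET.parse(pmml_path)
    return sum(1 for el in tree.iter() if el.tag.split("}")[-1] == "Node")

# Hypothetical comparison of the two files produced above:
# count_nodes("default_model.pmml") vs. count_nodes("optimized_model.pmml")
```

The compact representation should report noticeably fewer nodes than the default binary split representation.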
Thank you very much. I also noticed that when I put the XGBClassifier/LGBMClassifier directly into the pipeline, using compact = False returns the binary tree representation. However, if I use a more complicated structure, putting them into a StackingClassifier, the trees seem to be compact again. Please see the Python code below
allButOne = ColumnTransformer(
    [(str(cont_index), "passthrough", [cont_index]) for cont_index in range(46)] +
    [(str(cont_index), "passthrough", [cont_index]) for cont_index in range(47, 57)]
)
onlyOne = ColumnTransformer([(str(cont_index), "passthrough", [cont_index]) for cont_index in [46]])

estimator1 = Pipeline(steps = [
    ("Process", allButOne),
    ("Estimator", LGBMClassifier())
])
estimator2 = Pipeline(steps = [
    ("Process", onlyOne),
    ("Estimator", LogisticRegression(multi_class = "multinomial"))
])
estimator = StackingClassifier([
    ("first", estimator1),
    ("second", estimator2),
], final_estimator = LogisticRegression(multi_class = "multinomial"))

pipeline = PMMLPipeline([
    ("domain", DataFrameMapper([
        (list(X.columns), ContinuousDomain(invalid_value_treatment = "as_is"))
    ])),
    ("ensemble", estimator)
])
pipeline.fit(X_tv.iloc[:, :], y_tv.iloc[:])
pipeline.configure(compact = False, flat = False, winner_id = True)
sklearn2pmml(pipeline, "pipeline.pmml")
Please see the python code below
This Python code is closely related to that of https://github.com/jpmml/jpmml-sklearn/issues/141.
The PMMLPipeline.configure(**pmml_options) method modifies the final estimator of the pipeline (by setting its pmml_options_ attribute to the **pmml_options dict).
The JPMML-SkLearn library respects the pmml_options_ attribute on all estimators in the pipeline. You can set it manually, anytime and anywhere:
classifier = XGBClassifier()
classifier.pmml_options_ = dict(compact = False)
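For a nested setup like the StackingClassifier above, this means the options can be attached to the inner estimator before fitting. A minimal sketch (GradientBoostingClassifier stands in for LGBMClassifier/XGBClassifier here so the snippet runs without LightGBM/XGBoost installed; the attribute assignment works the same way on any estimator object):

```python
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# Attach PMML conversion options directly to the nested estimator;
# PMMLPipeline.configure() would only reach the pipeline's final estimator.
inner = GradientBoostingClassifier()
inner.pmml_options_ = dict(compact = False)

stack = StackingClassifier(
    estimators = [("gbt", inner)],
    final_estimator = LogisticRegression(),
)

# The options travel with the estimator object, wherever it sits:
assert stack.estimators[0][1].pmml_options_ == {"compact": False}
```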
I noticed that in the output XGBoost tree PMML file, the code does not list all nodes. Below is an example tree. There are no nodes for 'A<1.5' or 'A<2.5'.
In my understanding, the missing nodes will not cause an issue as long as noTrueChildStrategy="returnLastPrediction" is honored. However, when noTrueChildStrategy="returnLastPrediction" is not supported (we are using software which imports PMML files but does not support returnLastPrediction), the model won't generate a valid prediction.
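To illustrate why the elided nodes matter, here is a hedged sketch of the evaluation semantics (a deliberately simplified model, not the JPMML implementation): when no child predicate matches, noTrueChildStrategy decides whether the parent node's own score is returned as a fallback or the prediction is null.

```python
def evaluate(node, row, no_true_child = "returnNullPrediction"):
    # node: {"predicate": callable, "score": value, "children": [...]}
    for child in node.get("children", []):
        if child["predicate"](row):
            return evaluate(child, row, no_true_child)
    if node.get("children"):
        # No child matched: a compact tree relies on returnLastPrediction
        # to fall back to this node's own score.
        return node.get("score") if no_true_child == "returnLastPrediction" else None
    return node.get("score")

# Compact tree: the "A >= 1.5" branch was elided; the root's score (0.1)
# is the intended fallback via returnLastPrediction.
tree = {
    "predicate": lambda row: True,
    "score": 0.1,
    "children": [
        {"predicate": lambda row: row["A"] < 1.5, "score": 0.5, "children": []},
    ],
}
```

With row A=2.0, no child matches: returnLastPrediction yields the root's score, while an engine that ignores the strategy yields no prediction at all.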
I'm wondering if there is a way to output all nodes, so that even without noTrueChildStrategy="returnLastPrediction", the model will still be usable.