Divergence in HistGradientBoostingClassifier's scores

onnx / sklearn-onnx

Convert scikit-learn models and pipelines to ONNX

Apache License 2.0

Divergence in HistGradientBoostingClassifier's scores #1051

Open maximilianeber opened 11 months ago

maximilianeber commented 11 months ago

Hi,

I am trying to build a standard pipeline for tabular data that works nicely with ONNX. Ideally, the pipeline would:

1. Be based on boosted trees
2. Gracefully support mixed types (categorical/numerical)
3. Exploit boosted trees' native support for categoricals
4. Exploit boosted trees' native support for missing values

To keep debugging simple, I have built a pipeline that covers points 1-3. Preprocessing works fine, but HistGradientBoostingClassifier returns different predictions (see gist).
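For anyone reproducing this, it can help to quantify the divergence before digging into the converter. The following is a minimal sketch in plain Python, not taken from the gist: the helper name `compare_probabilities` and the sample numbers are made up for illustration. It assumes you already have two probability matrices, for example one from the scikit-learn pipeline's `predict_proba` and one from the ONNX model's probability output.

```python
from typing import List, Sequence, Tuple


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row (first one wins on ties)."""
    return max(range(len(row)), key=lambda j: row[j])


def compare_probabilities(
    expected: Sequence[Sequence[float]],
    actual: Sequence[Sequence[float]],
) -> Tuple[float, List[int]]:
    """Return (largest absolute difference, rows whose predicted label differs)."""
    if len(expected) != len(actual):
        raise ValueError("both inputs must have the same number of rows")
    max_diff = 0.0
    label_mismatches: List[int] = []
    for i, (row_e, row_a) in enumerate(zip(expected, actual)):
        if len(row_e) != len(row_a):
            raise ValueError(f"row {i} has a different number of classes")
        for p_e, p_a in zip(row_e, row_a):
            max_diff = max(max_diff, abs(p_e - p_a))
        if argmax(row_e) != argmax(row_a):
            label_mismatches.append(i)
    return max_diff, label_mismatches


# Hypothetical values standing in for the two models' outputs.
sklearn_proba = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
onnx_proba = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]

max_diff, label_mismatches = compare_probabilities(sklearn_proba, onnx_proba)
print(max_diff, label_mismatches)
```

A difference around float32 rounding error is usually harmless, whereas large gaps or flipped labels on specific rows suggest that some splits are being evaluated differently.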

Any ideas why this might happen? Are there known issues with HistGradientBoostingClassifier?

Thank you!

Package versions:

scikit-learn==1.3.*
skl2onnx==1.16.*
onnxruntime==1.16.*

maximilianeber commented 10 months ago

After some digging, I think this might be related to missing categorical support — everything works as expected when using one-hot encoding in the preprocessor.

@xadupre I am happy to try filing a PR if you think it's a good idea to add support for categoricals. Wdyt?

xadupre commented 10 months ago

I have not checked their implementation recently, but if scikit-learn supports categories the same way lightgbm does, I guess they use the rule if x in set(cat1, cat2, ...), which is not supported by onnx. onnxmltools deals with that case by multiplying nodes (https://github.com/onnx/onnxmltools/blob/main/onnxmltools/convert/lightgbm/operator_converters/LightGbm.py#L841), but the best way would be to update onnx to support that rule. That said, I do think it is a good idea to support categorical features.
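To make the "multiplying nodes" idea concrete, here is a small self-contained sketch. It is not the onnxmltools code: the classes `MemberSplit` and `EqSplit` and the function `expand_membership` are hypothetical names, and the tree has a single feature for brevity. It only illustrates how one set-membership split can be rewritten as a chain of single-value equality tests, which is the kind of test an equality branch mode can express.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass
class Leaf:
    value: float


@dataclass
class MemberSplit:
    """Categorical split: go left when x is in `categories`."""
    categories: FrozenSet[int]
    left: "Node"
    right: "Node"


@dataclass
class EqSplit:
    """Single equality test: go left when x == category."""
    category: int
    left: "Node"
    right: "Node"


Node = Union[Leaf, MemberSplit, EqSplit]


def expand_membership(node: Node) -> Node:
    """Rewrite every MemberSplit into a chain of EqSplit nodes."""
    if isinstance(node, Leaf):
        return node
    left = expand_membership(node.left)
    right = expand_membership(node.right)
    if isinstance(node, EqSplit):
        return EqSplit(node.category, left, right)
    # Build the chain backwards so the final "else" branch is `right`.
    chain: Node = right
    for cat in sorted(node.categories, reverse=True):
        chain = EqSplit(cat, left, chain)
    return chain


def predict(node: Node, x: int) -> float:
    """Walk the tree for a single categorical value x."""
    while not isinstance(node, Leaf):
        if isinstance(node, MemberSplit):
            node = node.left if x in node.categories else node.right
        else:
            node = node.left if x == node.category else node.right
    return node.value


def count_splits(node: Node) -> int:
    """Number of decision nodes reachable from `node`."""
    if isinstance(node, Leaf):
        return 0
    return 1 + count_splits(node.left) + count_splits(node.right)


tree = MemberSplit(frozenset({0, 3, 5}), Leaf(1.0), Leaf(-1.0))
expanded = expand_membership(tree)
same = all(predict(tree, x) == predict(expanded, x) for x in range(8))
print(same, count_splits(tree), count_splits(expanded))
```

The cost is visible in the node count: a split over k categories becomes k decision nodes, so trees with large category sets grow quickly. That is presumably why native support for the membership rule in onnx is the preferred fix.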

xadupre commented 7 months ago

The right way of doing it is to implement the latest onnx specification (https://github.com/onnx/onnx/pull/5874) and then update onnxruntime to support it.

ogencoglu commented 5 months ago

The problem with one-hot encoding is that histogram gradient boosting might learn weird interactions between the one-hot encoded columns during training. Therefore, the result might not be the same as declaring that feature as categorical in the model definition with categorical_features: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor

maximilianeber commented 5 months ago

> The right way of doing it is to implement the latest onnx specifications (onnx/onnx#5874) and then to update onnxruntime to support it.

Sorry for being so late in replying. Sadly, we haven't found the capacity to contribute upstream this quarter. 👎

> Therefore, it might not be the same as specifying that feature as categorical in model definition with categorical_features

Agreed. The other downside of one-hot encoding is that you need a lot of memory when the cardinality of the categorical feature(s) is high.
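As a rough back-of-the-envelope illustration of the memory point, the sketch below compares a dense one-hot matrix with one ordinal-encoded column per feature. The function names and the row count and cardinalities are hypothetical, and it assumes dense float32 storage; a sparse one-hot representation would need far less.

```python
from typing import Sequence


def dense_one_hot_bytes(
    n_rows: int, cardinalities: Sequence[int], bytes_per_value: int = 4
) -> int:
    """Bytes for a dense one-hot matrix: one column per category of each feature."""
    return n_rows * sum(cardinalities) * bytes_per_value


def ordinal_bytes(
    n_rows: int, cardinalities: Sequence[int], bytes_per_value: int = 4
) -> int:
    """Bytes when each categorical feature is stored as a single encoded column."""
    return n_rows * len(cardinalities) * bytes_per_value


# Hypothetical dataset: one million rows, three categorical features.
n_rows = 1_000_000
cardinalities = [200, 150, 50]

one_hot = dense_one_hot_bytes(n_rows, cardinalities)
ordinal = ordinal_bytes(n_rows, cardinalities)
print(f"one-hot: {one_hot / 1e9:.1f} GB, ordinal: {ordinal / 1e6:.0f} MB")
```

Under these assumptions the one-hot matrix needs 400 columns instead of 3, so the gap grows linearly with the total number of categories.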

adityagoel4512 commented 3 months ago

> The right way of doing it is to implement the latest onnx specifications (onnx/onnx#5874) and then to update onnxruntime to support it.

I think an update to onnxruntime is pending review :)