onnx / sklearn-onnx

Convert scikit-learn models and pipelines to ONNX
Apache License 2.0

Divergence in HistGradientBoostingClassifier's scores #1051

Open maximilianeber opened 9 months ago

maximilianeber commented 9 months ago

Hi,

I am trying to build a standard pipeline for tabular data that works nicely with ONNX. Ideally, the pipeline would:

  1. Be based on boosted trees
  2. Gracefully support mixed types (categorical/numerical)
  3. Exploit boosted trees' native support for categoricals
  4. Exploit boosted trees' native support for missing values

To keep debugging simple, I have built a pipeline that covers points 1-3. Preprocessing works fine, but HistGradientBoostingClassifier returns different predictions (see gist).

Any ideas why this might happen? Are there known issues with HistGradientBoostingClassifier?

Thank you!

Package versions:

scikit-learn==1.3.*
skl2onnx==1.16.*
onnxruntime==1.16.*

maximilianeber commented 9 months ago

After some digging, I think this might be related to missing support for categorical features: everything works as expected when I use one-hot encoding in the preprocessor.

@xadupre I am happy to try filing a PR if you think it's a good idea to add support for categoricals. Wdyt?

xadupre commented 9 months ago

I have not checked their implementation recently, but if scikit-learn supports categories the same way lightgbm does, I guess they use the rule if x in set(cat1, cat2, ...), which is not supported by onnx. onnxmltools deals with that case by multiplying nodes (https://github.com/onnx/onnxmltools/blob/main/onnxmltools/convert/lightgbm/operator_converters/LightGbm.py#L841), but the best way would be to update onnx to support that rule. That said, I do think it is a good idea to support categorical features.
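For readers unfamiliar with the "multiplying nodes" idea, here is a minimal, self-contained sketch of how a set-membership split can be rewritten as a chain of equality tests that give the same predictions. It is an illustration only and is not the onnxmltools code; the class names (Leaf, InSetNode, EqualNode) and the helper expand_in_set are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    value: float


@dataclass(frozen=True)
class InSetNode:
    # Hypothetical node type: goes to if_true when x[feature] is in categories.
    feature: int
    categories: frozenset
    if_true: "Node"
    if_false: "Node"


@dataclass(frozen=True)
class EqualNode:
    # Hypothetical node type: goes to if_true when x[feature] == category.
    feature: int
    category: int
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, InSetNode, EqualNode]


def expand_in_set(node):
    """Rewrite every InSetNode as a chain of EqualNode tests (one per category)."""
    if isinstance(node, Leaf):
        return node
    if_true = expand_in_set(node.if_true)
    if_false = expand_in_set(node.if_false)
    if isinstance(node, EqualNode):
        return EqualNode(node.feature, node.category, if_true, if_false)
    # Build the chain from the last category back to the first:
    # each failed equality test falls through to the next one.
    result = if_false
    for category in sorted(node.categories, reverse=True):
        result = EqualNode(node.feature, category, if_true, result)
    return result


def predict(node, x):
    """Walk the tree for a single sample x (a list of category codes)."""
    while not isinstance(node, Leaf):
        if isinstance(node, InSetNode):
            hit = x[node.feature] in node.categories
        else:
            hit = x[node.feature] == node.category
        node = node.if_true if hit else node.if_false
    return node.value


def count_equal_nodes(node):
    """Count the EqualNode tests in a tree."""
    if isinstance(node, Leaf):
        return 0
    own = 1 if isinstance(node, EqualNode) else 0
    return own + count_equal_nodes(node.if_true) + count_equal_nodes(node.if_false)


tree = InSetNode(0, frozenset({1, 3}), Leaf(1.0), Leaf(-1.0))
expanded = expand_in_set(tree)
```

In this toy example, one set test over two categories becomes two equality nodes, so the converted tree grows with the size of each category set. That growth is the cost a native set-membership rule in onnx would avoid.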

xadupre commented 6 months ago

The right way of doing it is to implement the latest onnx specification (https://github.com/onnx/onnx/pull/5874) and then to update onnxruntime to support it.

ogencoglu commented 4 months ago

The problem with one-hot encoding is that histogram gradient boosting might learn weird interactions between the one-hot encoded columns during modeling. Therefore, it might not be the same as specifying that feature as categorical in the model definition with categorical_features: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor

maximilianeber commented 4 months ago

> The right way of doing it is to implement the latest onnx specification (onnx/onnx#5874) and then to update onnxruntime to support it.

Sorry for being so late in replying. Sadly, we haven't found the capacity to contribute upstream this quarter. 👎

> Therefore, it might not be the same as specifying that feature as categorical in model definition with categorical_features

Agreed. The other downside of one-hot encoding is that you need a lot of memory when the cardinality of the categorical feature(s) is high.
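A rough back-of-the-envelope sketch of the memory point, assuming the encoded matrix is stored densely as float32 (4 bytes per value). The row count and cardinality below are made-up numbers for illustration, and dense_bytes is a hypothetical helper, not part of any library.

```python
def dense_bytes(n_rows, n_columns, bytes_per_value=4):
    """Size in bytes of a dense n_rows x n_columns matrix (float32 by default)."""
    return n_rows * n_columns * bytes_per_value


# Illustrative numbers only: one categorical feature with 10,000 distinct values.
n_rows = 1_000_000
cardinality = 10_000

one_hot = dense_bytes(n_rows, cardinality)  # one column per category
ordinal = dense_bytes(n_rows, 1)            # a single column of category codes

print(f"one-hot: {one_hot / 1e9:.1f} GB")   # one-hot: 40.0 GB
print(f"ordinal: {ordinal / 1e6:.1f} MB")   # ordinal: 4.0 MB
```

Under these assumptions the one-hot representation is larger by a factor equal to the cardinality. A sparse one-hot matrix would be much smaller, so the estimate only applies where the model needs dense input.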

adityagoel4512 commented 1 month ago

> The right way of doing it is to implement the latest onnx specification (onnx/onnx#5874) and then to update onnxruntime to support it.

I think an update to onnxruntime is pending review :)