Open maximilianeber opened 11 months ago
After some digging, I think this might be related to missing categorical support — everything works as expected when using one-hot encoding in the preprocessor.
@xadupre I am happy to try filing a PR if you think it's a good idea to add support for categoricals. Wdyt?
I have not checked their implementation recently, but if scikit-learn supports categories the same way lightgbm does, I guess they use the rule `if x in set(cat1, cat2, ...)`, which is not supported by onnx. onnxmltools deals with that case by multiplying nodes (https://github.com/onnx/onnxmltools/blob/main/onnxmltools/convert/lightgbm/operator_converters/LightGbm.py#L841), but the best way would be to update onnx to support that rule. That said, I do think it is a good idea to support categorical features.
The right way of doing it is to implement the latest onnx specifications (https://github.com/onnx/onnx/pull/5874) and then to update onnxruntime to support it.
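To illustrate what "multiplying nodes" means here, below is a minimal sketch in plain Python. It is not the onnxmltools code: the classes `Leaf`, `SetSplit`, `EqSplit` and the helpers `expand` and `predict` are hypothetical names made up for this example. It assumes the target format can only express a single equality test per node, so a split of the form `x in {c1, c2, ...}` is rewritten as a chain of equality tests that all lead to the same left branch.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: float


@dataclass
class SetSplit:
    """Categorical split: go left if x[feature] is in `categories`."""
    feature: int
    categories: frozenset
    left: "Node"
    right: "Node"


@dataclass
class EqSplit:
    """Split with a single equality test: go left if x[feature] == category."""
    feature: int
    category: int
    left: "Node"
    right: "Node"


Node = Union[Leaf, SetSplit, EqSplit]


def expand(node):
    """Rewrite every SetSplit as a chain of EqSplit nodes (hypothetical helper)."""
    if isinstance(node, Leaf):
        return node
    left = expand(node.left)
    right = expand(node.right)
    if isinstance(node, EqSplit):
        return EqSplit(node.feature, node.category, left, right)
    # Build the chain from the last test backwards: a match goes left,
    # a miss falls through to the next test, and the final miss goes right.
    chain = right
    for cat in sorted(node.categories, reverse=True):
        chain = EqSplit(node.feature, cat, left, chain)
    return chain


def predict(node, x):
    """Walk the tree for one sample `x` (a sequence of feature values)."""
    while not isinstance(node, Leaf):
        if isinstance(node, SetSplit):
            go_left = x[node.feature] in node.categories
        else:
            go_left = x[node.feature] == node.category
        node = node.left if go_left else node.right
    return node.value


tree = SetSplit(0, frozenset({1, 3, 5}), Leaf(1.0), Leaf(-1.0))
expanded = expand(tree)
# Both trees give the same prediction for every category value.
same = all(predict(tree, [c]) == predict(expanded, [c]) for c in range(7))
```

One split over three categories becomes three equality nodes here, so the tree grows with the size of each category set. That growth is why native support for set membership in the spec looks like the cleaner fix.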
The problem with one-hot encoding is that histogram gradient boosting might learn weird interactions between the one-hot encoded features during modeling. Therefore, it might not be the same as specifying that feature as categorical in the model definition with `categorical_features`:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor
> The right way of doing it is to implement the latest onnx specifications (onnx/onnx#5874) and then to update onnxruntime to support it.
Sorry for being so late in replying. Sadly, we haven't found the capacity to contribute upstream this quarter. 👎
> Therefore, it might not be the same as specifying that feature as categorical in model definition with `categorical_features`
Agreed. The other downside of one-hot encoding is that you need a lot of memory when the cardinality of the categorical feature(s) is high.
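For a rough sense of scale, here is a back-of-the-envelope sketch. The helper names and the numbers are made up for illustration, and it assumes a dense float64 one-hot matrix; a sparse encoding would need far less.

```python
def dense_one_hot_bytes(n_rows, cardinality, itemsize=8):
    """Bytes for a dense one-hot matrix: one column per category."""
    return n_rows * cardinality * itemsize


def single_column_bytes(n_rows, itemsize=8):
    """Bytes for one column holding an integer code per row."""
    return n_rows * itemsize


# Hypothetical dataset: 1 million rows, one feature with 1,000 categories.
one_hot = dense_one_hot_bytes(1_000_000, 1_000)   # 8_000_000_000 bytes (8 GB)
encoded = single_column_bytes(1_000_000)          # 8_000_000 bytes (8 MB)
ratio = one_hot // encoded                        # 1000
```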
> The right way of doing it is to implement the latest onnx specifications (onnx/onnx#5874) and then to update onnxruntime to support it.
I think an update to onnxruntime is pending review :)
Hi,
I am trying to build a standard pipeline for tabular data that works nicely with ONNX. Ideally, the pipeline would:
To keep debugging simple, I have built a pipeline that covers points 1-3. Preprocessing works fine, but `HistGradientBoostingClassifier` returns different predictions (see gist).
Any ideas why this might happen? Are there known issues with `HistGradientBoostingClassifier`?
Thank you!
Package versions: