Hi @louis-huang! Hummingbird supports multiclass for xgboost. You can check a test here.
Hummingbird does not support categorical values for tree models. Can it be that the dataset has categorical features?
Hi @interesaaat, I want to try a multi-label model, not a multi-class one. This is an example from xgb: https://xgboost.readthedocs.io/en/stable/tutorials/multioutput.html
@louis-huang The tree representations for a multi-class model and a multi-label classification model are essentially the same: a separate tree ensemble for each class. So Hummingbird's multi-class path should work in theory; some preprocessing may be necessary.
Internally, XGBoost builds one model for each target, similar to sklearn meta estimators, with the added benefit of reusing data and other integrated features like SHAP.
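For reference, a minimal sketch of that multi-output training path via the sklearn interface, following the tutorial linked above (data, shapes, and hyperparameters are illustrative; requires a recent XGBoost):

```python
import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(100, 8)
Y = np.random.randint(0, 2, size=(100, 3))  # 3 independent binary labels per row

clf = XGBClassifier(n_estimators=10, tree_method="hist")
clf.fit(X, Y)                # XGBoost builds one tree ensemble per label
print(clf.predict(X).shape)  # -> (100, 3), one column per label
```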
Thanks @gorkemozkaya for chiming in. We already have some infra for multi-label for sklearn multioutput regression. Shouldn't be too hard I feel to make it work for xgboost as well. Contributions welcome! 😄
@interesaaat @louis-huang I think the problem is that Hummingbird is normalizing (i.e. applying softmax to) each output row so that the probabilities add up to 1. If we could remove that normalization, it would support multi-label. For now, I propose a workaround: separate the n-way multi-label classifier into n separate binary classifiers; the output probabilities then match: notebook link
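Here is a rough sketch of that workaround (data, shapes, and hyperparameters are illustrative, not the exact notebook code): train one binary classifier per label and convert each with Hummingbird, since binary conversion is supported.

```python
import numpy as np
from xgboost import XGBClassifier
from hummingbird.ml import convert

X = np.random.rand(200, 8).astype(np.float32)
Y = np.random.randint(0, 2, size=(200, 3))   # 3 binary labels

models = []
for j in range(Y.shape[1]):
    clf = XGBClassifier(n_estimators=20, max_depth=3)
    clf.fit(X, Y[:, j])                      # one binary classifier per label
    models.append(convert(clf, "pytorch"))

# Column j of the multi-label output is the positive-class probability of
# classifier j; these should match the original multi-label model's output.
probs = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
```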
We support different post transformations. So maybe by setting `POST_TRANSFORM` to `None` it will remove the normalization step.
Thanks @interesaaat, this was helpful! I verified that the outputs match if we override the default post-transform by passing the `extra_config = {'post_transform': 'LOGISTIC'}` argument, and then take the last `n_classes` columns of the output.
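A sketch of that override (`clf` is the multi-label XGBClassifier and `X` the input data, both assumed from the earlier snippets; the `'post_transform'` key is the one reported working here):

```python
from hummingbird.ml import convert

n_classes = 3  # number of labels in the multi-label model
hb_model = convert(clf, "pytorch", extra_config={"post_transform": "LOGISTIC"})
probs = hb_model.predict_proba(X)
# LOGISTIC doubles the output columns, so keep only the last n_classes,
# which match the original XGBoost probabilities.
multi_label_probs = probs[:, -n_classes:]
```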
By default it is using the `SOFTMAX` post-transform, which is not the right transform for multi-label. The library needs to decide the transform based on the `objective` attribute of the XGBoost classification model. I.e., it should be changed such that `multi:softprob` maps to `SOFTMAX`, whereas `binary:logistic` maps to `LOGISTIC`. But it needs a slightly different version of `LOGISTIC` that does not double the number of output columns. For now, we can just take the last `n_classes` columns.
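A hypothetical sketch of that objective-based dispatch; the function name and plumbing are illustrative, not Hummingbird's actual internals:

```python
def pick_post_transform(objective: str) -> str:
    if objective == "multi:softprob":
        return "SOFTMAX"    # multi-class: row probabilities sum to 1
    if objective == "binary:logistic":
        return "LOGISTIC"   # multi-label: independent per-label sigmoids
    raise ValueError(f"No post-transform mapping for objective: {objective}")
```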
Thanks for pointing out the `post_transform`, @interesaaat. This helps us a lot! Thank you @gorkemozkaya for verifying this and providing suggestions!
Hi, I tried to convert an XGBoost model trained with multi-label data, but the predictions are not correct. I have a notebook here.
I'm actually not sure if multi-label is supported in Hummingbird; please help verify this. Thank you!