microsoft / hummingbird

Hummingbird compiles trained ML models into tensor computation for faster inference.
MIT License

incorrect prediction from torchscript model converted from xgboost classifier trained with multi-label dataset #719

Closed louis-huang closed 9 months ago

louis-huang commented 1 year ago

Hi, I tried to convert an XGBoost model trained with multi-label data, but the prediction is not correct. I have a notebook here.

I'm actually not sure whether multi-label is supported in Hummingbird; please help verify this. Thank you!

interesaaat commented 12 months ago

Hi @louis-huang! Hummingbird supports multiclass for xgboost. You can check a test here.
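For reference, here is a minimal sketch of the supported multiclass path (the dataset and parameters are illustrative, not from the original notebook):

```python
import numpy as np
from xgboost import XGBClassifier
from hummingbird.ml import convert

# Toy multiclass data: 3 classes, one label per row.
X = np.random.rand(200, 10).astype(np.float32)
y = np.random.randint(0, 3, size=200)

model = XGBClassifier(n_estimators=10, max_depth=3)
model.fit(X, y)

# Convert to TorchScript; the "torch.jit" backend needs sample input for tracing.
hb_model = convert(model, "torch.jit", X)

# The converted model should reproduce the original probabilities.
np.testing.assert_allclose(model.predict_proba(X), hb_model.predict_proba(X),
                           rtol=1e-5, atol=1e-5)
```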

Hummingbird does not support categorical values for tree models. Can it be that the dataset has categorical features?

louis-huang commented 12 months ago

Hi @interesaaat, I want to try a multi-label model, not a multi-class one. This is an example from XGBoost: https://xgboost.readthedocs.io/en/stable/tutorials/multioutput.html
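For context, multi-label training in XGBoost looks roughly like this (a sketch following the linked tutorial, with synthetic data):

```python
from sklearn.datasets import make_multilabel_classification
from xgboost import XGBClassifier

# y has shape (n_samples, n_labels); several labels can be 1 on the same row.
X, y = make_multilabel_classification(n_samples=200, n_classes=5, random_state=0)

# XGBoost (>= 1.6) trains one ensemble per label under the hood.
model = XGBClassifier(tree_method="hist", n_estimators=10)
model.fit(X, y)

print(model.predict(X[:3]))  # one 0/1 prediction per label, not a single class
```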

gorkemozkaya commented 12 months ago

@louis-huang The tree representations for a multi-class model and a multi-label classification model are essentially the same: a separate tree ensemble for each class. So Hummingbird's multi-class support should work in theory, though some preprocessing may be necessary. Internally, XGBoost builds one model for each target, similar to sklearn meta-estimators, with the added benefit of reusing data and other integrated features like SHAP.
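One way to see the per-class ensembles, reusing the multi-label model from the sketch above (get_dump is the standard Booster API):

```python
# Each boosting round adds one tree per label, so the flat dump should
# contain n_estimators * n_labels trees in total.
booster = model.get_booster()
print(len(booster.get_dump()))  # e.g. 10 estimators * 5 labels = 50 trees
```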

interesaaat commented 12 months ago

Thanks @gorkemozkaya for chiming in. We already have some infra for multi-label from sklearn multioutput regression, so I feel it shouldn't be too hard to make it work for xgboost as well. Contributions welcome! 😄

gorkemozkaya commented 12 months ago

@interesaaat @louis-huang I think the problem is that Hummingbird normalizes (i.e., applies softmax to) each output row so that the probabilities add up to 1. If we could remove that normalization, it would support multi-label. For now, I propose a workaround: separate the n-way multi-label classifier into n independent binary classifiers, and the output probabilities then match: notebook link
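A sketch of that workaround (the helper names are illustrative; it trains one binary XGBClassifier per label column and converts each one separately):

```python
import numpy as np
from xgboost import XGBClassifier
from hummingbird.ml import convert

def convert_per_label(X, y):
    """Train and convert one binary classifier per label column of y."""
    converted = []
    for j in range(y.shape[1]):
        clf = XGBClassifier(n_estimators=10, max_depth=3)
        clf.fit(X, y[:, j])
        converted.append(convert(clf, "torch", X))
    return converted

def predict_proba_multilabel(models, X):
    # Column 1 of each binary predict_proba is P(label == 1).
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])
```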

interesaaat commented 12 months ago

We support different post-transformations, so maybe setting POST_TRANSFORM to None will remove the normalization step.

gorkemozkaya commented 12 months ago

Thanks @interesaaat, this was helpful! I verified that the outputs match if we override the default post-transform by passing the extra_config = {'post_transform': 'LOGISTIC'} argument and then taking the last n_classes columns of the output.

By default it uses the SOFTMAX post_transform, which is not the right transform for multi-label. The library should decide the transform based on the objective attribute of the XGBoost classification model, i.e., multi:softprob should map to SOFTMAX, whereas binary:logistic should map to LOGISTIC. But that needs a slightly different version of LOGISTIC that does not double the number of output columns. For now, we can just take the last n_classes columns.
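Putting that together, a sketch of the verified workaround, reusing model, X, and y from the multi-label sketch above (the slicing assumes the LOGISTIC transform emits both class columns per label, hence twice as many output columns):

```python
from hummingbird.ml import convert

# Override the default SOFTMAX post-transform, as discussed above.
hb_model = convert(model, "torch", X,
                   extra_config={"post_transform": "LOGISTIC"})

proba = hb_model.predict_proba(X)

# LOGISTIC currently doubles the output columns, so keep only the
# per-label positive-class probabilities in the last n_labels columns.
n_labels = y.shape[1]
proba = proba[:, -n_labels:]
```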

louis-huang commented 12 months ago

Thanks for pointing out the post_transform, @interesaaat. This helps us a lot! And thank you @gorkemozkaya for verifying this and providing suggestions!