Open RAMitchell opened 1 year ago
At this stage I think the path of least resistance is to output the Shapley values only for the positive class. This is not ideal because generally we want Shapley values to add up to the normal prediction output, e.g. `shapley_values.sum(axis=-1) == prediction_output`.
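For reference, a minimal sketch of the additivity check being referred to, using an XGBoost regression model because its `pred_contribs` output exposes the property directly (the data and parameters here are invented purely for illustration):

```python
import numpy as np
import xgboost as xgb

# Toy data purely for illustration.
X = np.random.rand(100, 5)
y = np.random.rand(100)

booster = xgb.train(
    {"objective": "reg:squarederror"},
    xgb.DMatrix(X, label=y),
    num_boost_round=10,
)

dmat = xgb.DMatrix(X)
# Per-feature contributions plus a trailing bias column: shape (100, 6).
shapley_values = booster.predict(dmat, pred_contribs=True)
# Raw (untransformed) model output.
prediction_output = booster.predict(dmat, output_margin=True)

# The property we want to keep: contributions sum to the prediction output.
assert np.allclose(shapley_values.sum(axis=-1), prediction_output, atol=1e-4)
```

For an sklearn binary classifier the analogous property would need to hold per class, i.e. one set of contributions for each of the two probability columns, which is exactly what gets lost when only the positive class is emitted.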
> There is no way to detect if the model is a regression or binary classification model from the information given in treelite
Would it be useful if Treelite stored a flag to indicate whether the model is a regression model? I can get it in for Treelite 4.0.
Yes, this is a good idea. This is currently only a problem for the random forest models. In the case of xgboost we can tell from the output transformation that it is classification, but random forest classification uses the identity transform, so we can't actually tell the difference.
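To make the distinction concrete, here is a purely illustrative numpy sketch (not Treelite code, numbers invented) of the two output transforms:

```python
import numpy as np

# xgboost binary:logistic: the summed leaf values (raw margin) pass through a
# sigmoid, so seeing the sigmoid transform implies classification.
raw_margin = np.array([-1.2, 0.3, 2.5])
xgb_binary_prob = 1.0 / (1.0 + np.exp(-raw_margin))

# sklearn random forest classification: per-tree class probabilities are just
# averaged (identity transform), which looks the same as averaging the
# outputs of a multi-output regression forest.
per_tree_proba = np.array([[0.2, 0.8],
                           [0.4, 0.6],
                           [0.3, 0.7]])
rf_prob = per_tree_proba.mean(axis=0)
```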
I think this will become a problem in the future for multi-output regression models as well. The current implementation assumes that all multi-output models are classification; it may be helpful for downstream applications to differentiate this.
In the case of a binary classification model from sklearn, we expect output for both the positive and negative classes (this would be consistent with the normal prediction output). Because the model is transferred in the treelite format with num_classes set to 1 in the xgboost style, the Shapley values are written as a single column. There is no way to detect if the model is a regression or binary classification model from the information given in treelite, so we cannot just mirror the output to correct the result on the triton-fil side without also causing this to happen for every regression model.
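For what it's worth, a hedged sketch of what the mirroring would look like, assuming the single column of contributions (with a trailing bias term) corresponds to the positive-class probability of a binary classifier; the function name and output layout are my own, not anything in triton-fil:

```python
import numpy as np

def mirror_binary_contribs(pos_contribs: np.ndarray) -> np.ndarray:
    """pos_contribs has shape (n_samples, n_features + 1), last column = bias.

    Since p(negative) = 1 - p(positive), the negative-class feature
    contributions are the negated positive-class contributions and the
    negative-class bias is 1 minus the positive-class bias.
    """
    neg_contribs = -pos_contribs                      # negation returns a new array
    neg_contribs[:, -1] = 1.0 - pos_contribs[:, -1]   # only the bias term differs
    # One set of contributions per class: shape (n_samples, 2, n_features + 1).
    return np.stack([neg_contribs, pos_contribs], axis=1)
```

The catch is exactly the one above: a regression model arrives with the same num_classes == 1 shape, so without the proposed flag there is no way to know when this is safe to apply.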