interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License

Ordinal vs nominal for categorical variables #499

Open tatianapolush opened 4 months ago

tatianapolush commented 4 months ago

Dear Interpretml team, thank you very much for your great work and a very useful package.

I have a question about 'ordinal' vs 'nominal' (please see the picture). This variable was specified as 'nominal' and should be treated as such, but in the plot all categories appear sorted in alphabetical order anyway. This is just an example; my original labels are also displayed sorted, and it seems that, after sorting, classes with low frequencies get the same scores as the preceding class in alphabetical order.

The original class names do not follow any data-driven logic, and I still got a 'sorted', ordinal-looking plot even with randomly generated names.

Q1: Could you please recommend how to make sure the scores for rare classes are correct? I tried running with min_samples_leaf = 2, but low-frequency classes still seem to get different scores depending on the sort order.

Q2: Is it expected for 'nominal' variables to be sorted as 'ordinal'?
Thank you.

[Screenshot: example_lowfrq_cat]

paulbkoch commented 4 months ago

Thanks @tatianapolush -- Glad to hear it has been useful to you.

When specifying the feature_types, are you passing in a list of the categories for the feature like ["A", "B", "C", "D", ...] or the string "nominal"? When you specify "nominal" there is no option to order the categories, since the ordering is not supposed to be relevant. We do need to choose some ordering for the UI, though, and since the ordering in the dataset isn't useful either, by default we sort the categories alphabetically. One wrinkle in the current EBM implementation is that the ordering does in fact matter: adjacency currently causes this bleeding effect if you have small bins. LightGBM, and I think XGBoost, use the Fisher method to find splits in categoricals without this bleeding effect, but that is something we have yet to implement.
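To make the ordering behavior concrete, here is a minimal pure-Python sketch of what "sort alphabetically for display" means for a nominal feature. This is an illustrative stand-in, not interpret's actual internal code:

```python
# For a "nominal" feature there is no meaningful order, so some ordering
# must still be chosen for the UI. The default described above is to sort
# the distinct category labels alphabetically, regardless of their order
# in the dataset. Illustrative sketch only, not interpret's implementation.

def display_order(categories):
    """Return the distinct categories in alphabetical (display) order."""
    return sorted(set(categories))

labels = ["Berlin", "Athens", "Cairo", "Athens", "Berlin"]
print(display_order(labels))  # ['Athens', 'Berlin', 'Cairo']
```

Because adjacent categories in this display order can currently bleed into each other when bins are small, rare categories may inherit scores from whichever category happens to sort next to them alphabetically.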

I do have two recommendations to try though: 1) There is a non-public option called "nominalprevalence" that will sort the categories by prevalence. I'm not sure what effect this will have in terms of improving accuracy, but give it a try. 2) We also have a "greediness" option in our __init__. Try setting it initially to something like 0.75, then if you have time try some higher and lower values from there. In cases where you have high-variance features alongside low-variance ones, I think the default cyclic boosting algorithm stops too early for the high-variance features. Boosting with some greediness should improve that.
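For recommendation 1, a prevalence-based ordering places the most frequent categories next to each other, so rare categories no longer sit beside arbitrary alphabetical neighbors. Here is a minimal pure-Python sketch of that ordering; this is an illustrative stand-in for the effect described, not interpret's implementation of "nominalprevalence":

```python
from collections import Counter

def prevalence_order(values):
    """Order distinct categories by frequency, most common first.

    Ties are broken alphabetically so the result is deterministic.
    Illustrative sketch only, not interpret's internal code.
    """
    counts = Counter(values)
    return sorted(counts, key=lambda c: (-counts[c], c))

labels = ["B", "A", "B", "C", "B", "A"]
print(prevalence_order(labels))  # ['B', 'A', 'C']
```

Under this ordering, the low-frequency categories cluster at the end rather than being interleaved among frequent ones, which may reduce the adjacency bleeding effect for small bins.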

If you find either of these options useful, please let us know. This is still a newer topic that we could benefit from having feedback on.

tatianapolush commented 4 months ago

Thank you very much for your help and your fast reply! I will explore more and write back.