Ordinal vs nominal for categorical variables

tatianapolush commented 10 months ago

Dear Interpretml team, thank you very much for your great work and a very useful package.

I have a question regarding ‘ordinal’ vs ‘nominal’ (please see the picture). This variable should be treated as ‘nominal’, and it was specified as such, but from the plot it looks like all categories are sorted in alphabetical order anyway. This is an example; my original labels are also displayed sorted, and it seems that, after sorting, classes with low frequencies get the same scores as a preceding class in the alphabetical order.

The original class names are not based on any logic in the data, and I got a ‘sorted’, ‘ordinal’-looking picture even with randomly generated names.

Q1: Could you please recommend how to make sure that the scores for rare classes are correct? I tried to run with min_samples_leaf = 2, but it still seems that low-frequency classes get different scores depending on the sorting.

Q2: Is it expected for 'nominal' variables to be sorted as 'ordinal'?
Thank you.

[Image: example_lowfrq_cat]

paulbkoch commented 10 months ago

Thanks @tatianapolush -- Glad to hear it has been useful to you.

When specifying the feature_types, are you passing in a list of the categories for the feature like ["A", "B", "C", "D", ...] or a string like "nominal"? When you specify "nominal" there is no option to order the categories since the category ordering is not supposed to be relevant, but we do need to choose some ordering for the UI, so by default we sort them alphabetically since the ordering in the dataset isn't useful either. One little wrinkle in the current EBM implementation is that the ordering does in fact currently matter, and adjacency causes this bleeding effect if you have small bins. LightGBM and I think XGBoost use Fisher to find splits in categoricals without this bleeding effect, but that is something we have yet to implement.
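To make the adjacency point concrete, here is a minimal sketch in plain Python. It uses made-up category labels and counts, and only illustrates which categories end up next to a rare one when the labels are sorted alphabetically; it does not call interpret itself. The two `feature_types` values at the end reflect my understanding of the documented options (a `"nominal"` string versus an explicit list of category strings, which is treated as ordinal in the given order), with one entry per feature.

```python
from collections import Counter

# Hypothetical labels for a single nominal feature (illustrative data only).
values = ["B"] * 50 + ["A"] * 40 + ["D"] * 30 + ["C"] * 2
counts = Counter(values)

# With "nominal", categories are laid out alphabetically (as described above).
alphabetical = sorted(counts)


def neighbours(order, category):
    """Return the categories adjacent to `category` in the given ordering."""
    i = order.index(category)
    return [order[j] for j in (i - 1, i + 1) if 0 <= j < len(order)]


# The rarest category is the one most exposed to bleeding from its neighbours.
rare = min(counts, key=counts.get)
print(alphabetical)                          # ['A', 'B', 'C', 'D']
print(rare, neighbours(alphabetical, rare))  # C ['B', 'D']

# Two ways to describe this one feature to an EBM (one list entry per feature):
feature_types_nominal = ["nominal"]             # order chosen by the library
feature_types_ordinal = [["A", "B", "C", "D"]]  # explicit order, treated as ordinal
```

Here the rare category "C" sits between "B" and "D" purely because of its name, which is the situation in which the bleeding described above can show up.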

I do have 2 recommendations to try though: 1) There is a non-public option called "nominalprevalence" that will sort the categories by prevalence. I'm not sure what effect this will have in terms of improving accuracy, but give it a try. 2) We also have a "greediness" option in our __init__. Try setting it initially to something like 0.75, then if you have time try some higher and lower hyperparameters from there. In some cases where you have high variance features alongside low variance ones I think the default cyclic boosting algorithm stops too early for the high variance features. Boosting with greediness should improve that.
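Since the prevalence option is non-public, one possible workaround is to relabel the categories before fitting so that the default alphabetical sort follows frequency. The sketch below is an assumption on my part, not a library feature: `prevalence_labels` is a hypothetical helper and the data is made up. It prefixes each label with a zero-padded frequency rank.

```python
from collections import Counter


def prevalence_labels(values):
    """Map each category to a label whose alphabetical order follows frequency."""
    counts = Counter(values)
    # Most frequent first; ties are broken by the original name for determinism.
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    # Zero-pad the rank so that string sorting matches numeric sorting.
    width = len(str(len(ranked)))
    return {cat: f"{rank:0{width}d}_{cat}" for rank, cat in enumerate(ranked, start=1)}


# Hypothetical labels for a single nominal feature (illustrative data only).
values = ["B"] * 50 + ["A"] * 40 + ["D"] * 30 + ["C"] * 2
mapping = prevalence_labels(values)
relabelled = [mapping[v] for v in values]
print(sorted(set(relabelled)))  # ['1_B', '2_A', '3_D', '4_C']
```

With this relabelling, rare categories end up next to other rare categories at the end of the order. Whether that actually reduces the bleeding or improves accuracy would need to be checked on the real data.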

If you find either of these options useful, please let us know. This is still a newer topic that we could benefit from having feedback on.

tatianapolush commented 10 months ago

Thank you very much for your help and your fast reply! I will explore more and write back.

antflyinginsectsauce commented 1 month ago

@paulbkoch I am running into the same issue, where I have a nominal feature that is treated as ordinal alphabetically. I have about 70 categories and 600,000 training points, but the data is very skewed so I think that's too much and is causing the bleeding effect. Questions:

1) Should I be concerned that this bleeding is also happening for categories like month of year (12 values) or day of week (7)? From these graphs it is a bit harder to spot if it's happening.
2) What do you recommend?

I've looked at your suggestions:

1) What do you mean exactly? I've tried setting the feature type as nominal_prevalence but that gives an error.
2) Is that the same as the greedy_ratio parameter?

Many thanks in advance! I'm really enthusiastic about the results so far, thanks for your work on the package!

paulbkoch commented 1 month ago

Hi @antflyinginsectsauce -- The bleeding effect is definitely less of an issue in features with fewer categories, at least when all the categories have sufficient samples. Like you, I've also observed that it doesn't seem to have much effect with 10-12 categories, but when you get into the range of 30-50 categories it's more prominent.

Implementing categorical handling through Fisher and maybe some of the other more advanced options available in CatBoost is a high priority at the moment. It's probably the single biggest model performance improvement left at this point for EBMs.

In the meantime, another option you could try is to use one-hot encoding. We generally don't recommend this because it's less human-understandable, but if it resolves the bleeding effect in the short term then maybe it's a net interpretability win.
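As a rough illustration of the one-hot idea, the sketch below expands a single categorical column into one 0/1 indicator column per category. It is plain Python with made-up data, and `one_hot` is a hypothetical helper written for this example; in practice something like `pandas.get_dummies` is the usual tool.

```python
def one_hot(values, categories=None):
    """Return (column_names, rows) with one 0/1 indicator column per category."""
    if categories is None:
        categories = sorted(set(values))
    columns = [f"cat_{c}" for c in categories]
    # Each row has a single 1 in the column matching its category.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


columns, rows = one_hot(["A", "C", "B", "A"])
print(columns)  # ['cat_A', 'cat_B', 'cat_C']
print(rows)     # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Each indicator becomes its own feature, so no category has a neighbour to bleed into. The cost is one term per category in the explanation instead of a single bar chart.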

On the greediness question, yes, in the latest code this parameter is now called greedy_ratio. The other change we've made is to turn it on by default, so if you're using one of the more recent releases it's already on.

antflyinginsectsauce commented 1 month ago

Hi @paulbkoch Thanks for your reply! I'm pleased to hear that this is a high priority.

I've been trying binary encoding. I then add up the individual effects to turn it back into a bar chart for all 70 categories to make it interpretable again. It increases the accuracy compared to both not using the feature at all and using the feature as-is. However, interpretability of results is a bit inconsistent between runs.
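A minimal sketch of that binary-encoding approach might look like the following. It assumes each bit is fed to the model as a separate feature and that the model yields one additive score per bit value; the `bit_scores` numbers are placeholders invented for illustration, not real model output, and interactions between bits are ignored.

```python
def binary_encode(index, n_bits):
    """Return the bits of `index`, most significant first."""
    return [(index >> b) & 1 for b in reversed(range(n_bits))]


categories = ["A", "B", "C", "D", "E"]
# Number of bits needed to give every category a distinct code (3 for 5 categories).
n_bits = max(1, (len(categories) - 1).bit_length())

# Placeholder per-bit contributions: bit_scores[k][v] is the additive score
# when bit k takes value v. Real values would come from the fitted model.
bit_scores = [[-0.5, 0.5], [0.25, -0.25], [0.0, 1.0]]


def category_score(index):
    """Sum the per-bit contributions for the category with this index."""
    bits = binary_encode(index, n_bits)
    return sum(bit_scores[k][bit] for k, bit in enumerate(bits))


# One combined score per original category, suitable for a single bar chart.
per_category = {c: category_score(i) for i, c in enumerate(categories)}
print(per_category)
```

Because the bit assignment depends on the category-to-index mapping, a different mapping or a different run can distribute the effect across bits differently, which may be one reason the reconstructed chart varies between runs.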

I've tried one-hot encoding for the biggest categories, but it decreased accuracy. I'll try one-hot encoding all categories, but that's going to be a looooooot of features so I'm not super hopeful. :) I'll keep you posted!

interpretml / interpret

Ordinal vs nominal for categorical variables #499