categorical plots - unused categories mess up element spacing and width

Gabriel-Kissin commented 1 month ago

Several of seaborn's functions for plotting categorical data don't cope well when the categories list includes unused categories.

I've noticed two main issues:
1) element width shrinks, and 2) element spacing doesn't match the ticks on the categorical axis.

It doesn't make a difference if you use vertical or horizontal orientation.

The issue only occurs when the same feature is used for the categorical x/y variable and for the hue. If no hue is provided, or if the hue uses a different feature, there is no issue.

The issues occur for sns.barplot, sns.boxplot, sns.boxenplot and sns.violinplot, whereas sns.pointplot, sns.stripplot and sns.swarmplot are fine.

I've reproduced the issue with the penguins dataset we all know and love from the seaborn docs. In the following MRE, the first col is the raw penguins data. The second col is after converting it to categorical (also works fine). The final col is after adding an unused category to the data, which causes the above two issues:

It looks as though it is failing to recognise that the hue and y are the same, so it makes space on the plot within each y for all the hues. This is what makes each element a) get squeezed, and b) not align nicely with the y ticks. Presumably the unused category is somehow the cause of the confusion.

Code to generate the above plot:

import matplotlib.pyplot as plt
import seaborn as sns

penguins = sns.load_dataset("penguins")

plotters = [sns.barplot, sns.boxplot, sns.boxenplot, sns.violinplot, 
            sns.pointplot, sns.stripplot, sns.swarmplot]

# with horizontal orientation
fig, axs = plt.subplots(ncols=3, nrows=len(plotters), figsize=(16, 3*len(plotters)), sharex=True, sharey=False)
kwargs = dict(data=penguins, x="body_mass_g", y="island", hue="island", legend=False,)

# If no hue is provided, or if the hue uses a different feature, there is no issue.
# kwargs = dict(data=penguins, x="body_mass_g", y="island", hue="sex", legend=True,)
# kwargs = dict(data=penguins, x="body_mass_g", y="island", legend=False,)

# same issue with vertical orientation
# fig, axs = plt.subplots(ncols=3, nrows=len(plotters), figsize=(16, 3*len(plotters)), sharex=False, sharey=True)
# kwargs = dict(data=penguins, x="island", y="body_mass_g", hue="island", legend=False,)

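# Build the three variants once, so each column of the grid gets the same data for every plotter.
raw = penguins.copy()                      # column 0: raw data (object dtype)
categorical = penguins.copy()              # column 1: object columns converted to categorical
cat_cols = categorical.select_dtypes('O').columns
categorical[cat_cols] = categorical[cat_cols].astype('category')
unused = categorical.copy()                # column 2: categorical with one unused category
unused["island"] = unused["island"].cat.add_categories(['Uninhabited Island'])

for i, plotter in enumerate(plotters):
    axs[i, 1].set_title(plotter.__name__)
    for j, df in enumerate([raw, categorical, unused]):
        # Override the dataframe in kwargs for this column of the grid.
        plotter(ax=axs[i, j], **{**kwargs, "data": df})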

plt.tight_layout()
plt.show()

Many thanks as always for the superb library!

mwaskom commented 1 month ago

I think you want to set dodge=False here.

Gabriel-Kissin commented 1 month ago

Right - that indeed fixes it, thanks! - though perhaps the default dodge='auto' should recognise that the hue and categorical / orient variable are still the same, and therefore set dodge=False automatically?

mwaskom commented 1 month ago

Yeah — determining whether dodge is needed is a surprisingly hard problem. Here's the code that's currently doing it; not sure why it isn't working with your example.

jhncls commented 1 month ago

The reason that _dodge_needed() doesn't work as expected seems to be pandas' .value_counts() behaving differently when one or multiple columns are counted. With one column, there is a value count for each of the categories. With multiple columns, the categories are ignored, and only non-zero counts of combinations are reported.

Using the following modified dataframe for testing:

import seaborn as sns

penguins = sns.load_dataset('penguins')
penguins['island'] = penguins['island'].astype('category')
penguins['island'] = penguins['island'].cat.add_categories(['Uninhabited Island'])
penguins['hue_col'] = penguins['island']

Then penguins[['island']].value_counts() gives a series with a single index level, including a zero count for the unused category:

island            
Biscoe                168
Dream                 124
Torgersen              52
Uninhabited Island      0
Name: count, dtype: int64

And penguins[['island', 'hue_col']].value_counts() gives a series with two index levels, counting only the pairs that actually occur:

island     hue_col  
Biscoe     Biscoe       168
Dream      Dream        124
Torgersen  Torgersen     52
Name: count, dtype: int64
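
As a possible data-side workaround, the unused category can be dropped before counting (or plotting), using the same .cat.remove_unused_categories() call that appears in the MRE. This is only a sketch on a made-up frame, and how the multi-column count treats unobserved combinations may depend on the pandas version:

```python
import pandas as pd

# Toy frame (made up): "island" carries one category that never occurs.
df = pd.DataFrame({
    "island": pd.Categorical(
        ["A", "A", "B", "C"],
        categories=["A", "B", "C", "Unused"],
    ),
})
df["hue_col"] = df["island"]

# Drop categories that have no rows, in both columns, before counting.
for col in ["island", "hue_col"]:
    df[col] = df[col].cat.remove_unused_categories()

single = df[["island"]].value_counts()             # one index level
paired = df[["island", "hue_col"]].value_counts()  # two index levels
```

With no unused categories left, the single-column count has no zero rows to inflate its size.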

Changing the test in _dodge_needed() from return orient.size != paired.size to return np.count_nonzero(orient) != np.count_nonzero(paired) would probably solve the issue.
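
To illustrate the difference between the two tests, here is a small sketch. The two helper functions are hypothetical stand-ins for the expressions quoted above, not seaborn's actual code. The orient and paired_same series are built by hand from the counts printed earlier, while paired_split uses made-up counts for a hue that really does split each island:

```python
import numpy as np
import pandas as pd

def dodge_needed_by_size(orient, paired):
    # Hypothetical stand-in for the current test: compare number of rows.
    return orient.size != paired.size

def dodge_needed_by_nonzero(orient, paired):
    # Hypothetical stand-in for the proposed test: compare non-zero counts.
    return np.count_nonzero(orient) != np.count_nonzero(paired)

# Counts copied from the single-column value_counts() output above.
orient = pd.Series(
    [168, 124, 52, 0],
    index=["Biscoe", "Dream", "Torgersen", "Uninhabited Island"],
)

# Counts copied from the two-column output above (hue equals island).
paired_same = pd.Series(
    [168, 124, 52],
    index=pd.MultiIndex.from_tuples(
        [("Biscoe", "Biscoe"), ("Dream", "Dream"), ("Torgersen", "Torgersen")],
        names=["island", "hue_col"],
    ),
)

# Made-up counts: a hue with two levels ("x", "y") inside each island.
paired_split = pd.Series(
    [80, 88, 62, 62, 26, 26],
    index=pd.MultiIndex.from_product(
        [["Biscoe", "Dream", "Torgersen"], ["x", "y"]],
        names=["island", "hue_col"],
    ),
)

size_same = dodge_needed_by_size(orient, paired_same)         # 4 != 3
nonzero_same = dodge_needed_by_nonzero(orient, paired_same)   # 3 != 3
nonzero_split = dodge_needed_by_nonzero(orient, paired_split) # 3 != 6

print(size_same, nonzero_same, nonzero_split)  # True False True
```

On these inputs the size-based test asks for dodging because of the zero-count row, while the non-zero test does not, and it still reports dodging when the hue genuinely subdivides the categories.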

mwaskom / seaborn

categorical plots - unused categories mess up element spacing and width #3736