mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License

FacetGrid fails if dataframe contains unused categories #929

Closed mikepqr closed 8 years ago

mikepqr commented 8 years ago

FacetGrid fails with an unhelpful matplotlib exception if the column being conditioned on is a pandas categorical and not all of its categories are used.

Here's an example:

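import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def neighborhood(borough):
    # Each borough owns four neighborhoods; pick one of them at random
    neighborhoods = {'A': lambda: np.random.choice(list('abcd')),
                     'B': lambda: np.random.choice(list('efgh')),
                     'C': lambda: np.random.choice(list('ijkl')),
                     'D': lambda: np.random.choice(list('mnop'))}
    return neighborhoods[borough]()

n = 10000
df = pd.DataFrame({'val': np.random.randn(n),
                   'borough': np.random.choice(list("ABCD"), size=n)})
# Assign every row a neighborhood consistent with its borough
df['neighborhood'] = df['borough'].apply(neighborhood)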

This sets up a dataframe with a column of random float values plus borough and neighborhood columns (both strings).

g = sns.FacetGrid(df.query("borough == 'A'"), col='neighborhood', col_wrap=4)
g.map(plt.hist, 'val')

works fine, which is to say 4 plots are generated, one for each of the neighborhoods in borough "A".

But if we convert neighborhood to a categorical, it fails, presumably because the DataFrame passed to FacetGrid has a mismatch between the categories that actually appear in the data and the contents of df['neighborhood'].cat.categories:

df['neighborhood'] = df['neighborhood'].astype('category')
g = sns.FacetGrid(df.query("borough == 'A'"), col='neighborhood', col_wrap=4)
g.map(plt.hist, 'val')

with the exception:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-1275d2e2ef07> in <module>()
----> 1 g = sns.FacetGrid(df.query("borough == 'A'"), col='neighborhood', col_wrap=4)
      2 g.map(plt.hist, 'val')

/Users/mike/.virtualenvs/ds3/lib/python3.5/site-packages/seaborn/axisgrid.py in __init__(self, data, row, col, hue, col_wrap, sharex, sharey, size, aspect, palette, row_order, col_order, hue_order, hue_kws, dropna, legend_out, despine, margin_titles, xlim, ylim, subplot_kws, gridspec_kws)
    321                 subplot_kws["sharey"] = axes[0]
    322             for i in range(1, n_axes):
--> 323                 axes[i] = fig.add_subplot(nrow, ncol, i + 1, **subplot_kws)
    324             self.axes = axes
    325 

/Users/mike/.virtualenvs/ds3/lib/python3.5/site-packages/matplotlib/figure.py in add_subplot(self, *args, **kwargs)
   1003                     self._axstack.remove(ax)
   1004 
-> 1005             a = subplot_class_factory(projection_class)(self, *args, **kwargs)
   1006 
   1007         self._axstack.add(key, a)

/Users/mike/.virtualenvs/ds3/lib/python3.5/site-packages/matplotlib/axes/_subplots.py in __init__(self, fig, *args, **kwargs)
     62                     raise ValueError(
     63                         "num must be 1 <= num <= {maxn}, not {num}".format(
---> 64                             maxn=rows*cols, num=num))
     65                 self._subplotspec = GridSpec(rows, cols)[int(num) - 1]
     66                 # num - 1 for converting from MATLAB to python indexing

ValueError: num must be 1 <= num <= 4, not 5

It looks to me like the problem is that categorical_order() is returning all the categories, including the unused ones.

Changing categorical_order() to return only used categories fixes the problem, in the sense that I get the same result whether or not the column is categorical.

Given the semantics of pandas categories, you could make the case that seaborn should build an axes for every category, including the unused ones.

Should I submit a PR that plots only used categories? Or does someone smarter than me want to figure out how to make plots for the unused categories?
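
To make the mismatch described above concrete, here is a minimal sketch using only pandas. It assumes a tiny four-row frame rather than the 10000-row one from the report, and it uses a boolean mask instead of query(); the variable names are illustrative. It shows that filtering rows does not shrink the category list, and that remove_unused_categories() is one way to trim it on the caller's side.

```python
import pandas as pd

# Tiny stand-in for the dataframe in the report (assumed data, not the original)
df = pd.DataFrame({
    "borough": ["A", "A", "B", "B"],
    "neighborhood": ["a", "b", "e", "f"],
})
df["neighborhood"] = df["neighborhood"].astype("category")

# Keep only borough 'A'; the categorical dtype is carried over unchanged
subset = df[df["borough"] == "A"]

# Every category declared on the column, whether or not it still appears
all_categories = list(subset["neighborhood"].cat.categories)

# Only the values actually present in the filtered rows
observed = list(subset["neighborhood"].unique())

# Caller-side workaround: drop the categories that no longer appear
trimmed = list(
    subset["neighborhood"].cat.remove_unused_categories().cat.categories
)

print(all_categories, observed, trimmed)
```

Here all_categories still lists the neighborhoods of borough 'B', while observed and trimmed contain only those of borough 'A'.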

mwaskom commented 8 years ago

I think this is just an inconsistency where https://github.com/mwaskom/seaborn/blob/master/seaborn/axisgrid.py#L274 should calculate the length of the col names and not the length of the unique categories.

mwaskom commented 8 years ago

And the fact that categorical_order returns unused categories is very much by design, so that should not be changed.

mikepqr commented 8 years ago

Fair enough. Can confirm

nrow = int(np.ceil(len(col_names) / col_wrap))

works for me. Would you like a PR?
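
The arithmetic behind the failure and the fix can be sketched as follows. This is a simplified model rather than the seaborn source: it assumes, based on the traceback and the comments above, that the grid is sized from nrow and ncol while one subplot is requested per entry in col_names. The helper first_bad_index and the variable observed are hypothetical names introduced for illustration.

```python
import math

col_wrap = 4
observed = list("abcd")                # values present after filtering to borough 'A'
col_names = list("abcdefghijklmnop")   # all 16 categories, used and unused

ncol = col_wrap
n_axes = len(col_names)                # one subplot is requested per category

# Before the fix: rows are computed from the observed values only
nrow_before = math.ceil(len(observed) / col_wrap)

# After the fix: rows are computed from the full list of column names
nrow_after = math.ceil(len(col_names) / col_wrap)


def first_bad_index(nrow, ncol, n_axes):
    """Return the first 1-based subplot index that does not fit, or None."""
    capacity = nrow * ncol
    return capacity + 1 if n_axes > capacity else None


print(nrow_before, first_bad_index(nrow_before, ncol, n_axes))
print(nrow_after, first_bad_index(nrow_after, ncol, n_axes))
```

With one row of four columns the fifth subplot has nowhere to go, which matches the "num must be 1 <= num <= 4, not 5" message in the traceback; with four rows all 16 subplots fit.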

mwaskom commented 8 years ago

Yes that would be great. It would be doubly-good if you can add a small test that fails with the current code but passes with the fix.

mikepqr commented 8 years ago

Thanks. PR sent!