mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License

histplot with categorical values crashes with missing data, though numerical values work fine #2295

Closed mojones closed 3 years ago

mojones commented 3 years ago

Not sure if this is intended behaviour, but it caught me out due to the difference in handling numerical/categorical data. I note that drawing histograms of categorical data is labelled as experimental, so ignore/close if that explains it.

With numerical data, histplot ignores NaN and plots the other values, which is the behaviour I would expect:

import numpy as np
import seaborn as sns

sns.histplot(
    [1.1, 1.2, 1.3, 1.4, np.nan]
)

With categorical data, however, it raises an error:

import numpy as np
import seaborn as sns

sns.histplot(
    ['foo', 'foo', 'bar', np.nan]
)

# output
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)
   1519         try:
-> 1520             ret = self.converter.convert(x, self.units, self)
   1521         except Exception as e:

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in convert(value, unit, axis)
     60         # force an update so it also does type checking
---> 61         unit.update(values)
     62         return np.vectorize(unit._mapping.__getitem__, otypes=[float])(values)

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/category.py in update(self, data)
    210             # OrderedDict just iterates over unique values in data.
--> 211             cbook._check_isinstance((str, bytes), value=val)
    212             if convertible:

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/cbook/__init__.py in _check_isinstance(_types, **kwargs)
   2234         if not isinstance(v, types):
-> 2235             raise TypeError(
   2236                 "{!r} must be an instance of {}, not a {}".format(

TypeError: 'value' must be an instance of str or bytes, not a float

The above exception was the direct cause of the following exception:

ConversionError                           Traceback (most recent call last)
<ipython-input-61-b132ea7dca6c> in <module>
      2 import seaborn as sns
      3 
----> 4 sns.histplot(
      5     ['foo', 'foo', 'bar', np.nan]
      6 )

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in histplot(data, x, y, hue, weights, stat, bins, binwidth, binrange, discrete, cumulative, common_bins, common_norm, multiple, element, fill, shrink, kde, kde_kws, line_kws, thresh, pthresh, pmax, cbar, cbar_ax, cbar_kws, palette, hue_order, hue_norm, color, log_scale, legend, ax, **kwargs)
   1420     if p.univariate:
   1421 
-> 1422         p.plot_univariate_histogram(
   1423             multiple=multiple,
   1424             element=element,

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/distributions.py in plot_univariate_histogram(self, multiple, element, fill, common_norm, common_bins, shrink, kde, kde_kws, color, legend, line_kws, estimate_kws, **plot_kws)
    421 
    422         # First pass through the data to compute the histograms
--> 423         for sub_vars, sub_data in self.iter_data("hue", from_comp_data=True):
    424 
    425             # Prepare the relevant data

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in iter_data(self, grouping_vars, reverse, from_comp_data)
    965 
    966         if from_comp_data:
--> 967             data = self.comp_data
    968         else:
    969             data = self.plot_data

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/seaborn/_core.py in comp_data(self)
   1034                 axis = getattr(ax, f"{var}axis")
   1035 
-> 1036                 comp_var = axis.convert_units(self.plot_data[var])
   1037                 if axis.get_scale() == "log":
   1038                     comp_var = np.log10(comp_var)

~/.virtualenvs/drawingfromdata/lib/python3.8/site-packages/matplotlib/axis.py in convert_units(self, x)
   1520             ret = self.converter.convert(x, self.units, self)
   1521         except Exception as e:
-> 1522             raise munits.ConversionError('Failed to convert value(s) to axis '
   1523                                          f'units: {x!r}') from e
   1524         return ret

ConversionError: Failed to convert value(s) to axis units: 0    foo
1    foo
2    bar
3    NaN
Name: x, dtype: object

mwaskom commented 3 years ago

Interestingly, this works if the data are passed as a NumPy array, but fails with a list or a pandas Series.
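
A likely explanation, offered as a sketch rather than a confirmed diagnosis: `np.array` coerces mixed str/float input to a unicode dtype, so NaN becomes the string 'nan' and would pass the str/bytes type check seen in the traceback, whereas a list (or an object-dtype Series) keeps NaN as a float. The snippet below reuses the values from the report and only inspects types; it does not call seaborn or matplotlib.

```python
import numpy as np

values = ["foo", "foo", "bar", np.nan]

# In a plain list, NaN stays a Python float next to the str entries.
list_types = {type(v).__name__ for v in values}

# np.array() settles on a single dtype; with strings present it picks a
# unicode dtype and stringifies the float NaN.
arr = np.array(values)

print(sorted(list_types))  # ['float', 'str']
print(arr.dtype.kind)      # U
print(arr[-1] == "nan")    # True
```

If this is what happens, the array case "works" only because the missing value is no longer missing: it has become an ordinary category label.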

mwaskom commented 3 years ago

Note that the same basic inconsistency exists in matplotlib too:

import matplotlib.pyplot as plt
import numpy as np

plt.bar(["a", "b", np.nan], [1, 2, 3])  # Fails, same error
plt.bar(np.array(["a", "b", np.nan]), [1, 2, 3])  # Succeeds

Interestingly,

plt.plot(["a", "b", np.nan], [1, 2, 3])

succeeds, but shows nan as a category, which is not what I would expect.

mwaskom commented 3 years ago

I think that supporting categorical data with missing values will require either upstream changes in matplotlib or seaborn defining its own converters that handle missing data properly. While I think the latter route may be necessary for planned updates to the categorical plotting module (which predates any support for categorical data in matplotlib), a downside would be reduced interoperability between seaborn and matplotlib plots.
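
In the meantime, a caller-side workaround is to drop missing entries before plotting. This is a minimal sketch and not how seaborn itself handles missing data; `drop_missing` is a hypothetical helper name introduced here for illustration, and it assumes pandas is available.

```python
import numpy as np
import pandas as pd


def drop_missing(values):
    """Return `values` as a Series with NaN/None entries removed.

    Hypothetical helper for illustration; not part of seaborn or matplotlib.
    """
    # dtype=object keeps the str entries as-is; dropna() removes NaN and None.
    return pd.Series(values, dtype=object).dropna().reset_index(drop=True)


clean = drop_missing(["foo", "foo", "bar", np.nan])
print(clean.tolist())  # ['foo', 'foo', 'bar']
```

The cleaned Series can then be passed to `sns.histplot(clean)`. This mirrors what histplot already does for numerical input, but note that the missing values are dropped silently, so counts exclude them.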

mwaskom commented 3 years ago

I originally milestoned this for v0.11.1, but it seems more complicated than I expected and possibly requires, or is best handled by, upstream changes (https://github.com/matplotlib/matplotlib/issues/19139), so unfortunately I think this needs to be kicked down the road.