tritemio / pybroom

pybroom, the python's broom to tidy up messy fit results!
http://pybroom.readthedocs.io/
MIT License

Use pd.Categorical for index columns #3

tritemio closed 8 years ago

tritemio commented 8 years ago

The "index" (or categorical) columns added by pybroom when the input is a collection of fit results should be of type pd.Categorical.

tritemio commented 8 years ago

When the input is a dict of fit results (keys are the categories, values are the fit results), the "index" or "key" column in the returned DataFrame contains strings. This column should be converted to categorical, since it naturally represents a category.
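As a rough illustration of the conversion described above, here is a minimal pandas sketch. The DataFrame is hand-built to resemble a tidy result for a dict of fits; the column names (`key`, `name`, `value`) and the values are assumptions made for the example, not pybroom's actual output.

```python
import pandas as pd

# Hypothetical tidy table for a dict of fit results: 'key' holds the dict keys.
df = pd.DataFrame({
    "key": ["A", "A", "B", "B"],
    "name": ["amplitude", "tau", "amplitude", "tau"],
    "value": [1.0, 2.5, 1.2, 2.1],
})

# Convert the string column to a categorical column.
df["key"] = df["key"].astype("category")

print(df["key"].dtype)                 # category
print(list(df["key"].cat.categories))  # ['A', 'B']
```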

Conversely, when the input is a list of fit results, the "index" is an integer. Initially I thought of converting this column to categorical type as well, since it can save some space (the index is int64, while a categorical uses the smallest integer type for its internal codes). However, this approach causes problems with seaborn (see https://github.com/mwaskom/seaborn/issues/997). Briefly, seaborn always builds FacetGrid plots by looking at all the categories in the given column. So when selecting a subset of fit results (let's say index < 6, as in the example notebook), seaborn will still draw empty axes for the categories that are no longer present. The solution would be to remove the unused categories after the selection, but this requires longer and more convoluted pandas commands.
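To make the trade-off concrete, here is a small pandas-only sketch (no seaborn involved). The table and its column names are made up for illustration; the point is only that the categories survive a selection until they are removed explicitly.

```python
import pandas as pd

# Hypothetical tidy table for a list of 10 fit results.
df = pd.DataFrame({
    "index": pd.Series(range(10), dtype="int64"),
    "value": [0.1 * i for i in range(10)],
})

# Categorical version of the integer column: the internal codes
# use a small integer type instead of int64.
df_cat = df.copy()
df_cat["index"] = df["index"].astype("category")
print(df_cat["index"].cat.codes.dtype)  # int8

# Build the mask on the integer column: ordering comparisons such as `<`
# raise a TypeError on an unordered categorical.
mask = df["index"] < 6
selected = df_cat[mask].copy()

# All 10 categories are still attached to the selected rows.
n_before = len(selected["index"].cat.categories)

# The extra step needed after every selection.
selected["index"] = selected["index"].cat.remove_unused_categories()
n_after = len(selected["index"].cat.categories)

print(n_before, n_after)  # 10 6
```

With a plain integer column the selection `df[df["index"] < 6]` is all that is needed, which is the simpler workflow argued for here.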

There may be other subtle issues in using categorical for integer columns, with no clear benefit. Therefore I think it is better to leave the integer columns added by pybroom (indicating the index in a list of fit results) as integer type.