pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.69k stars 17.92k forks

ENH: Allow easy selection of ordered/unordered categorical columns #46941

Open richierocks opened 2 years ago

richierocks commented 2 years ago

Is your feature request related to a problem?

I'd like to be able to easily select only ordered categorical columns, or only unordered categorical columns, from a dataframe.

Example

Here's an example dataset:

import pandas as pd
import numpy.random as npr

n_obs = 20
eye_colors = ["blue", "brown"]
people = pd.DataFrame({
    "eye_color": npr.choice(eye_colors, size=n_obs),
    "age": npr.randint(20, 60, size=n_obs)
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)
people["eye_color"] = pd.Categorical(people["eye_color"], eye_colors)

Here, eye_color is an unordered categorical column, age_group is an ordered categorical column, and age is numeric. I want just the age_group column.

My best attempt at selecting ordered categorical columns is

categories = people.select_dtypes("category")
categories[[col for col in categories.columns if categories[col].cat.ordered]]

This solution feels overly complicated for such a simple task.

Describe the solution you'd like

There are a few options for what nicer code might look like.

If ordered and unordered categoricals had different dtypes (as in R with factor vs. ordered), then I could just write people.select_dtypes("ordered"). Unfortunately, this would be a breaking change for any existing code that assumes ordered categoricals have the "category" dtype.

If dataframe-level .cat.* methods existed, I could write something like

is_ordered = people.cat.ordered # should return [False, pd.NA, True]
people.loc[:, is_ordered & pd.notnull(is_ordered)]
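pandas has no dataframe-level .cat accessor today, so the snippet above is only a wish. As a rough sketch of what it might return, the helper below builds a nullable boolean Series from each column's dtype, with pd.NA for non-categorical columns. The name frame_ordered is hypothetical and not part of pandas, and the small fixed dataset stands in for the random one above.

```python
import pandas as pd


def frame_ordered(df: pd.DataFrame) -> pd.Series:
    """Hypothetical stand-in for a dataframe-level ``.cat.ordered``."""
    flags = [
        bool(dt.ordered) if isinstance(dt, pd.CategoricalDtype) else pd.NA
        for dt in df.dtypes
    ]
    # The nullable "boolean" dtype lets pd.NA mark non-categorical columns.
    return pd.Series(flags, index=df.columns, dtype="boolean")


people = pd.DataFrame({
    "eye_color": pd.Categorical(["blue", "brown", "blue"], ["blue", "brown"]),
    "age": [25, 37, 52],
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)

is_ordered = frame_ordered(people)
# Treat pd.NA (non-categorical columns) as False before using the mask.
ordered_only = people.loc[:, is_ordered.fillna(False).to_numpy(dtype=bool)]
```

Filling pd.NA with False before indexing plays the same role as the `& pd.notnull(is_ordered)` step in the wished-for code.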

A variation on this might be to add more specialized equivalents of pd.api.types.is_categorical_dtype(), perhaps pd.api.types.is_ordered_categorical_dtype() and pd.api.types.is_unordered_categorical_dtype().
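Neither of these functions exists in pandas. A minimal sketch of how they could behave, written as plain user-level functions, might look like the following. It assumes they accept either an array-like (anything with a .dtype attribute) or a dtype object, and it does not handle dtype strings such as "category".

```python
import pandas as pd


def is_ordered_categorical_dtype(arr_or_dtype) -> bool:
    """Hypothetical: True only for an ordered categorical."""
    # Accept either an array-like (Series, Categorical) or a dtype object.
    dtype = getattr(arr_or_dtype, "dtype", arr_or_dtype)
    return isinstance(dtype, pd.CategoricalDtype) and bool(dtype.ordered)


def is_unordered_categorical_dtype(arr_or_dtype) -> bool:
    """Hypothetical: True only for an unordered categorical."""
    dtype = getattr(arr_or_dtype, "dtype", arr_or_dtype)
    return isinstance(dtype, pd.CategoricalDtype) and not dtype.ordered


sizes = pd.Series(pd.Categorical(["S", "L"], ["S", "M", "L"], ordered=True))
colors = pd.Series(pd.Categorical(["red", "blue"]))
ages = pd.Series([25, 37])

checks = {
    "sizes": (is_ordered_categorical_dtype(sizes), is_unordered_categorical_dtype(sizes)),
    "colors": (is_ordered_categorical_dtype(colors), is_unordered_categorical_dtype(colors)),
    "ages": (is_ordered_categorical_dtype(ages), is_unordered_categorical_dtype(ages)),
}
```

Both predicates return False for a non-categorical column, so a column is never reported as "unordered" merely because it is not categorical.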

API breaking implications

The first option mentioned above would break the existing API; the other two options would not.

Additional context

I asked the internet for better solutions; no response so far.

samukweku commented 2 years ago

The specialised ideas seem a better route to take

ShaopengLin commented 1 year ago

I am new to pandas. Is there a large performance overhead for apply on a DataFrame? If not, then the specialized functions can stay consistent with the input of is_categorical_dtype(). A similar boolean array can then be obtained with df.apply(pd.api.types.is_ordered_categorical_dtype), though it will lack the pd.NA that signals non-categorical columns.

To retrieve the columns we can then simply do this: people.loc[:, people.apply(pd.api.types.is_ordered_categorical_dtype)]
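Since is_ordered_categorical_dtype() does not exist yet, the sketch below substitutes a local helper (is_ordered_cat, a hypothetical name) to show the pattern on a small fixed dataset. With the default axis=0, DataFrame.apply calls the function once per column rather than once per row, so the cost should grow with the number of columns, not the number of rows. Mapping the helper over people.dtypes is an alternative that looks only at the dtypes.

```python
import pandas as pd


def is_ordered_cat(arr_or_dtype) -> bool:
    # Hypothetical stand-in for the proposed is_ordered_categorical_dtype().
    dtype = getattr(arr_or_dtype, "dtype", arr_or_dtype)
    return isinstance(dtype, pd.CategoricalDtype) and bool(dtype.ordered)


people = pd.DataFrame({
    "eye_color": pd.Categorical(["blue", "brown", "blue"], ["blue", "brown"]),
    "age": [25, 37, 52],
})
people["age_group"] = pd.cut(people["age"], [20, 30, 40, 50, 60], right=False)

# Default axis=0: the helper receives each column as a Series.
mask_apply = people.apply(is_ordered_cat).astype(bool)

# Same mask computed from the dtypes alone, without touching the data.
mask_dtypes = people.dtypes.map(is_ordered_cat).astype(bool)

ordered_only = people.loc[:, mask_apply]
```

Both masks are plain boolean Series indexed by column name, so either can be passed to .loc to keep only the ordered categorical columns.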