pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: find categorical code against categorical label/value #48766

Open stevenlis opened 2 years ago

stevenlis commented 2 years ago

Feature Type

Problem Description

I wish I could look up the underlying code for a given value of a categorical column directly, without indexing the dataframe and using cat.codes.

Assume I have the following dataframe

import pandas as pd
from pandas.api.types import CategoricalDtype

data = {
    'quarter': ['2019Q4', '2020Q1', '2020Q2', '2020Q3'],
    'num': [12, 23, 34, 67]
}
df = pd.DataFrame(data=data)

cat = CategoricalDtype(categories=data['quarter'], ordered=True)
df.quarter = df.quarter.astype(cat)

I need to select all the rows after 2020Q2. To do that I first have to find the underlying code of the value/label 2020Q2, but the only way I can do so is to filter the dataframe on that value, call cat.codes on the result, and then index the returned array to get its first element. This is a little bit tedious.

c = df[df.quarter == '2020Q2'].quarter.cat.codes.values[0]
df[df.quarter.cat.codes > c]
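For comparison, here is a minimal sketch of a shorter lookup using calls that already exist in pandas. It relies on the fact that a category's code is its position in dtype.categories, so Index.get_loc returns the code without filtering any rows. The variable names (quarters, code, later) are illustrative only.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

quarters = ['2019Q4', '2020Q1', '2020Q2', '2020Q3']
df = pd.DataFrame({'quarter': quarters, 'num': [12, 23, 34, 67]})
df['quarter'] = df['quarter'].astype(
    CategoricalDtype(categories=quarters, ordered=True)
)

# The code of a category is its zero-based position in dtype.categories,
# so the lookup does not need to touch the rows of the dataframe.
code = df['quarter'].dtype.categories.get_loc('2020Q2')

# Same selection as the two-step version: rows coded after '2020Q2'.
later = df[df['quarter'].cat.codes > code]
```

This only removes the row filtering step; the comparison still goes through cat.codes.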

Feature Description

Right now, df.quarter.dtype.categories only returns the categories themselves, as an Index:

Index(['2019Q4', '2020Q1', '2020Q2', '2020Q3'], dtype='object')

It would be great if there were an attribute that returned a map of categories and codes together in a dictionary, so that users could simply find the codes by using categories as dict keys. For example,

df.quarter.dtype.cats_codes

returns

{'2019Q4': 0, '2020Q1': 1, '2020Q2': 2, '2020Q3': 3}
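As a rough sketch, a mapping like this can already be built from existing attributes. Note that cats_codes below is a plain local variable standing in for the proposed attribute, which does not exist in pandas, and that pandas codes are zero-based, so the mapping starts at 0.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

quarters = ['2019Q4', '2020Q1', '2020Q2', '2020Q3']
dtype = CategoricalDtype(categories=quarters, ordered=True)

# Hypothetical stand-in for the proposed attribute: map each category to its
# zero-based position in dtype.categories, which is the code pandas stores.
cats_codes = {category: code for code, category in enumerate(dtype.categories)}

# Cross-check against the codes pandas actually assigns to the values.
actual_codes = pd.Series(quarters, dtype=dtype).cat.codes.tolist()
```

The dictionary is a snapshot: if the categories are later reordered or extended, it has to be rebuilt.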

Alternative Solutions

Maybe it could also be a get_cat_code() function in the pandas API, so that users could pass in a category and get back the underlying code, such as get_cat_code(cat='2020Q2')

Additional Context

No response

jreback commented 2 years ago

why? what are you actually trying to do

the codes are an implementation detail

stevenlis commented 2 years ago

@jreback As I explained, to select rows above or below a certain code when you have an ordered categorical column.

jreback commented 2 years ago

these labels should already respond to the full suite of comparators eg

df[df.ordered_cat > 'value1'] should select the rows whose value is greater than 'value1' in code space
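Applied to the dataframe from the issue description, that semantic selection might look like the sketch below. It assumes the column is an ordered categorical and that the label being compared against is one of its categories.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

quarters = ['2019Q4', '2020Q1', '2020Q2', '2020Q3']
df = pd.DataFrame({'quarter': quarters, 'num': [12, 23, 34, 67]})
df['quarter'] = df['quarter'].astype(
    CategoricalDtype(categories=quarters, ordered=True)
)

# Comparison follows the category order, not alphabetical order,
# so no explicit code lookup is needed.
after = df[df['quarter'] > '2020Q2']
from_q2 = df[df['quarter'] >= '2020Q2']
```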

stevenlis commented 2 years ago

Indeed, you can do a semantic selection with a categorical, but a code lookup might still be helpful for, let's say, "3 quarters after ...". You could simply add 3 to a code. Right now, as far as I know, to do that you have to look up the position in the categories with something like list(df.quarter.dtype.categories).index(value1) + 3 and then find the value/item at that position.
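A minimal sketch of that "N categories later" lookup with existing calls follows. The names start, offset, and target are illustrative, and the offset is assumed to stay within the bounds of the categories (otherwise indexing raises IndexError).

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

quarters = ['2019Q4', '2020Q1', '2020Q2', '2020Q3']
df = pd.DataFrame({'quarter': quarters, 'num': [12, 23, 34, 67]})
df['quarter'] = df['quarter'].astype(
    CategoricalDtype(categories=quarters, ordered=True)
)

categories = df['quarter'].dtype.categories
offset = 3

# Position of the starting label in the category order (equal to its code).
start = categories.get_loc('2019Q4')

# Label that sits `offset` categories later; assumes it is in range.
target = categories[start + offset]

# Back to a semantic selection on the label itself.
selected = df[df['quarter'] >= target]
```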

jreback commented 2 years ago

again these are an implementation detail - you can use them but -1 on adding api beyond what already exists

the semantic selections are pretty useful here ; it's not clear why you cannot simply use these

stevenlis commented 2 years ago

It does not exist... the codes have more use cases than just being an implementation detail. For example, if you need to run a regression model, you can simply use cat.codes to make your input numerical instead of string. It would be helpful to be able to figure out what the code is for each value of that variable. Right now, there is no easy way to know how each of the values is coded other than cat.codes, which is a Series accessor, so you have to index your entire dataframe to use it.

WillAyd commented 2 years ago

I think this could also be useful when you want to maintain a CategoricalDtype for roundtripping some IO formats. With SQL as an example, the CategoricalDtype does the "right thing" when you just build a dataframe and write it, but if you want to issue a WHERE clause on the way back that filters to only a subset of your dtype, it becomes difficult to get access to those codes.

I could see it being useful for CategoricalDtype to behave more like an Enum in this instance

WillAyd commented 2 years ago

@StevenLi-DS to clarify this is what I have in mind:

enum.Enum("AnEnum", cat)

Currently this yields TypeError: 'CategoricalDtype' object is not iterable but might be a natural Pythonic way to get what you are ultimately after, without requiring pandas to generate a larger API footprint. Would you be interested in exploring that more?
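Until something like that is supported, a comparable result can be sketched by building the enum from cat.categories by hand. This is only an illustration of the idea, not an existing pandas feature; the enum name Quarter and the choice of IntEnum are assumptions made here.

```python
import enum

from pandas.api.types import CategoricalDtype

cat = CategoricalDtype(
    categories=['2019Q4', '2020Q1', '2020Q2', '2020Q3'], ordered=True
)

# The dtype itself is not iterable, so pass a name -> code mapping built
# from cat.categories; codes are the zero-based category positions.
Quarter = enum.IntEnum(
    'Quarter', {name: code for code, name in enumerate(cat.categories)}
)

# Labels that are not valid identifiers are looked up by item access.
code = Quarter['2020Q2'].value
label = Quarter(code + 1).name
```

Because the members are integers, arithmetic such as moving one category forward works directly on the codes.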

stevenlis commented 2 years ago

Thank you @WillAyd. I'm not familiar with enum, but I think returning a dict would give us more usability and flexibility.

JgLemos commented 2 years ago

Hello, is this issue available to take?

WillAyd commented 2 years ago

@JgLemos sure still open

JgLemos commented 2 years ago

take