pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Extract Underlying Coding from a Categorical or Enum Datatype #18500

Open stevenlis opened 2 weeks ago

stevenlis commented 2 weeks ago

Description

Sometimes I need to extract the underlying coding from a categorical or enum datatype, but currently, I cannot find a good way to do this. Could we add a method for it?

The pandas equivalent is the .cat.categories.get_loc() method:

import pandas as pd

df = pd.DataFrame({'col': ['a', 'a', 'b', 'b', 'c', 'd']})
df['col'] = df['col'].astype('category')
print(df['col'].cat.categories.get_loc('c'))
# Output: 2

orlp commented 2 weeks ago

You can cast to pl.UInt32 or use the .to_physical() method.

>>> import polars as pl
>>> df = pl.DataFrame({'x': ['a', 'a', 'b', 'b', 'c', 'd']}, schema={'x': pl.Categorical})
>>> df.with_columns(i = pl.col.x.to_physical())
shape: (6, 2)
┌─────┬─────┐
│ x   ┆ i   │
│ --- ┆ --- │
│ cat ┆ u32 │
╞═════╪═════╡
│ a   ┆ 0   │
│ a   ┆ 0   │
│ b   ┆ 1   │
│ b   ┆ 1   │
│ c   ┆ 2   │
│ d   ┆ 3   │
└─────┴─────┘
stevenlis commented 2 weeks ago

@orlp I'm aware of to_physical, but this is a different request. It returns the codes for every row, whereas, as in my example, I only want the single code for a given string. On a very large dataframe, you have to filter and then call .unique() or .first(), which is tedious and inefficient if you need to look up codes frequently.

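import polars as pl

df = pl.DataFrame(
    {'x': ['a', 'a', 'b', 'b', 'c', 'd']}, schema={'x': pl.Categorical}
)
# Current workaround: filter to the rows for 'b', then take the first physical code
df.filter(pl.col('x') == 'b').select(pl.col('x').to_physical().first())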