Open stevenlis opened 2 weeks ago
You can cast to pl.UInt32 or use the .to_physical() method.
>>> import polars as pl
>>> df = pl.DataFrame({'x': ['a', 'a', 'b', 'b', 'c', 'd']}, schema={'x': pl.Categorical})
>>> df.with_columns(i = pl.col.x.to_physical())
shape: (6, 2)
┌─────┬─────┐
│ x   ┆ i   │
│ --- ┆ --- │
│ cat ┆ u32 │
╞═════╪═════╡
│ a   ┆ 0   │
│ a   ┆ 0   │
│ b   ┆ 1   │
│ b   ┆ 1   │
│ c   ┆ 2   │
│ d   ┆ 3   │
└─────┴─────┘
@orlp I'm aware of to_physical, but this is different. It returns the codes for every row, whereas, as I showed in the example, I only want the single code that corresponds to a given string. With a very large dataframe, you have to filter and then use .unique or .first, which is tedious and inefficient if you need to look up codes frequently.
import polars as pl

df = pl.DataFrame(
    {'x': ['a', 'a', 'b', 'b', 'c', 'd']}, schema={'x': pl.Categorical}
)
# Filter to the rows equal to 'b', then take the first physical code.
df.filter(pl.col('x') == 'b').select(pl.col('x').to_physical().first())
Description
Sometimes I need to extract the underlying code for a value of a categorical or enum datatype, but I currently cannot find a good way to do this. Could we add a method for it?
The pandas equivalent is the .cat.categories.get_loc() method.
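For reference, a small sketch of the pandas behaviour being described, reusing the same values as the Polars examples in this thread (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'b', 'c', 'd'], dtype='category')

# .cat.categories is an Index of the distinct categories;
# get_loc returns the position of one label, which is its integer code.
code_b = s.cat.categories.get_loc('b')
print(code_b)  # 1

# For comparison, .cat.codes gives the code of every row,
# which is the analogue of to_physical() in Polars.
all_codes = s.cat.codes.tolist()
```

The lookup takes a single string and returns a single integer, with no filtering of the data involved.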