pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Allow creating a dataframe with integer categorical values #10430

Open vlulla opened 1 year ago

vlulla commented 1 year ago

Problem description

Consider this brief code segment:

import pandas as pd, polars as pl, pyarrow as pa
hourOfDay = list(range(24))
df = pd.DataFrame({"hod": pd.Categorical([0,1,2,3,], categories=hourOfDay, ordered=True)})
pa_df = pa.Table.from_pandas(df)
pa_df

df_pl = pl.from_dataframe(df) # raises "ComputeError: only string-like values are supported in dictionaries"

I don't understand this restriction. pyarrow allows dictionaries with integer values! Both R and pandas also allow creating categorical columns with integer values.

By the way, I am aware that I can work around this by using rename_categories ... but I'm wondering what the basis for this restriction is!

## workaround
df['hod'] = df['hod'].cat.rename_categories({c:str(c) for c in df['hod'].cat.categories})
df_pl = pl.from_dataframe(df) ## will work...
df_pl

More importantly, is there a possibility that this restriction could be relaxed?

Thanks in advance!

alexander-beedie commented 1 year ago

It looks like what you want here is perhaps more of an Enum, in order to constrain the allowed values? And unfortunately we don't currently support Enums. Pandas treats Enums/Categoricals as the same thing (which I don't disagree with, as the concept of "category" can certainly encompass Enum use), whereas we currently treat Categoricals as essentially a string optimisation.
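To make the "string optimisation" point concrete, here is a minimal, purely conceptual sketch of dictionary encoding: each distinct string is stored once and the column holds small integer codes. The helper name `dictionary_encode` is hypothetical, and this is not a description of how polars implements Categoricals internally.

```python
def dictionary_encode(values):
    """Return (categories, codes) for a list of hashable values.

    Hypothetical helper for illustration only; not part of polars.
    """
    categories = []   # each distinct value, in first-seen order
    index = {}        # value -> position in `categories`
    codes = []        # one integer code per input value
    for value in values:
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return categories, codes


# Repeated strings are stored once; the column itself is just integers.
categories, codes = dictionary_encode(["low", "high", "low", "low"])
```

An Enum, by contrast, would fix the list of categories up front instead of discovering it from the data.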

I'll leave it to @ritchie46 to opine on the possibility of extending our current Categorical support further; I imagine it's not a small endeavour, though it's clearly useful ;)

vlulla commented 1 year ago

I see! It is also worth considering that booleans (true/false values) and integers (1/0 for true/false in bit mask images, and class codes in classified images such as the land use/land cover maps used in ecology) are quite commonly used as categoricals. So, even if only string categories are supported, it might be worthwhile to have a few examples, probably in the user guide or docstrings, showing how to convert these non-string categoricals into polars categoricals. I anticipate that these examples will be useful for scientists who use R and pandas for modeling (ecology, earth science... in my experience) and would like to incorporate polars into their workflow.

Anyway, polars is awesome and every time I use it I like it more! Thank you for your consideration.

alexander-beedie commented 9 months ago

Ref: https://github.com/pola-rs/polars/pull/11822

orlp commented 9 months ago

Enum support is on its way; I suppose we can then eventually support an Enum with integer entries.

stinodego commented 7 months ago

This is something we would like to support.