pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add `.cat.remove_categories` to remove unused or specified categories from the revmap #14986

Open mcrumiller opened 8 months ago

mcrumiller commented 8 months ago

Description

Often when one is picking out subsets of categorical variables, it is desirable to remove the unused categories:

import polars as pl

possible_guests = pl.Series(["John", "Quincy", "Sarah", "Evelyn"], dtype=pl.Categorical)
responded_yes = ["John", "Evelyn"]
attending = possible_guests.filter(possible_guests.is_in(responded_yes))

# we still have categories for Quincy and Sarah, even though they are no longer used
print(attending.cat.get_categories())
shape: (4,)
Series: '' [str]
[
        "John"
        "Quincy"
        "Sarah"
        "Evelyn"
]

Remove unused categories

>>> attending.cat.remove_categories().cat.get_categories()
shape: (2,)
Series: '' [str]
[
        "John"
        "Evelyn"
]

Remove specified categories

Note that values whose category was removed are converted to null.


>>> a_new = attending.cat.remove_categories(["John", "Sarah"])
>>> a_new
shape: (2,)
Series: '' [cat]
[
        null
        "Evelyn"
]
>>> a_new.cat.get_categories()
shape: (2,)
Series: '' [str]
[
        "Quincy"
        "Evelyn"
]

c-peters commented 8 months ago

Categoricals are represented as an array in the background for performance reasons (to avoid a hashmap lookup every time we go from physical to encoding). Removing categories is therefore an expensive operation, as it would require re-encoding the physicals to the new categorical array.

I'm hesitant to include features like these as they appear cheap on the outside, but are quite expensive to run.
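
To make the cost concrete, here is a minimal pure-Python sketch of the re-encoding step described above. It is not the polars implementation: the function name `remove_categories`, the list-based `categories` array, and the integer `codes` (standing in for the physicals) are all assumptions made for illustration. It covers both cases proposed in this issue, dropping unused categories and dropping specified ones.

```python
from typing import List, Optional, Sequence, Tuple


def remove_categories(
    categories: Sequence[str],
    codes: Sequence[Optional[int]],
    to_remove: Optional[Sequence[str]] = None,
) -> Tuple[List[str], List[Optional[int]]]:
    """Hypothetical helper: drop categories and re-encode the codes.

    `categories` plays the role of the category array and `codes` the
    physical values indexing into it (None means null).
    """
    if to_remove is None:
        # No explicit list: keep only categories that some row refers to.
        used = {c for c in codes if c is not None}
        keep = [i in used for i in range(len(categories))]
    else:
        drop = set(to_remove)
        keep = [cat not in drop for cat in categories]

    # Build the new category array and an old-index -> new-index mapping.
    remap = {}
    new_categories: List[str] = []
    for old_idx, cat in enumerate(categories):
        if keep[old_idx]:
            remap[old_idx] = len(new_categories)
            new_categories.append(cat)

    # The expensive part: every row must be visited and rewritten.
    # Codes whose category was removed become None (null).
    new_codes = [None if c is None else remap.get(c) for c in codes]
    return new_categories, new_codes


categories = ["John", "Quincy", "Sarah", "Evelyn"]
attending_codes = [0, 3]  # "John", "Evelyn"

print(remove_categories(categories, attending_codes))
print(remove_categories(categories, attending_codes, ["John", "Sarah"]))
```

Building the new category array is proportional to the number of categories, but rewriting the codes is a full pass over every row, which is why the operation is more expensive than its name suggests.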