pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Unordered enum data type #16699

Open butterlyn opened 4 months ago

butterlyn commented 4 months ago

Description

Following on from the suggestion in https://github.com/pola-rs/polars/issues/16689

Add a boolean parameter ordered to polars.Enum so that Enums can be compared for equality irrespective of their category order.

The following should raise no errors:

import polars as pl

assert pl.Enum(categories=["yes", "no"], ordered=False) == pl.Enum(categories=["no", "yes"], ordered=True)
assert pl.Enum(categories=["yes", "no"], ordered=True) != pl.Enum(categories=["no", "yes"], ordered=True)
assert pl.Enum(categories=["yes", "no"], ordered=False) == pl.Enum(categories=["no", "yes"], ordered=False)

Example use case - unit testing

The intended purpose is to allow defining an unordered pl.Enum in unit tests, for use in columns of a DataFrame/LazyFrame passed to polars.testing.assert_frame_equal. The unit test should check that each column is cast to the correct pl.Enum, without caring about the order in which the source code defines the enum categories.

For example, for my_module:

import polars as pl

source_code_dataframe = pl.DataFrame(
    data={
        "enum_column": ["yes", "yes", "no"]
    },
    schema={
        "enum_column": pl.Enum(["no", "yes"]),  # defaults to ordered=True
    }
)

We could write a unit test:

import polars as pl
from polars.testing import assert_frame_equal

from my_module import source_code_dataframe

def test_dataframe():
    expected = pl.DataFrame(
        data={
            "enum_column": ["yes", "yes", "no"]
        },
        schema={
            "enum_column": pl.Enum(["yes", "no"], ordered=False),  # specify that different enum category order shouldn't raise a dtype error
        }
    )
    assert_frame_equal(source_code_dataframe, expected)

stinodego commented 4 months ago

If you just need this for unit tests, you can just write your own assertion util for this, e.g. cast your Enums to strings before checking equality, and then check Enum categories separately.

Still this is an important variant of categorical data that we should support.

butterlyn commented 4 months ago

Thanks @stinodego, yeah I mostly need this for unit testing

If you recommend casting Enums to strings when using polars.testing.assert_frame_equal, perhaps you'd consider reopening this: https://github.com/pola-rs/polars/issues/16075? 😁 No fuss if not, happy to make do in the meantime since unordered enum can serve the same purpose

stinodego commented 4 months ago

> If you recommend casting Enums to strings when using polars.testing.assert_frame_equal, perhaps you'd consider reopening this: https://github.com/pola-rs/polars/issues/16075? 😁 No fuss if not, happy to make do in the meantime since unordered enum can serve the same purpose

No, because the point I made there still stands :)

mcrumiller commented 4 months ago

@butterlyn for now you can just make sure to always sort your categories before creating your Enum dtype.

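import polars as pl
from polars.testing import assert_series_equal

def sorted_enum(categories):
    # Sort the categories so that any input order yields the same Enum dtype
    return pl.Enum(sorted(categories))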

assert_series_equal(
    pl.Series(["a"], dtype=sorted_enum(["a", "b", "c"])),
    pl.Series(["a"], dtype=sorted_enum(["b", "c", "a"])),  # different order
)