pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

min and max return all nulls in pl.Enum #18394

Open stevenlis opened 2 months ago

stevenlis commented 2 months ago

Description

As of polars 1.5.0, min() (and max()) on a pl.Enum column returns all nulls when used with .over():

import polars as pl

df = pl.DataFrame(
    {'id': ['a', 'a', 'b', 'b', 'c', 'c'],
     'degree': ['low', 'high', 'high', 'mid', 'mid', 'low']}
).with_columns(
    pl.col('degree').cast(pl.Enum(['low', 'mid', 'high']))
).with_columns(
    # returns all nulls
    pl.col('degree').min().over('id').alias('lowest_degree')
)
shape: (6, 3)
┌─────┬────────┬───────────────┐
│ id  ┆ degree ┆ lowest_degree │
│ --- ┆ ---    ┆ ---           │
│ str ┆ enum   ┆ enum          │
╞═════╪════════╪═══════════════╡
│ a   ┆ low    ┆ null          │
│ a   ┆ high   ┆ null          │
│ b   ┆ high   ┆ null          │
│ b   ┆ mid    ┆ null          │
│ c   ┆ mid    ┆ null          │
│ c   ┆ low    ┆ null          │
└─────┴────────┴───────────────┘

Expected output:

shape: (6, 3)
┌─────┬────────┬───────────────┐
│ id  ┆ degree ┆ lowest_degree │
│ --- ┆ ---    ┆ ---           │
│ str ┆ enum   ┆ enum          │
╞═════╪════════╪═══════════════╡
│ a   ┆ low    ┆ low           │
│ a   ┆ high   ┆ low           │
│ b   ┆ high   ┆ mid           │
│ b   ┆ mid    ┆ mid           │
│ c   ┆ mid    ┆ low           │
│ c   ┆ low    ┆ low           │
└─────┴────────┴───────────────┘

For now, one has to use .to_physical() for the comparison. By the way, some expressions, such as .first(), do work.
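To make the idea behind the .to_physical() workaround concrete, here is a minimal pure-Python sketch (deliberately not polars code) of what comparing on the physical representation amounts to. It assumes that an Enum's physical value is the position of the category in the declared list; the helper name `group_min_by_physical` is made up for illustration.

```python
from collections import defaultdict

categories = ['low', 'mid', 'high']          # same order as in pl.Enum([...])
ids = ['a', 'a', 'b', 'b', 'c', 'c']
degrees = ['low', 'high', 'high', 'mid', 'mid', 'low']


def group_min_by_physical(keys, values, categories):
    """Per-group minimum, ordered by category position (hypothetical helper)."""
    code_of = {cat: i for i, cat in enumerate(categories)}
    # Step 1: replace each label by its integer code (the "physical" value).
    codes = [code_of[v] for v in values]
    # Step 2: take the smallest code within each group.
    lowest = defaultdict(lambda: len(categories))
    for key, code in zip(keys, codes):
        lowest[key] = min(lowest[key], code)
    # Step 3: broadcast back to every row and map codes to labels again.
    return [categories[lowest[key]] for key in keys]


lowest_degree = group_min_by_physical(ids, degrees, categories)
print(lowest_degree)
```

The result lines up with the lowest_degree column in the expected table above: the comparison happens on the integer codes, and the labels are only restored at the end.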

cmdlineluser commented 2 months ago

Can reproduce. (Perhaps this was supposed to be labelled as a bug?)

It seems they're only broken in a group_by context?

Although in a "working" case (select), the return type is str:

>>> df.select(pl.col('degree').min())
# shape: (1, 1)
# ┌────────┐
# │ degree │
# │ ---    │
# │ str    │ # <- ???
# ╞════════╡
# │ low    │
# └────────┘
>>> df.group_by('id').agg(pl.col('degree').min())
# shape: (3, 2)
# ┌─────┬────────┐
# │ id  ┆ degree │
# │ --- ┆ ---    │
# │ str ┆ enum   │
# ╞═════╪════════╡
# │ a   ┆ null   │
# │ c   ┆ null   │
# │ b   ┆ null   │
# └─────┴────────┘

cmdlineluser commented 2 weeks ago

After https://github.com/pola-rs/polars/issues/19269 the dtype is now correct in a select context.

df.select(pl.col('degree').min())
# shape: (1, 1)
# ┌────────┐
# │ degree │
# │ ---    │
# │ enum   │
# ╞════════╡
# │ low    │
# └────────┘

The group_by still returns nulls.

df.group_by("id").min()
# shape: (3, 3)
# ┌─────┬────────┬───────────────┐
# │ id  ┆ degree ┆ lowest_degree │
# │ --- ┆ ---    ┆ ---           │
# │ str ┆ enum   ┆ enum          │
# ╞═════╪════════╪═══════════════╡
# │ b   ┆ null   ┆ null          │
# │ a   ┆ null   ┆ null          │
# │ c   ┆ null   ┆ null          │
# └─────┴────────┴───────────────┘

It seems the group_by path looks for agg_min() / agg_max(), which don't appear to be implemented for CategoricalChunked?

https://github.com/pola-rs/polars/tree/main/crates/polars-core/src/series/implementations
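
For illustration only, the following pure-Python sketch shows roughly what a group-wise minimum over categorical data would need to do: reduce the integer codes per group, skip nulls, and keep the category mapping so the result can be shown as labels again. It assumes groups arrive as lists of row indices; `agg_min_codes` is a hypothetical name, and this is not the polars implementation.

```python
from typing import Optional, Sequence


def agg_min_codes(
    codes: Sequence[Optional[int]],
    groups: Sequence[Sequence[int]],
) -> list[Optional[int]]:
    """Minimum code per group, ignoring nulls (None); hypothetical helper."""
    out: list[Optional[int]] = []
    for idx in groups:
        present = [codes[i] for i in idx if codes[i] is not None]
        # An empty or all-null group has no minimum, so it yields null.
        out.append(min(present) if present else None)
    return out


categories = ['low', 'mid', 'high']
codes = [0, 2, 2, 1, 1, 0]            # physical values of the 'degree' column
groups = [[0, 1], [2, 3], [4, 5]]     # row indices for ids 'a', 'b', 'c'

min_codes = agg_min_codes(codes, groups)
# Map the reduced codes back to labels using the unchanged category list.
labels = [None if c is None else categories[c] for c in min_codes]
print(labels)
```

In this model a null result should only appear for a group with no non-null values, which is not the case for any group in the example data.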