pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Mean of all-null groups is NaN when group_by is partitioned #16020

Open fredrikmalmfors opened 2 weeks ago

Issue description

The mean() of groups consisting of nulls only is NaN instead of null when the partitioned group_by method is used.

Reproducible example

partitioned hash aggregation

A df with 1000 or more rows triggers the partitioned aggregation method

import polars as pl

df = pl.DataFrame(
    {"category": [1] * 1000, "value": [None] * 1000},
    schema={"category": pl.Int8, "value": pl.Float64},
)
df.group_by('category').mean()

result is NaN (expected null)

estimated unique values: 1
run PARTITIONED HASH AGGREGATION
group_by keys are sorted; running sorted key fast path
shape: (1, 2)
┌──────────┬───────┐
│ category ┆ value │
│ ---      ┆ ---   │
│ i8       ┆ f64   │
╞══════════╪═══════╡
│ 1        ┆ NaN   │
└──────────┴───────┘

default hash aggregation

A df with fewer than 1000 rows uses the default hash aggregation

df = pl.DataFrame(
    {"category": [1] * 999, "value": [None] * 999},
    schema={"category": pl.Int8, "value": pl.Float64},
)
df.group_by('category').mean()

result is null as expected

DATAFRAME < 1000 rows: running default HASH AGGREGATION
shape: (1, 2)
┌──────────┬───────┐
│ category ┆ value │
│ ---      ┆ ---   │
│ i8       ┆ f64   │
╞══════════╪═══════╡
│ 1        ┆ null  │
└──────────┴───────┘

streaming hash aggregation (0.20.19 and earlier)

The now-removed streaming hash aggregation produced the expected result.

df = pl.DataFrame(
    {"category": [1] * 1000, "value": [None] * 1000},
    schema={"category": pl.Int8, "value": pl.Float64},
)
df.group_by('category').mean()

result is null as expected

estimated unique values: 1
run STREAMING HASH AGGREGATION
RUN STREAMING PIPELINE
[df -> primitive_group_by -> ordered_sink]
shape: (1, 2)
┌──────────┬───────┐
│ category ┆ value │
│ ---      ┆ ---   │
│ i8       ┆ f64   │
╞══════════╪═══════╡
│ 1        ┆ null  │
└──────────┴───────┘
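
A possible explanation, stated as an assumption and not a confirmed diagnosis: a partitioned mean is commonly computed by merging per-partition (sum, count) pairs and dividing at the end. If the final division is not guarded against a zero count of non-null values, IEEE 754 arithmetic gives 0.0 / 0 = NaN instead of null. The pure-Python toy model below illustrates that idea. It is not Polars code, and every name in it is hypothetical.

```python
import math
from typing import Iterable, Optional, Tuple

# (sum of non-null values, count of non-null values) for one partition
Partial = Tuple[float, int]


def partial_state(values: Iterable[Optional[float]]) -> Partial:
    """Accumulate sum and non-null count for one partition, skipping nulls."""
    total, count = 0.0, 0
    for v in values:
        if v is not None:
            total += v
            count += 1
    return total, count


def merge(a: Partial, b: Partial) -> Partial:
    """Combine the partial states of two partitions."""
    return a[0] + b[0], a[1] + b[1]


def finalize_unguarded(state: Partial) -> float:
    """Divide without a null check.

    Python raises ZeroDivisionError for 0.0 / 0, so the IEEE 754 result
    (NaN) is spelled out explicitly to mimic a native float division.
    """
    total, count = state
    return total / count if count else math.nan


def finalize_guarded(state: Partial) -> Optional[float]:
    """Return None (null) when the group has no non-null values."""
    total, count = state
    return total / count if count else None


if __name__ == "__main__":
    # One all-null group of 1000 rows, split across two partitions.
    state = merge(partial_state([None] * 500), partial_state([None] * 500))
    print(finalize_unguarded(state), finalize_guarded(state))
```

Under this assumption, the guarded finalizer matches what the default and streaming paths return.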

Installed versions

```
--------Version info---------
Polars: 0.20.23
Index type: UInt32
Platform: Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx: 0.3.2
deltalake: 0.17.2
fastexcel: 0.10.2
fsspec: 2024.2.0
gevent:
hvplot: 0.9.2
matplotlib: 3.8.3
nest_asyncio: 1.5.8
numpy: 1.24.3
openpyxl: 3.1.2
pandas: 2.2.2
pyarrow: 15.0.2
pydantic: 2.5.0
pyiceberg:
pyxlsb: 1.0.10
sqlalchemy: 2.0.25
xlsx2csv: 0.8.2
xlsxwriter: 3.1.2
```