pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`mean` introduces `NaN` values inside `agg` #18467

Open xuJ14 opened 2 months ago

xuJ14 commented 2 months ago

Reproducible example

data: test.parquet.zip

Snippet 1:

import polars as pl
import datetime as dt

df = pl.read_parquet('./test.parquet')
df.group_by('datetime', 'cat', maintain_order=True).agg(pl.col('a').mean())

Output: (last few rows)

2024-08-14 10:30:00 "0" NaN
2024-08-14 11:00:00 "0" NaN
2024-08-14 11:30:00 "0" NaN
2024-08-14 14:00:00 "0" NaN
2024-08-14 14:30:00 "0" NaN

The last few values in column "a" are NaN, which should not happen: there is no NaN anywhere in this column.
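
As a quick sanity check (a minimal sketch, assuming column "a" in the attached file is a float column), you can confirm the raw data contains no NaN before aggregating:

import polars as pl

df = pl.read_parquet('./test.parquet')
# is_nan() is null for null entries, and any() ignores nulls by default,
# so this prints false when the column has no actual NaN values.
print(df.select(pl.col('a').is_nan().any()))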

However, if you first apply a filter (the second snippet):

(
    df.filter(pl.col('datetime') >= dt.datetime(2024, 8, 13, 0))
      .group_by('datetime', 'cat', maintain_order=True)
      .agg(pl.col('a').mean())
)

Output:

2024-08-14 10:30:00 "0" null
2024-08-14 11:00:00 "0" null
2024-08-14 11:30:00 "0" null
2024-08-14 14:00:00 "0" null
2024-08-14 14:30:00 "0" null

All the NaNs are now null, which is the expected behavior.
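
For context on why null is the expected result here: Polars' mean skips nulls, and a group containing only nulls aggregates to null, so NaN should only appear when NaN is actually present in the input. A minimal sketch with synthetic data (not the attached parquet):

import polars as pl

df = pl.DataFrame({
    'key': ['x', 'x', 'y', 'y'],
    'a': [1.0, 3.0, None, None],
})
# 'x' averages to 2.0; 'y' contains only nulls, so its mean is null, not NaN.
print(df.group_by('key', maintain_order=True).agg(pl.col('a').mean()))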

Log output

The output of the first snippet:
estimated unique values: 2
run PARTITIONED HASH AGGREGATION

The output of the second snippet:
dataframe filtered
DATAFRAME < 1000 rows: running default HASH AGGREGATION
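
(For anyone reproducing this: the lines above come from Polars' verbose query logging, which, as far as I know, is enabled via the POLARS_VERBOSE environment variable before the query runs:)

import os

# Enable verbose query logs; must be set before the query executes.
os.environ['POLARS_VERBOSE'] = '1'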

Expected behavior

There should be no NaN in this example.

Installed versions

--------Version info---------
Polars:               1.6.0
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.9.6 (default, Feb 3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle           3.0.0
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib            3.8.0
nest_asyncio          1.5.7
numpy                 1.24.4
openpyxl
pandas                2.0.3
pyarrow               15.0.2
pydantic
pyiceberg
sqlalchemy            2.0.31
torch                 2.0.1
xlsx2csv
xlsxwriter
xuJ14 commented 2 months ago

Any comment?

cmdlineluser commented 2 months ago

I can reproduce your example.

(df.group_by('datetime', 'cat', maintain_order=True)
   .mean()
   .tail()
)
# shape: (5, 3)
# ┌─────────────────────┬─────┬─────┐
# │ datetime            ┆ cat ┆ a   │
# │ ---                 ┆ --- ┆ --- │
# │ datetime[μs]        ┆ cat ┆ f64 │
# ╞═════════════════════╪═════╪═════╡
# │ 2024-08-14 10:30:00 ┆ 0   ┆ NaN │
# │ 2024-08-14 11:00:00 ┆ 0   ┆ NaN │
# │ 2024-08-14 11:30:00 ┆ 0   ┆ NaN │
# │ 2024-08-14 14:00:00 ┆ 0   ┆ NaN │
# │ 2024-08-14 14:30:00 ┆ 0   ┆ NaN │
# └─────────────────────┴─────┴─────┘

Some things I noticed: casting the Categorical column to String makes the issue go away.

(df.group_by('datetime', pl.col('cat').cast(pl.String), maintain_order=True)
   .mean()
   .tail()
)
# shape: (5, 3)
# ┌─────────────────────┬─────┬──────┐
# │ datetime            ┆ cat ┆ a    │
# │ ---                 ┆ --- ┆ ---  │
# │ datetime[μs]        ┆ str ┆ f64  │
# ╞═════════════════════╪═════╪══════╡
# │ 2024-08-14 10:30:00 ┆ 0   ┆ null │
# │ 2024-08-14 11:00:00 ┆ 0   ┆ null │
# │ 2024-08-14 11:30:00 ┆ 0   ┆ null │
# │ 2024-08-14 14:00:00 ┆ 0   ┆ null │
# │ 2024-08-14 14:30:00 ┆ 0   ┆ null │
# └─────────────────────┴─────┴──────┘

It does not happen with `.over()` either:

df.with_columns(pl.col("a").mean().over('datetime', 'cat')).tail()
# shape: (5, 3)
# ┌─────────────────────┬─────┬──────┐
# │ datetime            ┆ cat ┆ a    │
# │ ---                 ┆ --- ┆ ---  │
# │ datetime[μs]        ┆ cat ┆ f64  │
# ╞═════════════════════╪═════╪══════╡
# │ 2024-08-14 14:30:00 ┆ 0   ┆ null │
# │ 2024-08-14 14:30:00 ┆ 0   ┆ null │
# │ 2024-08-14 14:30:00 ┆ 0   ┆ null │
# │ 2024-08-14 14:30:00 ┆ 0   ┆ null │
# │ 2024-08-14 14:30:00 ┆ 0   ┆ null │
# └─────────────────────┴─────┴──────┘
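
As an interim workaround (a sketch, assuming column "a" never legitimately contains NaN), you can map any NaN produced by the aggregation back to null afterwards:

(
    df.group_by('datetime', 'cat', maintain_order=True)
      .agg(pl.col('a').mean())
      # fill_nan(None) converts NaN to null, matching the filtered output above.
      .with_columns(pl.col('a').fill_nan(None))
      .tail()
)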

(It may be worth changing to a more specific title, e.g. "mean introduces NaN values inside agg", or something along those lines.)