Open xuJ14 opened 2 months ago
Any comment?
I can reproduce your example.
(df.group_by('datetime', 'cat', maintain_order=True)
.mean()
.tail()
)
# shape: (5, 3)
# ┌─────────────────────┬─────┬─────┐
# │ datetime ┆ cat ┆ a │
# │ --- ┆ --- ┆ --- │
# │ datetime[μs] ┆ cat ┆ f64 │
# ╞═════════════════════╪═════╪═════╡
# │ 2024-08-14 10:30:00 ┆ 0 ┆ NaN │
# │ 2024-08-14 11:00:00 ┆ 0 ┆ NaN │
# │ 2024-08-14 11:30:00 ┆ 0 ┆ NaN │
# │ 2024-08-14 14:00:00 ┆ 0 ┆ NaN │
# │ 2024-08-14 14:30:00 ┆ 0 ┆ NaN │
# └─────────────────────┴─────┴─────┘
Some things I noticed: casting the Categorical column to String makes the NaN go away (the result is null instead).
(df.group_by('datetime', pl.col('cat').cast(pl.String), maintain_order=True)
.mean()
.tail()
)
# shape: (5, 3)
# ┌─────────────────────┬─────┬──────┐
# │ datetime ┆ cat ┆ a │
# │ --- ┆ --- ┆ --- │
# │ datetime[μs] ┆ str ┆ f64 │
# ╞═════════════════════╪═════╪══════╡
# │ 2024-08-14 10:30:00 ┆ 0 ┆ null │
# │ 2024-08-14 11:00:00 ┆ 0 ┆ null │
# │ 2024-08-14 11:30:00 ┆ 0 ┆ null │
# │ 2024-08-14 14:00:00 ┆ 0 ┆ null │
# │ 2024-08-14 14:30:00 ┆ 0 ┆ null │
# └─────────────────────┴─────┴──────┘
It does not happen with .over():
df.with_columns(pl.col("a").mean().over('datetime', 'cat')).tail()
# shape: (5, 3)
# ┌─────────────────────┬─────┬──────┐
# │ datetime ┆ cat ┆ a │
# │ --- ┆ --- ┆ --- │
# │ datetime[μs] ┆ cat ┆ f64 │
# ╞═════════════════════╪═════╪══════╡
# │ 2024-08-14 14:30:00 ┆ 0 ┆ null │
# │ 2024-08-14 14:30:00 ┆ 0 ┆ null │
# │ 2024-08-14 14:30:00 ┆ 0 ┆ null │
# │ 2024-08-14 14:30:00 ┆ 0 ┆ null │
# │ 2024-08-14 14:30:00 ┆ 0 ┆ null │
# └─────────────────────┴─────┴──────┘
(It may be worth changing to a more specific title, e.g. "mean introduces NaN values inside agg" - or something along those lines.)
Reproducible example
data: test.parquet.zip
Snippet 1:
Output: (last few rows)
The last few values in column "a" are NaN, which should not happen because the column contains no NaN values.
However, if you apply a filter:
Output:
All the NaNs are now null (which is the expected behavior).
Log output
Expected behavior
There should be no NaN in this example.
Installed versions