pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add `polars.Expr.list.drop_nans()` #16736

Open FriedLabJHU opened 4 months ago

FriedLabJHU commented 4 months ago

Description

`Expr.list` has no method for removing NaN values that works like `Expr.list.drop_nulls()`. This would be nice to have, since `Expr.drop_nans()` already exists.

The only workaround appears to involve `Expr.list.eval`, as shown below:

## Expected behavior
import polars as pl

df = pl.DataFrame({"a": [[None, 2.0, 4.0], [5.0, 2.0, 1.0]]}, strict=False)
>>>
┌──────────────────┐
│ a                │
│ ---              │
│ list[f64]        │
╞══════════════════╡
│ [null, 2.0, 4.0] │
│ [5.0, 2.0, 1.0]  │
└──────────────────┘

df.with_columns(
    pl.col("a").list.drop_nulls().alias("b")
)
>>>
┌──────────────────┬─────────────────┐
│ a                ┆ b               │
│ ---              ┆ ---             │
│ list[f64]        ┆ list[f64]       │
╞══════════════════╪═════════════════╡
│ [null, 2.0, 4.0] ┆ [2.0, 4.0]      │
│ [5.0, 2.0, 1.0]  ┆ [5.0, 2.0, 1.0] │
└──────────────────┴─────────────────┘

## Drop NaN is not supported
df = pl.DataFrame({"a": [[float("nan"), 2.0, 4.0], [5.0, 2.0, 1.0]]}, strict=False)
>>>
┌─────────────────┐
│ a               │
│ ---             │
│ list[f64]       │
╞═════════════════╡
│ [NaN, 2.0, 4.0] │
│ [5.0, 2.0, 1.0] │
└─────────────────┘

print(df.with_columns(
    pl.col("a").list.drop_nans().alias("b")
))

>>> 
AttributeError: 'ExprListNameSpace' object has no attribute 'drop_nans'

## Workaround
df.with_columns(
    pl.col("a").list.eval(pl.element().drop_nans()).alias("b")
)
>>>
┌─────────────────┬─────────────────┐
│ a               ┆ b               │
│ ---             ┆ ---             │
│ list[f64]       ┆ list[f64]       │
╞═════════════════╪═════════════════╡
│ [NaN, 2.0, 4.0] ┆ [2.0, 4.0]      │
│ [5.0, 2.0, 1.0] ┆ [5.0, 2.0, 1.0] │
└─────────────────┴─────────────────┘

jeroenjanssens commented 4 months ago

This also holds for DataFrame and LazyFrame. Here's an overview:

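| Object | `.drop_nulls()` | `.drop_nans()` | `.has_nulls()` | `.has_nans()` |
| --- | --- | --- | --- | --- |
| DataFrame | | | | |
| LazyFrame | | | | |
| Series | | | | |
| Expr | | | | |
| List | | | | |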

FriedLabJHU commented 4 months ago

These all seem like useful additions and would likely be necessary for a v1.0 release. Thank you, @jeroenjanssens, for this table. I will work on PRs for these in the meantime.