pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add pl.Expr.clear() #15427

Open mkleinbort opened 6 months ago

mkleinbort commented 6 months ago

Description

I needed to return an all-null version of a column without changing its name or data type.

There are a few ways of doing this in the expression API (e.g. multiplying by None, using map_elements, etc.), but I didn't see an obvious and idiomatic way to simply return nulls.

For that use case, pl.Expr.clear() seems like the right solution.

It could then be used in the usual places...

df_sanitized = df.with_columns(cs.contains('pii_data').clear())

reswqa commented 6 months ago

Let me understand, do you mean to make all elements null here? This is not consistent with my understanding of clear without arguments. 🤔

mkleinbort-ic commented 6 months ago

I think in the expression context it makes sense for the number of rows to be implicit. But to clarify the ask, it's not obvious how to "null" a column or group of columns in polars.

import polars as pl
import polars.selectors as cs 

pl.DataFrame({
    'x1': [{'name': 'Alice'}],
    'x2': [2]
}).with_columns(
    cs.by_name('x1', 'x2').map_elements(lambda _: pl.lit(None)).name.suffix('_v1'),
    cs.by_name('x1', 'x2').add(pl.lit(None)).name.suffix('_v2'),    
)

shape: (1, 6)
┌───────────┬─────┬────────┬────────┬───────────┬───────┐
│ x1        ┆ x2  ┆ x1_v1  ┆ x2_v1  ┆ x1_v2     ┆ x2_v2 │
│ ---       ┆ --- ┆ ---    ┆ ---    ┆ ---       ┆ ---   │
│ struct[1] ┆ i64 ┆ object ┆ object ┆ struct[1] ┆ i64   │
╞═══════════╪═════╪════════╪════════╪═══════════╪═══════╡
│ {"Alice"} ┆ 2   ┆ null   ┆ null   ┆ {null}    ┆ null  │
└───────────┴─────┴────────┴────────┴───────────┴───────┘

Using map_elements does not preserve the type, and operating with null is not consistent in how it affects various types (e.g. adding pl.lit(None) does not work as I intended with structs, and crashes on list-type columns). Using .replace does not work well either.

Do you know a better way to tell polars to convert all values in one or more columns to null while keeping the schema unchanged?

reswqa commented 6 months ago

How about pl.when(pl.col("a").is_null()).then(pl.col("a")).otherwise(None)? The schema should be the same as truthy expr. But I didn't think more carefully about the multi-column case.

cmdlineluser commented 6 months ago

I had used pl.when(False).then(pl.all()) for this, but it doesn't work with your example:

# PanicException: not implemented for dtype Object("object", Some(object-registry))

I had thought it would be handier if .clear() defaulted to the .height instead of 0 (but maybe it doesn't make sense for the LazyFrame case?)

mkleinbort-ic commented 6 months ago

> I had thought it would be handier if .clear() defaulted to the .height instead of 0 but maybe it doesn't make sense for the LazyFrame case.

I think the .clear on the expression side (e.g. pl.col('x').clear()) would have to evaluate to a full-length column of nulls, same as pl.lit(None).

Also, I think we can all agree:

pl.when(False).then(pl.all())

is 10% genius and 90% a hack around a missing API.

Also, my current solution is:

columns_to_null = ['a', 'b', 'c']
df.with_columns(pl.lit(None, dtype=dtype).alias(c) for c, dtype in df.schema.items() if c in columns_to_null)