pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Add pl.Expr.struct.to_list() & pl.Expr.struct.to_array() #13913

Open mkleinbort-ic opened 7 months ago

mkleinbort-ic commented 7 months ago

Description

Hi,

I assumed these features had been requested - but I could not find an issue for them.

The ask is to be able to go from pl.Struct -> pl.List (or pl.Array)

I imagine pl.Expr.struct.to_list or pl.Expr.struct.to_array would be good names.

This would be the polars equivalent of a Python dictionary's .values()

Why? The current struct API does not have good support for per-element operations. For example, one can do:

col.list.eval(pl.element().pow(-1))

but the same is not available on structs.

One could do a dataframe-level unnest() followed by a pl.concat_list, but that gets tricky when multiple struct columns share the same field names.

cmdlineluser commented 7 months ago

I had previously gone looking for a .struct.values() / .struct.to_list() type function.

I assumed the reason it did not exist is because struct values are not guaranteed to be of the same type.

Would be useful for when that is the case though.

mkleinbort-ic commented 7 months ago

I'd add that polars lets you create a "mixed-dtype" list; it just coerces everything to a common supertype:

import polars as pl

df_for_testing = pl.DataFrame({
    'A1': [1,2,3],
    'A2': [2,8,7],
    'B1':[[2.2,4.4,6.6],[12,16,14],[152,257,252]],
    'B2':[[1,1],[1,2],[2,1]],
    'C1':['Hello', 'World','!'],
    'C2':['The Test', 'Was For', 'The Takers']
}).with_columns([
    pl.struct(['A1','A2']).alias('As'),
    pl.struct(['B1','B2']).alias('Bs'),
]).with_columns(
    D1 = pl.col('C1').cast(pl.Categorical), 
    D2 = pl.col('C2').cast(pl.Categorical)
)

shape: (3, 10)
┌─────┬─────┬──────────────────┬───────────┬───┬───────────┬──────────────────┬───────┬────────────┐
│ A1  ┆ A2  ┆ B1               ┆ B2        ┆ … ┆ As        ┆ Bs               ┆ D1    ┆ D2         │
│ --- ┆ --- ┆ ---              ┆ ---       ┆   ┆ ---       ┆ ---              ┆ ---   ┆ ---        │
│ i64 ┆ i64 ┆ list[f64]        ┆ list[i64] ┆   ┆ struct[2] ┆ struct[2]        ┆ cat   ┆ cat        │
╞═════╪═════╪══════════════════╪═══════════╪═══╪═══════════╪══════════════════╪═══════╪════════════╡
│ 1   ┆ 2   ┆ [2.2, 4.4, 6.6]  ┆ [1, 1]    ┆ … ┆ {1,2}     ┆ {[2.2, 4.4,      ┆ Hello ┆ The Test   │
│     ┆     ┆                  ┆           ┆   ┆           ┆ 6.6],[1, 1]}     ┆       ┆            │
│ 2   ┆ 8   ┆ [12.0, 16.0,     ┆ [1, 2]    ┆ … ┆ {2,8}     ┆ {[12.0, 16.0,    ┆ World ┆ Was For    │
│     ┆     ┆ 14.0]            ┆           ┆   ┆           ┆ 14.0],[1, 2]}    ┆       ┆            │
│ 3   ┆ 7   ┆ [152.0, 257.0,   ┆ [2, 1]    ┆ … ┆ {3,7}     ┆ {[152.0, 257.0,  ┆ !     ┆ The Takers │
│     ┆     ┆ 252.0]           ┆           ┆   ┆           ┆ 252.0],[2, 1]}   ┆       ┆            │
└─────┴─────┴──────────────────┴───────────┴───┴───────────┴──────────────────┴───────┴────────────┘

df_for_testing.select(pl.concat_list(pl.all()))

shape: (3, 1)
┌────────────────────────────┐
│ A1                         │
│ ---                        │
│ list[str]                  │
╞════════════════════════════╡
│ ["1", "2", … "The Test"]   │
│ ["2", "8", … "Was For"]    │
│ ["3", "7", … "The Takers"] │
└────────────────────────────┘

What I'm asking for is no more dangerous than

(df
  .unnest('structColumn')
  .with_columns(values= pl.concat_list(<<the_field_names_of_the_struct>>))
  .drop(<<the_field_names_of_the_struct>>)
)

deanm0000 commented 7 months ago

Until it's officially added you can monkey patch this:

# Temporary monkey patch: adds a struct_to_list method directly on pl.Expr.
pl.Expr.struct_to_list = lambda col: col.map_batches(
    lambda x: x.to_frame().unnest(x.name).select(pl.concat_list(pl.all())).to_series()
)

Then you can do

df.with_columns(pl.col('As','Bs').struct_to_list())

Note: I don't know how to monkey-patch it to the struct namespace since it's not just inserting .struct.

mkleinbort-ic commented 7 months ago

> Note: I don't know how to monkey-patch it to the struct namespace since it's not just inserting .struct.

Yes, it'd be nice to be able to monkey patch methods in the namespaces - it came up for me when trying to add a .dt.day_name()

cmdlineluser commented 7 months ago

They're in pl.expr.*.Expr*NameSpace

pl.expr.struct.ExprStructNameSpace
pl.expr.datetime.ExprDateTimeNameSpace
pl.expr.list.ExprListNameSpace

deanm0000 commented 7 months ago

That got a little tricky but I think this is it then...

# col is the namespace object here, so rebuild a pl.Expr from its private _pyexpr.
pl.expr.struct.ExprStructNameSpace.to_list = lambda col: (
    pl.Expr._from_pyexpr(col._pyexpr).map_batches(
        lambda x: x.to_frame().unnest(x.name).select(pl.concat_list(pl.all())).to_series()
    )
)

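Essentially, when you access self (or col, as I've named it) inside a namespace method, you get the namespace object rather than the expression, so you don't get access to .map_batches directly. I think rebuilding the expression from its _pyexpr is the right trick.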