pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Expression-level expansion for multiple `struct.field` names, as exists for `pl.col` #3859

Closed alexander-beedie closed 1 month ago

alexander-beedie commented 1 year ago

Describe your feature request

Struct-field equivalents to the super-handy pl.col("a","b","c","d") and pl.col("*") expansions would be a really nice option: once the struct name gets long and/or the number of fields to break out gets large, the amount of boilerplate and the number of separate calls/expressions goes up significantly.

# currently (all explicit):
df.select([
    pl.col("id"),
    pl.col("struct_name").struct.field("a"),
    pl.col("struct_name").struct.field("b"),
    pl.col("struct_name").struct.field("c"),
    pl.col("struct_name").struct.field("d"),
    pl.col("struct_name").struct.field("e"),
    pl.col("struct_name").struct.field("f"),
    pl.col("struct_name").struct.field("g"),
    pl.col("struct_name").struct.field("h"),
])

# ideally (expanded for the given fields):
df.select([
    pl.col("id"),
    pl.col("struct_name").struct.field("a","b","c","d","e","f","g","h"),
])

# and (expanded for all fields):
df.select([
    pl.col("id"),
    pl.col("struct_name").struct.field("*"),
])

alexander-beedie commented 1 year ago

As well as being generally useful, I think this would also offer a currently-missing piece of the puzzle towards closing #3775; I was looking at the regex library, and it would be possible to return named capture groups as a struct. This would allow for optimal application of a single-pass regex, returning a {name:capture, ...} struct, followed by optional breakout into individual columns (named after the capture groups). Nice speedup (ensures the regex only executes once, not once per capture), and it would be a very clean syntax.

e.g. it should then be possible to write something like...

pl.col("iso_code").str.extract_captures(REGEX).struct.field("*")

...assuming I added a new extract_captures method in Rust (which I will happily look into) that returns capture groups in struct form (which would be a handy return format to work with in this case).
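
To illustrate the single-pass idea outside polars, here is a rough sketch using Python's standard `re` module: one match yields every named capture group at once as a {name: capture} mapping, which is the shape a struct return value would have. The pattern and the function name `extract_captures` are hypothetical and only mirror the proposal; this is not the Rust implementation.

```python
import re

# Hypothetical pattern with two named groups (assumed for illustration).
ISO_RE = re.compile(r"(?P<country>[A-Z]{2})-(?P<region>[A-Z0-9]{1,3})")


def extract_captures(pattern, value):
    """Run the regex once and return all named groups as a dict.

    On no match, every group name maps to None, so the result always
    has the same keys (like a struct with a fixed set of fields).
    """
    match = pattern.search(value)
    if match is None:
        return dict.fromkeys(pattern.groupindex)
    return match.groupdict()


captures = extract_captures(ISO_RE, "US-CA")
no_match = extract_captures(ISO_RE, "bad")
```

The regex runs once per input value regardless of how many groups it defines; breaking the result out into columns is then a separate, optional step.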

braaannigan commented 1 year ago

This would be great. As one example, you could do value_counts in lazy mode and get a flat LazyFrame/DataFrame back:

(
    df
    .lazy()
    .select(pl.col("values").value_counts())
    .select(pl.col("values").struct.field("*"))
)

At present, doing value_counts in lazy mode requires an extraction of the struct fields that's a bit tedious.

braaannigan commented 1 year ago

Also: there is a struct.fields method to get field names on a Series, but not on expressions. It would be good to have this for expressions as well.

sm-Fifteen commented 10 months ago

#10179, the PR for extract_groups, has now been merged, but extracting the fields from the resulting struct still involves either struct unnesting or selecting each field one by one in a second pass. While this is mainly an ergonomics enhancement, it's likely to start seeing more demand in the coming releases.

cmdlineluser commented 10 months ago

I did also notice an initial pl.unnest() implementation, https://github.com/pola-rs/polars/pull/3164, spawned from https://github.com/pola-rs/polars/issues/3123