Closed alexander-beedie closed 1 month ago
As well as being generally useful, I think this would also offer a currently-missing piece of the puzzle towards closing #3775. I was looking at the regex library, and it would be possible to return named capture groups as a struct. This would allow a single-pass regex to return a {name: capture, ...} struct, optionally followed by a breakout into individual columns (named after the capture groups). That is a nice speedup (the regex executes only once, not once per capture), and it would be a very clean syntax.
E.g. it should then be possible to write something like...
pl.col("iso_code").str.extract_captures(REGEX).struct.fields("*")
...assuming I added a new extract_captures method in Rust (which I will happily look into) that returns capture groups in struct form (which would be a handy return format to work with in this case).
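The single-pass idea can be illustrated outside polars with Python's standard re module: one match yields every named group at once as a dict, which is roughly the shape the proposed struct would have. This is only a sketch. The pattern, the sample values, and the pure-Python extract_captures helper are made up for illustration and are not the proposed Rust method.

```python
import re

# Hypothetical pattern for values such as "GB-ENG"; the group names would
# become the struct field names (and, after breakout, the column names).
ISO_REGEX = re.compile(r"(?P<country>[A-Z]{2})-(?P<subdivision>[A-Z0-9]{1,3})")


def extract_captures(value):
    """Run the regex once and return all named captures as {name: capture}."""
    match = ISO_REGEX.search(value)
    if match is None:
        # No match: keep the same keys so every row has the same "fields".
        return {name: None for name in ISO_REGEX.groupindex}
    return match.groupdict()


rows = [extract_captures(v) for v in ["GB-ENG", "US-CA", "n/a"]]
print(rows)
```

Each input is scanned once regardless of how many groups the pattern defines, which is where the speedup over one extraction per capture group would come from.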
This would be great. For example, you could call value_counts in lazy mode and get a Lazy/DataFrame:
(
df
.lazy()
.select(pl.col("values").value_counts())
.select(pl.col("values").struct.field("*"))
)
At present, doing value_counts in lazy mode requires a separate extraction of the struct fields, which is a bit tedious.
Also: there is a struct.fields method to get field names on a Series, but not on expressions. It would be good to have this in expressions as well.
I did also notice an initial pl.unnest() implementation (https://github.com/pola-rs/polars/pull/3164), spawned from https://github.com/pola-rs/polars/issues/3123.
Describe your feature request
Struct-field equivalents to the super-handy pl.col("a","b","c","d") and pl.col("*") expansions would be a really nice option because, once the struct name and/or number of desired fields to break out gets large, the amount of boilerplate and number of separate calls/expressions goes up significantly.