pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Ability for str.extract_all to return only capture groups #16602

Open lmocsi opened 3 months ago

lmocsi commented 3 months ago

Description

As of polars==0.20.30

It would be nice if str.extract_all could be told to behave like str.extract and return only the capture groups. (In the example below, data1 and data2 should return the same value.)

import polars as pl

df = pl.DataFrame({'a': 'Label: name, Value: John, Label: car, Value: Ford'})

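# Lazily capture everything between "Label" (optionally followed by ":") and " Value:"
pattern = "Label:?(.*?) Value:"
df.with_columns(
    pl.col('a').str.extract(pattern, 1).alias('data1'),  # capture group 1 of the first match
    pl.col('a').str.extract_all(pattern).list.get(0).alias('data2')  # full text of the first match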
)
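
To make the difference concrete outside Polars, here is a minimal sketch using Python's standard `re` module on the same string and pattern. It is only an analogy: Polars uses a Rust regex engine, and the variable names `full_matches` and `group_matches` are illustrative, not part of any Polars API.

```python
import re

text = "Label: name, Value: John, Label: car, Value: Ford"
pattern = r"Label:?(.*?) Value:"

# Full text of every match, analogous to what str.extract_all returns today.
full_matches = [m.group(0) for m in re.finditer(pattern, text)]

# Capture group 1 of every match, which is what this request asks for.
group_matches = [m.group(1) for m in re.finditer(pattern, text)]

print(full_matches)
print(group_matches)
```

The second list holds only the captured text, with the surrounding "Label:" and " Value:" stripped from each match.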

avimallu commented 3 months ago

How do you propose nested capture groups are handled? Do you return a full list of all subgroups?

I'd suggest just using str.replace for this, seems more appropriate.
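
For reference, one existing convention for the nested-group question is the one Python's `re.findall` follows: with more than one group it returns a tuple per match, ordered by group number, so outer groups come before the groups nested inside them. The sketch below only illustrates that convention. The nested pattern is a hypothetical variation on the one in the issue, and nothing here implies Polars would adopt the same shape.

```python
import re

text = "Label: name, Value: John, Label: car, Value: Ford"

# Hypothetical nested pattern: group 1 is the outer capture (" name,"),
# group 2 is the word nested inside it ("name").
nested_pattern = r"Label:?( (\w+),) Value:"

# With several groups, re.findall yields one tuple per match,
# with the groups in order of their opening parenthesis.
nested_groups = re.findall(nested_pattern, text)

print(nested_groups)
```

A list-of-structs or list-of-lists column would be the closest dataframe analogue of this tuple-per-match shape.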

lmocsi commented 1 month ago

> How do you propose nested capture groups are handled? Do you return a full list of all subgroups?
>
> I'd suggest just using str.replace for this, seems more appropriate.

@avimallu Can you give a modified code snippet of the above code for your suggestion?

cmdlineluser commented 1 month ago

Perhaps something like:

df.with_columns(
    pl.col('a').str.extract_all(pattern).list.eval(pl.element().str.replace(pattern, '$1'))
)

# shape: (1, 1)
# ┌─────────────────────┐
# │ a                   │
# │ ---                 │
# │ list[str]           │
# ╞═════════════════════╡
# │ [" name,", " car,"] │
# └─────────────────────┘

lmocsi commented 3 weeks ago

This actually does that, but I would not consider it simple:

df.with_columns(
  pl.col('a').str.extract_all(pattern).list.eval(pl.element().str.replace(pattern, '$1').get(0))
)

#┌────────────┐
#│ a          │
#│ ---        │
#│ list[str]  │
#╞════════════╡
#│ [" name,"] │
#└────────────┘

cmdlineluser commented 3 weeks ago

Yeah, it's not ideal - perhaps there is a simpler workaround.

Update: It seems this behaviour has also been flagged as a bug:

extract_all_groups is an alternative feature request: