pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

str.split() should support regex #4819

Open indigoviolet opened 2 years ago

indigoviolet commented 2 years ago

Problem Description

I want to tokenize a string column, and there are multiple split characters; I believe my current options are to

It would be nicer to have rsplit or regex support in split itself (contains and replace both already support regex).

It would also be nice to have list-flattening support (i.e. not explode, but taking a nested list and unnesting it in place).

deanm0000 commented 1 year ago

As a workaround, can you replace the regex with something static and then split on that?

Like with_columns(pl.col(yourcol).str.replace_all(r'\d{1,2}', '|D|D|D|D').str.split('|D|D|D|D')) (note replace_all rather than replace, since replace only substitutes the first match).

cmdlineluser commented 1 year ago

Just bumped into this.

Workaround was to use .extract_all() then .replace() which is mostly equivalent.

df = pl.DataFrame({
   "data": [ "AB one ABB two ABBBBBB three ABBBBBBBB"]
})

pattern = r"AB+"

df.select(
   pl.col("data")
     .str.extract_all(rf".*?({pattern}|$)")
     .arr.eval(
        pl.all().str.replace(pattern, ""),
        parallel=True)
)

shape: (1, 1)
┌──────────────────────────────┐
│ data                         │
│ ---                          │
│ list[str]                    │
╞══════════════════════════════╡
│ ["", " one ", ... " three "] │
└──────────────────────────────┘

Seems like it could be useful if it worked like the other .extract() / .replace() methods with a literal: bool option to disable regex matching.

evbo commented 1 year ago

python split works a bit differently than polars split: consecutive split characters are collapsed in the former. In Python, 'hello    world' becomes ['hello', 'world'] if you split on whitespace, whereas in polars there would be an empty list entry for each extra space. At times it is helpful to handle multiple split characters in a row, though.

cmdlineluser commented 1 year ago

@evbo That's only if you do not supply a sep is it not?

'hello    world'.split() # sep=None
# ['hello', 'world']

'hello    world'.split(' ')
# ['hello', '', '', '', 'world']

pl.select(pl.lit('hello    world').str.split(' ')).item()
# shape: (5,)
# Series: '' [str]
# [
#   "hello"
#   ""
#   ""
#   ""
#   "world"
# ]

evbo commented 1 year ago

@cmdlineluser thanks, I should have clarified: for the Rust API this is not currently (documented as) supported. If you try to pass lit(Null {}) to split, it complains that it must be a Utf8 Expr:

SchemaMismatch( ErrString( "invalid series dtype: expected Utf8, got null", ), )

TheWizier commented 10 months ago

I found this, which worked well for my case: https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.str.extract_groups.html

I did:

extract_groups(pattern).struct.rename_fields(["a", "b", "c"]).alias("fields")

and then unnest("fields").

ritchie46 commented 9 months ago

I would accept a PR on this. If we can keep the non-regex fast path.

david-waterworth commented 6 months ago

Also, the regex engine used by polars doesn't appear to support look-ahead/look-behind, which I feel is important for splitting: I often want to split on a zero-length token, for example the boundary between text and numbers.

ComputeError: regex error: regex parse error:
    .*?((?<=[a-zA-Z])(?=\d)|$)
        ^^^^
error: look-around, including look-ahead and look-behind, is not supported

Note this is part of a regex I use frequently in a Hugging Face (i.e. Rust-backed) tokenizer, so the regex engine they use supports look-around.

Edit: Hugging Face uses Oniguruma rather than the Rust regex engine: https://github.com/huggingface/tokenizers/issues/1057
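To illustrate the kind of zero-width split in question, Python's re module (which does support look-around) can split between letters and digits without consuming any characters:

```python
import re

# Split at each zero-width boundary where a letter is immediately
# followed by a digit; the lookbehind/lookahead consume nothing.
tokens = re.split(r"(?<=[a-zA-Z])(?=\d)", "abc123def456")
print(tokens)  # ['abc', '123def', '456']
```

This is exactly the pattern the Rust regex crate rejects above, since it guarantees linear-time matching and excludes look-around.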

deanm0000 commented 6 months ago

@david-waterworth I think they picked the engine they did because look-arounds are relatively slow, as they're evaluated recursively. One could build a plugin that uses another regex engine.