indigoviolet opened 2 years ago
As a workaround, can you replace the regex with something static and then split on that? Like:
with_column(pl.col(yourcol).str.replace_all(r'\d{1,2}', '|D|D|D|D').str.split('|D|D|D|D'))
Just bumped into this. Workaround was to use .extract_all() then .replace(), which is mostly equivalent.
df = pl.DataFrame({
"data": [ "AB one ABB two ABBBBBB three ABBBBBBBB"]
})
pattern = r"AB+"
df.select(
pl.col("data")
.str.extract_all(rf".*?({pattern}|$)")
.arr.eval(
pl.all().str.replace(pattern, ""),
parallel=True)
)
shape: (1, 1)
┌──────────────────────────────┐
│ data │
│ --- │
│ list[str] │
╞══════════════════════════════╡
│ ["", " one ", ... " three "] │
└──────────────────────────────┘
Seems like it could be useful if it worked like the other .extract() / .replace() methods, with a literal: bool option to disable regex matching.
python split works a bit differently than polars split, in that multiple consecutive split characters are collapsed in the former. In python, 'hello    world' becomes ['hello', 'world'] if you split on whitespace, whereas in polars there would be an empty list entry for each extra space. At times it is helpful to handle multiple split characters in a row, though.
@evbo That's only if you do not supply a sep, is it not?
'hello    world'.split() # sep=None
# ['hello', 'world']
'hello    world'.split(' ')
# ['hello', '', '', '', 'world']
pl.select(pl.lit('hello    world').str.split(' ')).item()
# shape: (5,)
# Series: '' [str]
# [
# "hello"
# ""
# ""
# ""
# "world"
# ]
@cmdlineluser thanks, I should have clarified: for the Rust API this is not currently (documented as) supported. If you try to pass lit(Null {}) to split it will complain that it must be a Utf8 Expr:
SchemaMismatch(ErrString("invalid series dtype: expected Utf8, got null"))
I found this which worked well for my case:
https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.str.extract_groups.html
I did:
extract_groups(pattern).struct.rename_fields(["a", "b", "c"]).alias("fields")
and then
unnest("fields")
I would accept a PR on this. If we can keep the non-regex fast path.
Also, the regex parser used by polars doesn't appear to support look-ahead/look-behind, which I feel is important for splitting - i.e. I often want to split on a zero-length token, for example between text and numbers, etc.
ComputeError: regex error: regex parse error:
.*?((?<=[a-zA-Z])(?=\d)|$)
^^^^
error: look-around, including look-ahead and look-behind, is not supported
Note this is part of a regex I use frequently in a huggingface (i.e. rust-backed) tokenizer, so the regex engine they use supports look-around.
Edit: huggingface uses Oniguruma rather than the rust regex engine - https://github.com/huggingface/tokenizers/issues/1057
@david-waterworth I think they picked the one they did because look arounds are relatively slow as they're recursive. One could build a plugin that used the other regex engine.
Problem Description
I want to tokenize a string column, and there are multiple split characters; I believe my current options are to use .apply(), multiple explode() / str.split passes, or flatten() and str.split().
It would be nicer to have rsplit or regex support in split itself (contains, replace both already support it). It would also be nice to have list-flattening support (i.e. not explode, but taking a nested list and making it unnested).