Closed ThomasMAhern closed 1 month ago
Just for information, this is a limitation of the regex Rust crate that is used under the hood by str.extract_all(). From the docs:
pattern: A valid regular expression pattern, compatible with the regex crate.
And this crate doesn't support look-ahead/behind. From the crate's docs:
The regex syntax supported by this crate is similar to other regex engines, but it lacks several features that are not known how to implement efficiently. This includes, but is not limited to, look-around and backreferences.
There is also the fancy-regex crate, which could be used instead. For patterns without look-around it delegates to the regex crate.
Ahh, thanks, didn't realize that. That's good to know.
This is probably a good candidate for a plugin https://marcogorelli.github.io/polars-plugins-tutorial/prerequisites/
Yes, we would rather see this in a plugin. Most of the time you don't even need lookahead/behind, and we prefer to be limited to efficiently executable regexes in the default API.
One last tip: you can often get the same effect as a look-around by chaining a replace. For instance, make ' year' part of the search pattern and then replace it with ''.
import polars as pl

(
    pl.DataFrame({
        'column':
            ''.join(['The year 2023 is expected to be a good year for technology. ',
                     'The average income was $50,000 in 2020 year."'])
    })
    .with_columns(
        extract = pl.col('column')
            .str.extract_all(r'\b\d{4} year\b')  # match the year together with the trailing ' year'
            .list.eval(pl.element().str.replace_all(" year", ""))  # then strip ' year' from each match
    )
)
shape: (1, 2)
┌─────────────────────────────────────────────────────┬───────────┐
│ column                                              ┆ extract   │
│ ---                                                 ┆ ---       │
│ str                                                 ┆ list[str] │
╞═════════════════════════════════════════════════════╪═══════════╡
│ The year 2023 is expected to be a good year for te… ┆ ["2020"]  │
└─────────────────────────────────────────────────────┴───────────┘
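To check that the match-then-strip trick returns what a real look-ahead would, here is a small comparison using Python's standard re module, which does support look-around, as a reference. It is only a sketch: the names TEXT, with_lookahead and with_workaround are made up for illustration, and it assumes Python's re and the Rust regex crate agree on this simple pattern.

```python
import re

# Same sample sentence as in the Polars example above.
TEXT = (
    'The year 2023 is expected to be a good year for technology. '
    'The average income was $50,000 in 2020 year."'
)

# Reference: a positive look-ahead, which the Rust regex crate rejects.
with_lookahead = re.findall(r'\b\d{4}(?= year\b)', TEXT)

# Workaround: match ' year' as part of the pattern, then strip it again.
with_workaround = [
    match.replace(' year', '')
    for match in re.findall(r'\b\d{4} year\b', TEXT)
]

print(with_lookahead)   # ['2020']
print(with_workaround)  # ['2020']
```

The two approaches can differ when matches would overlap, because the workaround consumes ' year' while a look-ahead does not.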
With a little more effort you can even emulate a negative look-ahead. For instance, if you wanted to match r'\b\d{4}(?! year\b)' you could do
print(
    pl.DataFrame({
        'column':
            ''.join(['The year 2023 is expected to be a good year for technology. ',
                     'The average income was $50,000 in 2020 year."'])
    })
    .with_columns(
        extract = pl.col('column')
            .str.extract_all(r'\b\d{4} .{4}')  # instead of ' year', match a space and any 4 characters
            .list.eval(
                pl.when(~pl.element().str.contains(' year'))  # check whether the match contains ' year'
                .then(pl.element().str.slice(0, 4))  # if it doesn't, take the 4-character slice
                # implicitly return null if it does contain ' year'
            )
            .list.drop_nulls()  # drop nulls
    )
)
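The same kind of cross-check works for the negative case, again using Python's re module as the look-around reference. This is a sketch under assumptions: emulate_negative_lookahead and the relaxed pattern are hypothetical names and variations, not part of Polars or of the snippet above. It also shows one caveat of the fixed-width pattern: a year with fewer than five characters after it is never matched, so it is silently dropped.

```python
import re

TEXT = (
    'The year 2023 is expected to be a good year for technology. '
    'The average income was $50,000 in 2020 year."'
)

def emulate_negative_lookahead(text, pattern=r'\b\d{4} .{4}'):
    """Match a year plus some trailing text, drop hits containing ' year', keep the year."""
    return [m[:4] for m in re.findall(pattern, text) if ' year' not in m]

# Reference result from a real negative look-ahead.
reference = re.findall(r'\b\d{4}(?! year\b)', TEXT)
emulated = emulate_negative_lookahead(TEXT)
print(reference, emulated)  # ['2023'] ['2023']

# Caveat: a year at the very end of the string has no 5 trailing characters.
tail = 'Founded in 1999'
strict = emulate_negative_lookahead(tail)  # misses the year
# Hypothetical relaxation: make the trailing part optional and shorter.
relaxed = emulate_negative_lookahead(tail, r'\b\d{4}(?: .{0,4})?')
print(strict, relaxed)  # [] ['1999']
```

The relaxed pattern only uses a non-capturing group and bounded repetition, so it should also be accepted by the regex crate, but that is worth verifying against your Polars version.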
Description
I get:
ComputeError: regex error: regex parse error: error: unrecognized flag
when trying this regex (and other variations), likely due to the look-ahead and look-behind. Would it be possible for Polars to support this?
Example: