pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Regex look-ahead/behind support #18286

Closed ThomasMAhern closed 1 month ago

ThomasMAhern commented 1 month ago

Description

I get "ComputeError: regex error: regex parse error: error: unrecognized flag" when trying this regex (and other variations), most likely because of the look-ahead and look-behind.

Would it be possible for polars to support this?

Example:

import polars as pl

# Raises ComputeError: the look-ahead (?= year\b) is rejected by the regex engine
(pl.DataFrame({'column': 'The year 2023 is expected to be a good year for technology. The average income was $50,000 in 2020 year."'})
 .with_columns(extract = pl.col('column').str.extract_all(r'\b\d{4}(?= year\b)'))
)
etiennebacher commented 1 month ago

Just for information, this is a limitation of the regex Rust crate that is used under the hood by str.extract_all(). From the docs:

pattern: A valid regular expression pattern, compatible with the regex crate.

And this crate doesn't support look-ahead/behind. From the crate's docs:

The regex syntax supported by this crate is similar to other regex engines, but it lacks several features that are not known how to implement efficiently. This includes, but is not limited to, look-around and backreferences.
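
To make the limitation concrete, here is a small sketch using Python's standard re module, which does support look-around. It is only an illustration and not what Polars runs internally; the variable names are made up for the example. It compares the look-ahead pattern from the issue with a look-around-free rewrite that matches ' year' as well but captures only the digits.

```python
import re

text = (
    "The year 2023 is expected to be a good year for technology. "
    'The average income was $50,000 in 2020 year."'
)

# Look-ahead version: accepted by Python's re, rejected by the Rust regex crate.
with_lookahead = re.findall(r"\b\d{4}(?= year\b)", text)

# Look-around-free rewrite: match " year" too, but capture only the digits.
with_group = re.findall(r"\b(\d{4}) year\b", text)

print(with_lookahead)
print(with_group)
```

Both lists contain the same single year here. On the Polars side, str.extract_all returns whole matches rather than capture groups, so a rewrite like this would need str.extract or str.extract_groups (first match only), or a post-processing step on the extracted list.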

ion-elgreco commented 1 month ago

There is the fancy-regex crate, which could be used. For patterns without look-around, it delegates to the regex crate.

ThomasMAhern commented 1 month ago

Ahh, thanks, didn't realize that. That's good to know.

deanm0000 commented 1 month ago

This is probably a good candidate for a plugin https://marcogorelli.github.io/polars-plugins-tutorial/prerequisites/

orlp commented 1 month ago

Yes, we would rather see this in a plugin. Most of the time you don't even need lookahead/behind, and we prefer to be limited to efficiently executable regexes in the default API.

deanm0000 commented 1 month ago

One last tip: you can often get the same effect as a look-around by chaining a replace. For instance, let ' year' be part of the search pattern and then replace it with ''.

import polars as pl

(
    pl.DataFrame({
        'column':
            ''.join(['The year 2023 is expected to be a good year for technology. ',
                     'The average income was $50,000 in 2020 year."'])
    })
    .with_columns(
        extract = pl.col('column')
        .str.extract_all(r'\b\d{4} year\b')  # match the trailing ' year' as well
        .list.eval(pl.element().str.replace_all(" year", ""))  # then strip it from each match
    )
)
shape: (1, 2)
┌─────────────────────────────────────────────────────┬───────────┐
│ column                                              ┆ extract   │
│ ---                                                 ┆ ---       │
│ str                                                 ┆ list[str] │
╞═════════════════════════════════════════════════════╪═══════════╡
│ The year 2023 is expected to be a good year for te… ┆ ["2020"]  │
└─────────────────────────────────────────────────────┴───────────┘
deanm0000 commented 1 month ago

With a little more effort you can even emulate a negative look-ahead. For instance, if you wanted to match r'\b\d{4}(?! year\b)', you could do:

import polars as pl

print(
    pl.DataFrame({
        'column':
            ''.join(['The year 2023 is expected to be a good year for technology. ',
                     'The average income was $50,000 in 2020 year."'])
    })
    .with_columns(
        extract = pl.col('column')
        .str.extract_all(r'\b\d{4} .{4}')  # instead of ' year', match a space and any 4 characters
        .list.eval(
            pl.when(~pl.element().str.contains(' year'))  # keep matches that do not contain ' year'
            .then(pl.element().str.slice(0, 4))  # and take their first 4 characters
            # matches that do contain ' year' implicitly become null
        )
        .list.drop_nulls()  # drop those nulls
    )
)
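
The match-then-filter steps above can be mirrored in plain Python to check them against a real negative look-ahead. This is a sketch using the standard re module, not Polars, and the variable names are illustrative only.

```python
import re

text = (
    "The year 2023 is expected to be a good year for technology. "
    'The average income was $50,000 in 2020 year."'
)

# Reference result from an engine with look-around support.
expected = re.findall(r"\b\d{4}(?! year\b)", text)

# Same steps as the workaround, without look-around:
# 1. match four digits plus a space and any four characters,
# 2. drop matches whose context contains " year",
# 3. keep only the four digits.
candidates = re.findall(r"\b\d{4} .{4}", text)
emulated = [m[:4] for m in candidates if " year" not in m]

print(expected)
print(emulated)
```

The two results agree on this input. They can differ at the end of a string: a four-digit number followed by fewer than five characters never matches the widened pattern, so the workaround would miss it where a true look-ahead would not.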