pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

FR: accept a function in the arg `value` in `replace()` and `replace_all()` #12110

Open etiennebacher opened 10 months ago

etiennebacher commented 10 months ago

Description

In R, the package stringr can take a function as its replacement value and will apply it on the matches:

stringr::str_replace_all("hello", "[aeiou]", toupper)
#> [1] "hEllO"

Would it be possible to implement this behavior in polars? If not, do you have some ideas on the most efficient way to do this?

cmdlineluser commented 10 months ago

Interesting topic.

Regex::replace_all can take a closure, but I don't have enough Rust knowledge to determine whether that could somehow be exposed via Polars (seems unlikely?).
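For comparison, Python's standard-library `re.sub` also accepts a function as the replacement, which gives the same behaviour as the stringr example on a single string. The sketch below is plain Python, not a Polars API; the helper name `replace_all_with` is made up for illustration.

```python
import re


def replace_all_with(text, pattern, func):
    """Replace every regex match in `text` with `func(matched_text)`.

    Hypothetical helper: it relies only on the fact that re.sub accepts
    a callable, which receives a match object for each match.
    """
    return re.sub(pattern, lambda m: func(m.group(0)), text)


result = replace_all_with("hello", "[aeiou]", str.upper)
print(result)
```

Applied to a column, this would have to run once per value through a Python-level callback, so it would probably be much slower than a native expression.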

The only approach I can think of is to extract the matches/non-matches together and then retest them individually.

Is it currently possible to do this in a simpler way?

import polars as pl

pattern = r"[aeiou]."

(pl.select(pl.lit("hello").str.extract_all(f"{pattern}|."))
   .with_columns(
      pl.all().list.eval(
          pl.when(pl.element().str.contains(pattern))
            .then(pl.element().str.to_uppercase())
            .otherwise(pl.element())
      )
   )
)

# The predicate 'col("").str.contains([Utf8([aeiou].)])' in 'when->then->otherwise' 
# is not a valid aggregation and might produce a different number of rows than the group_by operation would. 
# This behavior is experimental and may be subject to change
# shape: (1, 1)
# ┌───────────────────────┐
# │ literal               │
# │ ---                   │
# │ list[str]             │
# ╞═══════════════════════╡
# │ ["h", "EL", "l", "o"] │
# └───────────────────────┘

(I think the warning is a false-positive: #10055)

etiennebacher commented 10 months ago

Thanks. I have no idea how hard that would be to implement in Rust, or whether it would be more efficient.

(Btw, there's a typo in your code, pattern = r"[aeiou]." should be pattern = r"[aeiou]")

cmdlineluser commented 10 months ago

It was just to use a pattern that matches more than a single character for the more general case.

For single characters specifically, .str.explode() could be used instead of .extract_all().
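To make the `f"{pattern}|."` trick easier to follow, here is a plain-Python sketch of the same idea: the alternation splits the string into tokens that are either a full match or a single leftover character, and each token is then retested against the pattern. It uses the standard `re` module rather than Polars, and `replace_via_tokens` is a hypothetical name.

```python
import re


def replace_via_tokens(text, pattern, func):
    """Split `text` into match / single-character tokens, then transform matches."""
    # The pattern is tried first at each position; "." picks up anything else.
    tokenizer = re.compile(f"{pattern}|.")
    tokens = [m.group(0) for m in tokenizer.finditer(text)]
    # Retest each token individually, as str.contains does in the Polars version.
    matcher = re.compile(pattern)
    return [func(tok) if matcher.search(tok) else tok for tok in tokens]


tokens = replace_via_tokens("hello", r"[aeiou].", str.upper)
joined = "".join(tokens)
print(tokens, joined)
```

One caveat of this sketch: with Python's default flags `.` does not match a newline, so newline characters would be dropped from the token list. The Polars version may need the same care.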

deanm0000 commented 10 months ago

Using @cmdlineluser's code, you can wrap that in a function like this:

def replace_all_func(col, pattern, func):
    """Like str.replace_all, but the replacement is computed by an expression.

    Parameters:
    col: an expression that evaluates to a string column
    pattern: a valid regex pattern
    func: an expression built on pl.element(), for example
        pl.element().str.to_uppercase()
    """
    extract_list = col.str.extract_all(f"{pattern}|.")
    return extract_list.list.eval(
            pl.when(pl.element().str.contains(pattern))
            .then(func)
            .otherwise(pl.element())
    ).list.join('')

You can then either call it directly:

pl.select(replace_all_func(pl.lit("hello"), "[aeiou]", pl.element().str.to_uppercase()))

or monkey patch it onto pl.Expr:

pl.Expr.replace_all_func = replace_all_func

and then you can do

pl.select(pl.lit("hello").replace_all_func("[aeiou]",pl.element().str.to_uppercase() ))

I'm not sure how to monkey patch it to pl.Expr.str since pl.Expr.str.replace_all_func=replace_all_func doesn't work. You could put it in its own namespace using this