pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`str.replace_many` to take a dictionary that defines a replacement mapping. #17220

Open avhz opened 2 weeks ago

avhz commented 2 weeks ago

Description

Currently the str.replace_many method takes two lists for the original and replacement strings.

import polars

df = polars.DataFrame({
    "A": ["a", "b", "c", "d", "e"],
})

df.with_columns(
    polars.col("A").str.replace_many(
        patterns=["a", "b", "c"],
        replace_with=["x", "y", "z"],
    )
)
Original:

┌─────┐
│ A   │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ c   │
│ d   │
│ e   │
└─────┘

Result:

┌─────┐
│ A   │
│ --- │
│ str │
╞═════╡
│ x   │
│ y   │
│ z   │
│ d   │
│ e   │
└─────┘

It would be handy to be able to pass a single dictionary that defines the replacement mapping:

map = {
    "a": "x",
    "b": "y",
    "c": "z",
}

df.with_columns(
    polars.col("A").str.replace_many(map)
)

Arengard commented 2 weeks ago

Just use `replace`:

map = {"a": "x", "b": "y", "c": "z"}

df.with_columns(polars.col("A").replace(map))

stinodego commented 2 weeks ago

Good suggestion - we can do the same 'trick' as we do in replace.

avhz commented 2 weeks ago

Is there a way to make this work with regex?

I have tried something like:

import regex
import polars

df = polars.DataFrame({
    "x": ["a", "b", "c", "1", "2", "3"],
})

map = {
    regex.compile(r"[a-z]"): "alpha",
    regex.compile(r"[0-9]"): "digit",
}

df.with_columns(
    polars.col("x").replace(map)
)

For my personal use case, I need to replace a large number of regex patterns, and it's not very ergonomic to use two lists because it can be hard to keep track of what is replacing what.

Another possibility is a list of tuples:

map = [
    (r"[a-z]", "alpha"),
    (r"[0-9]", "digit"),
]

This (in my opinion) is nicer to follow than something like:

old = [r"[a-z]", r"[0-9]"]
new = ["alpha", "digit"]

stinodego commented 2 weeks ago

A dict is just going to be syntactic sugar. You can just define your map and then call str.replace_many(map.keys(), map.values()).

avhz commented 2 weeks ago

That gives me:

TypeError: cannot create expression literal for value of type dict_keys: 
...

Hint: Pass `allow_object=True` to accept any value and create a literal of type Object.

stinodego commented 2 weeks ago

Then call list() on each input. I'm just saying: this is syntactic sugar that you can write yourself. You don't need us to take care of it, though it would be nice if we did.
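
A minimal sketch of that conversion, using the mapping from the issue description. `split_mapping` is a hypothetical helper name, not part of Polars; the commented-out call at the end assumes the `df` from the first example and that `str.replace_many` accepts two plain lists, as shown there.

```python
# Hypothetical helper (not a Polars API): split a replacement mapping into
# the two parallel lists that str.replace_many is called with in this thread.
def split_mapping(mapping: dict[str, str]) -> tuple[list[str], list[str]]:
    # dicts preserve insertion order, so keys and values stay aligned.
    return list(mapping.keys()), list(mapping.values())


mapping = {"a": "x", "b": "y", "c": "z"}
patterns, replace_with = split_mapping(mapping)

# Assumed usage with the DataFrame from the issue description:
# df.with_columns(
#     polars.col("A").str.replace_many(patterns, replace_with)
# )
```

Passing `mapping.keys()` directly fails because a `dict_keys` view is not a list, which is what the `TypeError` above reports.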

avhz commented 2 weeks ago

I must be missing something, as none of the following work for me:

## ============================================================================

import regex
import polars

## ============================================================================

df = polars.DataFrame(
    {
        "x": ["a", "b", "c", "1", "2", "3"],
    }
)

## ============================================================================

map = {
    regex.compile(r"[a-z]"): "alpha",
    regex.compile(r"[0-9]"): "digit",
}

df.with_columns(polars.col("x").replace(map))
df.with_columns(polars.col("x").replace(map.keys(), map.values()))
df.with_columns(polars.col("x").replace(list(map.keys()), list(map.values())))

## ============================================================================

map = {
    r"[a-z]": "alpha",
    r"[0-9]": "digit",
}

df.with_columns(polars.col("x").str.replace_many(map))
df.with_columns(polars.col("x").str.replace_many(map.keys(), map.values()))
df.with_columns(polars.col("x").str.replace_many(list(map.keys()), list(map.values())))

## ============================================================================

All of these throw an exception except the third (the one calling list()), which runs without error but does not match the regex patterns, so I am left with the original DataFrame.

cmdlineluser commented 2 weeks ago

There are a few different issues:

  1. You're passing regex.compile() objects - which Polars does not understand.

Polars uses the Rust crate https://github.com/rust-lang/regex - so you must pass "strings" when using the regex functions.

  2. .str.replace_many does not work with regular expressions. (Perhaps a note could be added to the docs?)

It uses https://github.com/BurntSushi/aho-corasick which works with "literal strings" only.
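
On the first point, a compiled pattern object keeps its source string in its `.pattern` attribute, so a mapping keyed by compiled patterns can be converted back to one keyed by plain strings. The sketch below uses the standard-library `re` module as a stand-in for the third-party `regex` package, and `compiled_map` and `string_map` are illustrative names. Python and Rust regex syntax differ in places (the Rust crate has no look-around or backreferences), so not every pattern carries over unchanged.

```python
import re

# Compiled-pattern keys, as in the attempt above
# (stdlib `re` is used here in place of `regex`).
compiled_map = {
    re.compile(r"[a-z]"): "alpha",
    re.compile(r"[0-9]"): "digit",
}

# `.pattern` is the source string the pattern object was compiled from.
string_map = {
    compiled.pattern: replacement
    for compiled, replacement in compiled_map.items()
}
```

The resulting strings are still regular expressions, so they only help with Polars methods that interpret patterns as regex, not with `str.replace_many`.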

It sounds like you may really be asking for: