pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.11k stars 1.83k forks source link

Expression input not working correctly for either argument of `Expr.str.replace_many` #17863

Open henryharbeck opened 1 month ago

henryharbeck commented 1 month ago

Checks

Reproducible example

Example 1

df = pl.DataFrame({
    "text": ["123abc", "abc456"],
    "replacement": ["888", "zzz"],
})

df.with_columns(
    pl.col("text")
    .str.replace_many(["123", "abc"], pl.col("replacement"))
    .alias("replaced")
)
shape: (2, 3)
┌────────┬─────────────┬──────────┐
│ text   ┆ replacement ┆ replaced │
│ ---    ┆ ---         ┆ ---      │
│ str    ┆ str         ┆ str      │
╞════════╪═════════════╪══════════╡
│ 123abc ┆ 888         ┆ 888zzz   │ # <---- note the zzz from the below row
│ abc456 ┆ zzz         ┆ zzz456   │
└────────┴─────────────┴──────────┘

Example 2

df = pl.DataFrame({
    "text": ["123abc", "abc456"],
    "replacement": ["888", "zzz"],
})

df.with_columns(
    pl.col("text")
    .str.replace_many(["123", "abc", "yyy"], pl.col("replacement"))
    .alias("replaced")
)
# InvalidOperationError: expected the same amount of patterns as replacement strings

Example 3

df = pl.DataFrame({
    "text": ["123abc", "abc456"],
    "replace": ["123", "abc"],
})

df.with_columns(
    pl.col("text")
    .str.replace_many(pl.col("replace"), "woo")
    .alias("replaced")
)
shape: (2, 3)
┌────────┬─────────┬──────────┐
│ text   ┆ replace ┆ replaced │
│ ---    ┆ ---     ┆ ---      │
│ str    ┆ str     ┆ str      │
╞════════╪═════════╪══════════╡
│ 123abc ┆ 123     ┆ woowoo   │ # <--- this should have only replaced "123" and not "abc"
│ abc456 ┆ abc     ┆ woo456   │
└────────┴─────────┴──────────┘

Log output

No response

Issue description

It seems that expression input to the replace_with parameter in str.replace_many is not working as intended. It looks to implode the values of the column, which can lead to replacements leaking into other rows, or errors that should not be occuring.

Expected behavior

Example 1 ```python df = pl.DataFrame({ "text": ["123abc", "abc456"], "replacement": ["888", "zzz"], }) print(df.with_columns( pl.col("text") .str.replace_many(["123", "abc"], pl.col("replacement")) .alias("replaced") )) shape: (2, 3) ┌────────┬─────────────┬──────────┐ │ text ┆ replacement ┆ replaced │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str │ ╞════════╪═════════════╪══════════╡ │ 123abc ┆ 888 ┆ 888888 │ │ abc456 ┆ zzz ┆ zzz456 │ └────────┴─────────────┴──────────┘ ```
Example 2 ```python df = pl.DataFrame({ "text": ["123abc", "abc456"], "replacement": ["888", "zzz"], }) df.with_columns( pl.col("text") .str.replace_many(["123", "abc", "yyy"], pl.col("replacement")) .alias("replaced") ) shape: (2, 3) ┌────────┬─────────────┬──────────┐ │ text ┆ replacement ┆ replaced │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str │ ╞════════╪═════════════╪══════════╡ │ 123abc ┆ 888 ┆ 888888 │ │ abc456 ┆ zzz ┆ zzz456 │ └────────┴─────────────┴──────────┘ ```
Example 3 ```python df = pl.DataFrame({ "text": ["123abc", "abc456"], "replace": ["123", "abc"], }) df.with_columns( pl.col("text") .str.replace_many(pl.col("replace"), "woo") .alias("replaced") ) shape: (2, 3) ┌────────┬─────────┬──────────┐ │ text ┆ replace ┆ replaced │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str │ ╞════════╪═════════╪══════════╡ │ 123abc ┆ 123 ┆ wooabc │ │ abc456 ┆ abc ┆ woo456 │ └────────┴─────────┴──────────┘ ```

Installed versions

``` --------Version info--------- Polars: 1.2.1 Index type: UInt32 Platform: Linux-6.1.85+-x86_64-with-glibc2.35 Python: 3.10.12 (main, Mar 22 2024, 16:50:05) [GCC 11.4.0] ----Optional dependencies---- adbc_driver_manager: cloudpickle: 2.2.1 connectorx: deltalake: fastexcel: fsspec: 2023.6.0 gevent: great_tables: hvplot: matplotlib: 3.7.1 nest_asyncio: 1.6.0 numpy: 1.25.2 openpyxl: 3.1.5 pandas: 2.0.3 pyarrow: 14.0.2 pydantic: 2.8.2 pyiceberg: sqlalchemy: 2.0.31 torch: 2.3.1+cu121 xlsx2csv: xlsxwriter: ```
cmdlineluser commented 1 month ago

Isn't that the reason for its existence?

It's supposed to perform multiple replacements in each string, i.e. as if you did:

df.with_columns(
    pl.col("text")
    .str.replace_many(["123", "abc"], ["888", "zzz"])
    .alias("replaced")
)

# shape: (2, 3)
# ┌────────┬─────────────┬──────────┐
# │ text   ┆ replacement ┆ replaced │
# │ ---    ┆ ---         ┆ ---      │
# │ str    ┆ str         ┆ str      │
# ╞════════╪═════════════╪══════════╡
# │ 123abc ┆ 888         ┆ 888zzz   │
# │ abc456 ┆ zzz         ┆ zzz456   │
# └────────┴─────────────┴──────────┘

(Perhaps some simpler examples could be added to the docs?)

henryharbeck commented 1 month ago

Yes, but what if I wanted to provide a different replacement value for many string patterns per row? Otherwise, it seems kind of pointless to accept expression arguments.

How should I otherwise achieve the expected result?


df = pl.DataFrame({
    "text": ["123abc", "abc456"],
    "replacement": ["888", "zzz"],
})

df.with_columns(
    pl.col("text")
    .str.replace_many(["123", "abc"], pl.col("replacement"))
    .alias("replaced")
)

# Actual
shape: (2, 3)
┌────────┬─────────────┬──────────┐
│ text   ┆ replacement ┆ replaced │
│ ---    ┆ ---         ┆ ---      │
│ str    ┆ str         ┆ str      │
╞════════╪═════════════╪══════════╡
│ 123abc ┆ 888         ┆ 888zzz   │ # <---- note the zzz from the below row
│ abc456 ┆ zzz         ┆ zzz456   │
└────────┴─────────────┴──────────┘

# Expected
shape: (2, 3)
┌────────┬─────────────┬──────────┐
│ text   ┆ replacement ┆ replaced │
│ ---    ┆ ---         ┆ ---      │
│ str    ┆ str         ┆ str      │
╞════════╪═════════════╪══════════╡
│ 123abc ┆ 888         ┆ 888888   │
│ abc456 ┆ zzz         ┆ zzz456   │
└────────┴─────────────┴──────────┘
cmdlineluser commented 1 month ago

Yeah, you want something like:

df.with_columns(
    pl.col("text").str.replace(pl.col("text").head(3), pl.col("replacement"))
      .alias("replaced")
)
# ComputeError: dynamic pattern length in 'str.replace' expressions is not supported yet

Whereas replace_many is for doing similar to:

pl.col("text")
  .str.replace("123", "888")
  .str.replace("abc", "zzz")
henryharbeck commented 1 month ago

I don't think I am after pl.col("text").head(3) as I would like to replace both "123" and "abc" on each row. My apologies if I didn't get that across well.

Like so:

df.with_columns(
    pl.col("text")
    .str.replace_many(["123", "abc"], "888")
    .alias("replaced")
)

shape: (2, 3)
┌────────┬─────────────┬──────────┐
│ text   ┆ replacement ┆ replaced │
│ ---    ┆ ---         ┆ ---      │
│ str    ┆ str         ┆ str      │
╞════════╪═════════════╪══════════╡
│ 123abc ┆ 888         ┆ 888888   │
│ abc456 ┆ zzz         ┆ 888456   │
└────────┴─────────────┴──────────┘

but instead of replacing both of the strings with "888" for every row, I would like the replacement value (for both "123" and "abc") to be based on pl.col("replacement") (i.e. the value in the given row)

henryharbeck commented 1 month ago

To be honest, I just came across this when trying to give expression inputs to Expr.str.replace_many. I don't really need to do this type of replacement, but was more so seeing what was possible. If expressions are supported here, the behaviour definitely seems unexpected to me.

Looking at Series.str.replace_many, expressions are not supported, so perhaps the type signature for Expr.str.replace_many is more generous than what is (currently) supported...? 🤷‍♂️

Same thing goes for str.contains_any with the different signatures between Series and Expr

deanm0000 commented 1 month ago

I really don't know if this is how it is intended to work but this gets your intended result (at least for example 1 and 2)

def cust_replace_many(source: pl.Expr|str, patterns:list, replace_with:pl.Expr|str):
    if isinstance(source, str):
        source=pl.col(source)
    if isinstance(replace_with, str):
        replace_with=pl.col(replace_with)
    for x in patterns:
        source=source.str.replace(x, replace_with)
    return source
df.with_columns(replaced=cust_replace_many('text', ['123','abc'], 'replacement'))
shape: (2, 3)
┌────────┬─────────────┬──────────┐
│ text   ┆ replacement ┆ replaced │
│ ---    ┆ ---         ┆ ---      │
│ str    ┆ str         ┆ str      │
╞════════╪═════════════╪══════════╡
│ 123abc ┆ 888         ┆ 888888   │
│ abc456 ┆ zzz         ┆ zzz456   │
└────────┴─────────────┴──────────┘

If you monkey patch it to pl.Expr then it works off of the source column

pl.Expr.cust_replace_many=cust_replace_many
df.with_columns(replaced=pl.col('text').cust_replace_many(['123','abc'], 'replacement'))
shape: (2, 3)
┌────────┬─────────────┬──────────┐
│ text   ┆ replacement ┆ replaced │
│ ---    ┆ ---         ┆ ---      │
│ str    ┆ str         ┆ str      │
╞════════╪═════════════╪══════════╡
│ 123abc ┆ 888         ┆ 888888   │
│ abc456 ┆ zzz         ┆ zzz456   │
└────────┴─────────────┴──────────┘

For example 3 you could do this:

(
    df
    .group_by('replace', maintain_order=True)
    .agg(
        pl.col('text').first(),
        replaced=pl.col('text').str.replace_many('replace','woo').first()
        )
)
shape: (2, 3)
┌─────────┬────────┬──────────┐
│ replace ┆ text   ┆ replaced │
│ ---     ┆ ---    ┆ ---      │
│ str     ┆ str    ┆ str      │
╞═════════╪════════╪══════════╡
│ 123     ┆ 123abc ┆ woowoo   │
│ abc     ┆ abc456 ┆ woo456   │
└─────────┴────────┴──────────┘

Actually the group_by approach works for example 1 too

(
    df
    .group_by('replacement', maintain_order=True)
    .agg(
        pl.col('text').first(),
         replaced=pl.col('text').str.replace_many(["123", "abc"],pl.col("replacement")).first()
         )
)
shape: (2, 3)
┌─────────────┬────────┬──────────┐
│ replacement ┆ text   ┆ replaced │
│ ---         ┆ ---    ┆ ---      │
│ str         ┆ str    ┆ str      │
╞═════════════╪════════╪══════════╡
│ 888         ┆ 123abc ┆ 888zzz   │
│ zzz         ┆ abc456 ┆ zzz456   │
└─────────────┴────────┴──────────┘
cmdlineluser commented 1 month ago

I would like the replacement value to be based on the value in the given row

You can mimic the behaviour with a window expression:

df.with_columns(
    pl.col("text")
    .str.replace_many(["123", "abc"], pl.col("replacement").first())
    .over("replacement")
    .alias("replaced")
)
shape: (2, 3)
┌────────┬─────────────┬──────────┐
│ text   ┆ replacement ┆ replaced │
│ ---    ┆ ---         ┆ ---      │
│ str    ┆ str         ┆ str      │
╞════════╪═════════════╪══════════╡
│ 123abc ┆ 888         ┆ 888888   │
│ abc456 ┆ zzz         ┆ zzz456   │
└────────┴─────────────┴──────────┘

But that specific task is just not what replace_many is "designed" for. (Or perhaps I am mistaken?)

henryharbeck commented 1 month ago

Thank you both for being forthcoming with suggestions.

I will say I am still confused about how expressions are intended to be used (if at all) in Expr.str.replace_many It does not seem like they work "out of the box" unless the column being referred to only has 1 unique value (at which point you might as well use a string literal)

Example 2 works when a string literal is passed as the 2nd argument instead of an expression.

I am starting to question whether expressions were intended to be supported at all? Or whether the signature was intended to be more like Series.str.replace_many

henryharbeck commented 4 weeks ago

@ritchie46 / @stinodego - would I be able to ask for your input here?

It looks like string methods generally have the same or a very similar function signature between Series and Expr objects. The aho-corasick methods extract_many and contains_any are exceptions to this. Were they intended to work expressions?

The Expr function signature allows expressions (although the behaviour is buggy IMO as outlined in this issue). The Series function signature does not accept expressions.

ritchie46 commented 4 weeks ago

It is as expected. It takes an expression that evaluates to a series of String type. That string series ARE the replacement values.

We can add the functionality to also accept List<String> types. That would then mean different replacement values per row.