pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
28.67k stars 1.79k forks source link

Fill strings placeholders with another columns values #7694

Open PierreOreistein opened 1 year ago

PierreOreistein commented 1 year ago
import polars as pl

df = pl.DataFrame({
    "pre_formatted": ["string_1_{}", "string_2_{}_{}"],
    "values": [["1"], ["2", "3"]]
})
df.head()
>>> | pre_formatted    | values
    | str              | list[str]
    | "string_1_{}"    | ["1"]
    | "string_2_{}_{}"     | ["2", "3"]
df = df.with_columns(
    formatted=pl.format(pl.col("pre_formatted"), *pl.col("values"))
)
df.head()
>>> | pre_formatted    | values     | formatted
    | str              | list[str]  | str
    | "string_1_{}"    | ["1"]      | "string_1_1"
    | "string_2_{}_{}"     | ["2", "3"] | "string_2_2_3"
cmdlineluser commented 1 year ago

The following may only apply to the specific example given - ignore this if it is not relevant to your actual use-case:

The formatting here can be factored out and you can .arr.join the values.

df.with_columns(formatted =  
    pl.col("pre_formatted").str.replace("_[{].*", "_")    
    + pl.col("values").arr.join("_")
)
shape: (2, 3)
┌────────────────┬────────────┬──────────────┐
│ pre_formatted  ┆ values     ┆ formatted    │
│ ---            ┆ ---        ┆ ---          │
│ str            ┆ list[str]  ┆ str          │
╞════════════════╪════════════╪══════════════╡
│ string_1_{}    ┆ ["1"]      ┆ string_1_1   │
│ string_2_{}_{} ┆ ["2", "3"] ┆ string_2_2_3 │
└────────────────┴────────────┴──────────────┘
PierreOreistein commented 1 year ago

Thanks for the answer @cmdlineluser !

Actually, my placeholders can be placed anywhere (beginning, middle or end of the sentence) but I like your idea nonetheless, I like the approach! For now, I going through a Python UDF, it is slow but it is ok as my data are small.

Wondering if the function 'pl.format' could be expanded to cover this use case in general; where use one column containing the placeholders and another one containing the values?

ghuls commented 1 year ago

There is a solution that would almost work if there was some kind of zip operation that would be available on 2 lists:

In [32]: (
    ...:     df.with_columns(
    ...:         pl.col("pre_formatted").str.split("{}").alias("pre_formatted_splitted")
    ...:     )
    ...:     .with_columns(
    ...:         pl.col("pre_formatted_splitted").arr.concat(pl.col("values")).alias("concat")
    ...:     )
    ...:     .with_columns(
    ...:         pl.col("concat").arr.join("").alias("join")
    ...:     )
    ...: )
Out[32]: 
shape: (2, 5)
┌────────────────┬────────────┬────────────────────────┬───────────────────────────┬──────────────┐
│ pre_formatted  ┆ values     ┆ pre_formatted_splitted ┆ concat                    ┆ join         │
│ ---            ┆ ---        ┆ ---                    ┆ ---                       ┆ ---          │
│ str            ┆ list[str]  ┆ list[str]              ┆ list[str]                 ┆ str          │
╞════════════════╪════════════╪════════════════════════╪═══════════════════════════╪══════════════╡
│ string_1_{}    ┆ ["1"]      ┆ ["string_1_", ""]      ┆ ["string_1_", "", "1"]    ┆ string_1_1   │
│ string_2_{}_{} ┆ ["2", "3"] ┆ ["string_2_", "_", ""] ┆ ["string_2_", "_", … "3"] ┆ string_2__23 │
└────────────────┴────────────┴────────────────────────┴───────────────────────────┴──────────────┘
deanm0000 commented 1 year ago

My first reaction is to do this with a replace rather than a split.

df = pl.DataFrame({
    "pre_formatted": ["string_1_{}", "string_2_{}_{}"],
    "values": [["1"], ["2", "3"]]
})
df=df.with_columns(formatted=pl.col('pre_formatted'), vals=pl.col('values'))
while df.filter(pl.col('formatted').str.contains("\{\}")).shape[0]>0:
    df=df.with_columns(formatted=pl.when(pl.col('vals').arr.lengths()>0)
                                .then(pl.col('formatted').str.replace("\{\}", pl.col('vals').arr.first()))
                                .otherwise(pl.col('formatted')),
                       vals=pl.col('vals').arr.slice(1))
df=df.drop('vals')
df

This uses a while loop but it loops one {} at a time rather than row by row so I'm not sure it's the cardinal sin it seems on the surface. It creates an intermediate column for each of the initial string and values. It then checks to see if any of the strings have a {} then if it does it replaces that with whatever is in vals. At the same time it deletes the first element of vals. When there are no more {} then it's done looping and removes the vals intermediate column.

Here's a split methodology that does everything in expressions.

df.with_row_count('i') \
    .with_columns(preform=pl.col('pre_formatted').str.split("{}")) \
    .with_columns(preformx=pl.col('preform').arr.take(pl.arange(0,pl.col('preform').arr.lengths()-1)), 
                  preformn=pl.col('preform').arr.last()) \
    .explode(['values','preformx']) \
    .with_columns(mid=pl.col('preformx') + pl.col('values')) \
    .groupby('i', maintain_order=True) \
    .agg(
        pl.col('pre_formatted').first(), 
        pl.col('values'), 
        mid=pl.col('mid').drop_nulls().str.concat(""), 
        preformn=pl.col('preformn').first()) \
    .select(['pre_formatted', 'values', (pl.col('mid')+pl.col('preformn')).alias('formatted')])

The split method, first assigns an index number to each row as we're going to explode later. Then it does the splitting and separates that split into the n-1 values and then one column for the last value. It does that so the split quantity and the values are equal. Then we explode by those lists. With that exploded, we can put the string portion + the value into a single column. Then we can groupby the original row index and concat them vertically to get back to the same number of rows that we started with. The last thing is to just put the last split on the string we just grouped/concatted.