Open jkausti opened 2 years ago
@jkausti
Great idea. And may I to know how to append all rows from one dataframe into another at this moment? I just want to insert a new row into a dataframe. Thank you very much.
@jkausti
Great idea. And may I to know how to append all rows from one dataframe into another at this moment? I just want to insert a new row into a dataframe. Thank you very much.
You can use the polars.concat function. So you need to create a dataframe first of the row(s) you want to append to the other dataframe. It is documented here: [https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.concat.html]()
Thanks @jkausti This is Python Polars APIs, but is there any Rust Polars APIs for concat function? I use Rust Polars for my program.
Thanks @jkausti This is Python Polars APIs, but is there any Rust Polars APIs for concat function? I use Rust Polars for my program.
@CHCP can you use stackoverflow for that and keep this issue on topic?
Is your question related to syntax or how you could do something with the polars library? Please use stackoverflow and one of the following tags:
This allows us to create high quality answers that remain updated and will save us from answering the same questions over and over again.
If a question is not yet on stackoverflow, please create a new question and post the link here, so we are noted.
If your question doesn't fit stackoverflow, feel fry to ask here. :)
I am currently implementing this using
df1.join(df2, on="col_1", how="anti").vstack(df2)
Which is really fast and does the job. But it would be nicer / cleaner to have this as integrated functionality, simply because of how common it is in data pipelines. It would reduce the potential discrepancies in people's hand-rolled solutions.
Dear Polars Team,
Firstly, I'd like to express my gratitude for the incredible Polars library. I'm thankful for the effort put into its development.
At the moment I try to optimize some i/o processes where I rewrite pandas code using polars. To update some datat I needed an upsert (update+insert) function for dataframes. In my exploration of the Polars API, I spent some time searching for functions that could perform upsert operations. While I eventually implemented a working solution, it took some effort to navigate the API reference and It took me some time to find a good solution. At last I came up with a simple one-liner like @DrMaphuse Please condider, adding such a function to the codebase.
import polars as pl
def upsert_pl(df: pl.DataFrame,
new_df: pl.DataFrame,
keys: str | list[str]) -> pl.DataFrame:
"""
Update and insert rows in an initial DataFrame based on a new DataFrame.
Args:
df (pl.DataFrame): The initial DataFrame.
new_df (pl.DataFrame): The new DataFrame with updates and insertions.
keys (str, list[str]): A single key or a list of keys to identify rows.
Returns:
pl.DataFrame: The final DataFrame after removing matching rows from df,
appending new_df, and sorting by key columns.
Remarks:
- The function assumes that both DataFrames have the same column names.
- The key columns for update and insertion are specified by the columns in
the `on` parameter of the anti_join operation.
- The function sorts the final DataFrame by key columns.
Example:
keys = ['id', 'category']
df = pl.DataFrame(
{'id': [0, 1, 2, 3],
'category': ['Z', 'A', 'B', 'A'],
'name': ['Aaron', 'Alice', 'Bob', 'Charlie'],
'value': [0, 10, 20, 30]})
new_df = pl.DataFrame(
{'id': [2, 3, 4],
'category': ['B', 'A', 'C'],
'name': ['UpdatedBob', 'UpdatedCharlie', 'NewPerson'],
'value': [925, 935, 940]})
result = upsert_pl(df, new_df, keys)
print('initial data:')
print(df)
print('new data:')
print(new_df)
print('upserted data:')
print(result)
"""
return df.join(new_df, on=keys, how='anti').vstack(new_df).sort(keys)
initial data:
shape: (4, 4)
┌─────┬──────────┬─────────┬───────┐
│ id ┆ category ┆ name ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════╪══════════╪═════════╪═══════╡
│ 0 ┆ Z ┆ Aaron ┆ 0 │
│ 1 ┆ A ┆ Alice ┆ 10 │
│ 2 ┆ B ┆ Bob ┆ 20 │
│ 3 ┆ A ┆ Charlie ┆ 30 │
└─────┴──────────┴─────────┴───────┘
new data:
shape: (3, 4)
┌─────┬──────────┬────────────────┬───────┐
│ id ┆ category ┆ name ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════╪══════════╪════════════════╪═══════╡
│ 2 ┆ B ┆ UpdatedBob ┆ 925 │
│ 3 ┆ A ┆ UpdatedCharlie ┆ 935 │
│ 4 ┆ C ┆ NewPerson ┆ 940 │
└─────┴──────────┴────────────────┴───────┘
upserted data:
shape: (5, 4)
┌─────┬──────────┬────────────────┬───────┐
│ id ┆ category ┆ name ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════╪══════════╪════════════════╪═══════╡
│ 0 ┆ Z ┆ Aaron ┆ 0 │
│ 1 ┆ A ┆ Alice ┆ 10 │
│ 2 ┆ B ┆ UpdatedBob ┆ 925 │
│ 3 ┆ A ┆ UpdatedCharlie ┆ 935 │
│ 4 ┆ C ┆ NewPerson ┆ 940 │
└─────┴──────────┴────────────────┴───────┘
I think this is already feasible using update(..., how='outer')
Hey @gab23r , thank you very much for your wisdom.
df.join(new_df, on=keys, how='anti').vstack(new_df).sort(keys)
delivers exactly the same result as:
df.update(new_df, on=keys, how='outer').sort(keys)
Awesome!
Describe your feature request
At the moment the existing behaviour in polars for vertical concats is to append all rows from one dataframe into another. But having a upsert=true parameter would work like doing an upsert in SQL. A 'key' -parameter would most likely be needed as well to specify the column(s) used for calculating the intersect between source and target. So given two dataframes with identical schema, source and target, upsert=true would do the following things:
This is a common pattern for doing data loads into a persistent table in data warehouses, and it would be awesome if polars could support it.