pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.67k stars 1.99k forks source link

Support upserts in concat function #3310

Open jkausti opened 2 years ago

jkausti commented 2 years ago

Describe your feature request

At the moment the existing behaviour in polars for vertical concats is to append all rows from one dataframe into another. But having a upsert=true parameter would work like doing an upsert in SQL. A 'key' -parameter would most likely be needed as well to specify the column(s) used for calculating the intersect between source and target. So given two dataframes with identical schema, source and target, upsert=true would do the following things:

This is a common pattern for doing data loads into a persistent table in data warehouses, and it would be awesome if polars could support it.

rts-gordon commented 2 years ago

@jkausti

Great idea. And may I to know how to append all rows from one dataframe into another at this moment? I just want to insert a new row into a dataframe. Thank you very much.

jkausti commented 2 years ago

@jkausti

Great idea. And may I to know how to append all rows from one dataframe into another at this moment? I just want to insert a new row into a dataframe. Thank you very much.

You can use the polars.concat function. So you need to create a dataframe first of the row(s) you want to append to the other dataframe. It is documented here: [https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.concat.html]()

rts-gordon commented 2 years ago

Thanks @jkausti This is Python Polars APIs, but is there any Rust Polars APIs for concat function? I use Rust Polars for my program.

ritchie46 commented 2 years ago

Thanks @jkausti This is Python Polars APIs, but is there any Rust Polars APIs for concat function? I use Rust Polars for my program.

@CHCP can you use stackoverflow for that and keep this issue on topic?

Please use StackOverflow

Is your question related to syntax or how you could do something with the polars library? Please use stackoverflow and one of the following tags:

This allows us to create high quality answers that remain updated and will save us from answering the same questions over and over again.

If a question is not yet on stackoverflow, please create a new question and post the link here, so we are noted.

Other

If your question doesn't fit stackoverflow, feel fry to ask here. :)

DrMaphuse commented 1 year ago

I am currently implementing this using

df1.join(df2, on="col_1", how="anti").vstack(df2)

Which is really fast and does the job. But it would be nicer / cleaner to have this as integrated functionality, simply because of how common it is in data pipelines. It would reduce the potential discrepancies in people's hand-rolled solutions.

OyiboRivers commented 1 year ago

Dear Polars Team,

Firstly, I'd like to express my gratitude for the incredible Polars library. I'm thankful for the effort put into its development.

At the moment I try to optimize some i/o processes where I rewrite pandas code using polars. To update some datat I needed an upsert (update+insert) function for dataframes. In my exploration of the Polars API, I spent some time searching for functions that could perform upsert operations. While I eventually implemented a working solution, it took some effort to navigate the API reference and It took me some time to find a good solution. At last I came up with a simple one-liner like @DrMaphuse Please condider, adding such a function to the codebase.

import polars as pl

def upsert_pl(df: pl.DataFrame,
              new_df: pl.DataFrame,
              keys: str | list[str]) -> pl.DataFrame:
    """
    Update and insert rows in an initial DataFrame based on a new DataFrame.

    Args:
        df (pl.DataFrame): The initial DataFrame.
        new_df (pl.DataFrame): The new DataFrame with updates and insertions.
        keys (str, list[str]): A single key or a list of keys to identify rows.

    Returns:
        pl.DataFrame: The final DataFrame after removing matching rows from df,
                      appending new_df, and sorting by key columns.

    Remarks:
        - The function assumes that both DataFrames have the same column names.
        - The key columns for update and insertion are specified by the columns in
          the `on` parameter of the anti_join operation.
        - The function sorts the final DataFrame by key columns.

    Example:
        keys = ['id', 'category']
        df = pl.DataFrame(
            {'id': [0, 1, 2, 3],
             'category': ['Z', 'A', 'B', 'A'],
             'name': ['Aaron', 'Alice', 'Bob', 'Charlie'],
             'value': [0, 10, 20, 30]})
        new_df = pl.DataFrame(
            {'id': [2, 3, 4],
             'category': ['B', 'A', 'C'],
             'name': ['UpdatedBob', 'UpdatedCharlie', 'NewPerson'],
             'value': [925, 935, 940]})

        result = upsert_pl(df, new_df, keys)
        print('initial data:')
        print(df)
        print('new data:')
        print(new_df)
        print('upserted data:')
        print(result)
    """
    return df.join(new_df, on=keys, how='anti').vstack(new_df).sort(keys)

initial data:
shape: (4, 4)
┌─────┬──────────┬─────────┬───────┐
│ id  ┆ category ┆ name    ┆ value │
│ --- ┆ ---      ┆ ---     ┆ ---   │
│ i64 ┆ str      ┆ str     ┆ i64   │
╞═════╪══════════╪═════════╪═══════╡
│ 0   ┆ Z        ┆ Aaron   ┆ 0     │
│ 1   ┆ A        ┆ Alice   ┆ 10    │
│ 2   ┆ B        ┆ Bob     ┆ 20    │
│ 3   ┆ A        ┆ Charlie ┆ 30    │
└─────┴──────────┴─────────┴───────┘
new data:
shape: (3, 4)
┌─────┬──────────┬────────────────┬───────┐
│ id  ┆ category ┆ name           ┆ value │
│ --- ┆ ---      ┆ ---            ┆ ---   │
│ i64 ┆ str      ┆ str            ┆ i64   │
╞═════╪══════════╪════════════════╪═══════╡
│ 2   ┆ B        ┆ UpdatedBob     ┆ 925   │
│ 3   ┆ A        ┆ UpdatedCharlie ┆ 935   │
│ 4   ┆ C        ┆ NewPerson      ┆ 940   │
└─────┴──────────┴────────────────┴───────┘
upserted data:
shape: (5, 4)
┌─────┬──────────┬────────────────┬───────┐
│ id  ┆ category ┆ name           ┆ value │
│ --- ┆ ---      ┆ ---            ┆ ---   │
│ i64 ┆ str      ┆ str            ┆ i64   │
╞═════╪══════════╪════════════════╪═══════╡
│ 0   ┆ Z        ┆ Aaron          ┆ 0     │
│ 1   ┆ A        ┆ Alice          ┆ 10    │
│ 2   ┆ B        ┆ UpdatedBob     ┆ 925   │
│ 3   ┆ A        ┆ UpdatedCharlie ┆ 935   │
│ 4   ┆ C        ┆ NewPerson      ┆ 940   │
└─────┴──────────┴────────────────┴───────┘
gab23r commented 1 year ago

I think this is already feasible using update(..., how='outer')

OyiboRivers commented 1 year ago

Hey @gab23r , thank you very much for your wisdom.

df.join(new_df, on=keys, how='anti').vstack(new_df).sort(keys)

delivers exactly the same result as:

df.update(new_df, on=keys, how='outer').sort(keys)

Awesome!