ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.43k stars 544 forks source link

Add `Backend.upsert` #5391

Open jcrist opened 1 year ago

jcrist commented 1 year ago

Currently ibis supports create_table for creating (or replacing) a table, and insert for inserting new data into a table. Another common operation is upsert, which is supported by some backends. This is like insert, but also supports updating existing rows.

jcrist commented 1 year ago

cc @prmoore77 - someone mentioned in passing that you might have some use cases for upsert support.

cpcloud commented 1 year ago

This might be useful, but let's wait until someone requests it.

mw3i commented 6 months ago

this would be a really nice feature to have. it would basically mean the only sql library my team needs to use is ibis -- rather than the current awkward mix of sqlalchemy + dataset (for upserting) + ibis

lostmygithubaccount commented 6 months ago

reopening for triage, thanks for the input @mw3i!

NickCrews commented 4 months ago

I have a data web app that has historically only been used for analysis using ibis, but I am about to need to add UPDATE features into it. I would love to stay inside the ibis ecosystem.

@mw3i do you have any other tips or resources I should look at as I go down this path?

prmoore77 commented 4 months ago

cc @prmoore77 - someone mentioned in passing that you might have some use cases for upsert support.

hey @jcrist - yes, Upsert, or MERGE as it is known in quite a few databases is a very useful ETL tool. Having that functionality in back-ends which support the MERGE DML statement would be very useful, and likely get more folks to use Ibis for ETL needs. Thanks!

lostmygithubaccount commented 4 months ago

I think this and some related work in ensuring Ibis DDL things work w/ PyArrow tables and batches would be great and help close the loop. then you can run queries from one backend and use the results to populate tables in another, e.g. PySpark -> Postgres or Trino -> Clickhouse as we've heard recently

mw3i commented 1 month ago

I have a data web app that has historically only been used for analysis using ibis, but I am about to need to add UPDATE features into it. I would love to stay inside the ibis ecosystem.

@mw3i do you have any other tips or resources I should look at as I go down this path?

Sorry for the late reply. Since ibis doesn't have upsert, our team uses the dataset library for inserts/updates/upserts. We do all our normal code in ibis, and then updates/upserts are done with:

import dataset

params = f"{dialect}://{user}:{password}@{url}:{port}/{database}" # <-- sqlalchemy database uri string
with dataset.connect(params) as dbx:
    dbx[tablename].upsert_many(
        list_of_dictionaries, # <-- each dict is a db entry
        column_to_use_as_identifier,
    )

It's not ideal but it lets us use ibis for everything except db writes (and ibis is just too good of a library to give up)

NickCrews commented 1 month ago

Thank you @mw3i, that looks great! Really if we implement #9186 and #9185, this would make it very easy to just basically pawn off this responsibility to users to do themselves with an external library. Same thing for other feature requests that come in: the easier we make it to interoperate with other libs, the less pressure there is for us to implement it ourselves.

mw3i commented 1 month ago

Fair point about interoperability (anything to reduce internal developer workload is a plus); though the dataset library in particular might be idiosyncratic (idk how many other libraries expect data to be formatted that way)

Given ibis leverages sqlalchemy already, it might not be too hard to maintain just 2 functions (insert and update, where update has an 'upsert' argument) that implement sqlalchemy inserts and upserts in the background without needing to make specific functions for each backend (though apologies if I'm being naive here)

I've seen some relative straightforward gists that implement pseudo upserts in sqlalchemy. Idk how easily something like that could fit into the framework ibis uses