Open jcrist opened 1 year ago
cc @prmoore77 - someone mentioned in passing that you might have some use cases for upsert support.
This might be useful, but let's wait until someone requests it.
this would be a really nice feature to have. it would basically mean the only sql library my team needs to use is ibis -- rather than the current awkward mix of sqlalchemy + dataset (for upserting) + ibis
reopening for triage, thanks for the input @mw3i!
I have a data web app that has historically only been used for analysis using ibis, but I am about to need to add UPDATE features into it. I would love to stay inside the ibis ecosystem.
@mw3i do you have any other tips or resources I should look at as I go down this path?
cc @prmoore77 - someone mentioned in passing that you might have some use cases for upsert support.
hey @jcrist - yes, Upsert, or MERGE
as it is known in quite a few databases is a very useful ETL tool. Having that functionality in back-ends which support the MERGE
DML statement would be very useful, and likely get more folks to use Ibis for ETL needs. Thanks!
I think this and some related work in ensuring Ibis DDL things work w/ PyArrow tables and batches would be great and help close the loop. then you can run queries from one backend and use the results to populate tables in another, e.g. PySpark -> Postgres or Trino -> Clickhouse as we've heard recently
I have a data web app that has historically only been used for analysis using ibis, but I am about to need to add UPDATE features into it. I would love to stay inside the ibis ecosystem.
@mw3i do you have any other tips or resources I should look at as I go down this path?
Sorry for the late reply. Since ibis doesn't have upsert, our team uses the dataset library for inserts/updates/upserts. We do all our normal code in ibis, and then updates/upserts are done with:
import dataset
params = f"{dialect}://{user}:{password}@{url}:{port}/{database}" # <-- sqlalchemy database uri string
with dataset.connect(params) as dbx:
dbx[tablename].upsert_many(
list_of_dictionaries, # <-- each dict is a db entry
column_to_use_as_identifier,
)
It's not ideal but it lets us use ibis for everything except db writes (and ibis is just too good of a library to give up)
Thank you @mw3i, that looks great! Really if we implement #9186 and #9185, this would make it very easy to just basically pawn off this responsibility to users to do themselves with an external library. Same thing for other feature requests that come in: the easier we make it to interoperate with other libs, the less pressure there is for us to implement it ourselves.
Fair point about interoperability (anything to reduce internal developer workload is a plus); though the dataset library in particular might be idiosyncratic (idk how many other libraries expect data to be formatted that way)
Given ibis leverages sqlalchemy already, it might not be too hard to maintain just 2 functions (insert and update, where update has an 'upsert' argument) that implement sqlalchemy inserts and upserts in the background without needing to make specific functions for each backend (though apologies if I'm being naive here)
I've seen some relative straightforward gists that implement pseudo upserts in sqlalchemy. Idk how easily something like that could fit into the framework ibis uses
Currently
ibis
supportscreate_table
for creating (or replacing) a table, andinsert
for inserting new data into a table. Another common operation isupsert
, which is supported by some backends. This is likeinsert
, but also supports updating existing rows.ibis
is targeted at is there a use case for this operation?Backend.upsert
support?