vishu1994 opened 2 months ago
Makes sense! Cleanly supporting this will definitely require https://github.com/ibis-project/ibis/issues/5391 (will see if this can be prioritized on the Ibis side in the coming quarter, as there have definitely been a number of requests for this functionality now); otherwise, we will need to do something hacky with `raw_sql` in the interim.
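For the curious, a minimal sketch of what that interim `raw_sql` hack could look like against a Postgres backend; the connection details, the `events` target table, and the `events_staging` staging table are all placeholders for illustration:

```python
import ibis

# Connection details are placeholders.
con = ibis.postgres.connect(database="warehouse")

# Assume the new rows have already been staged in `events_staging` (e.g. via
# con.create_table); emulate upsert with Postgres's INSERT ... ON CONFLICT.
con.raw_sql(
    """
    INSERT INTO events (id, value)
    SELECT id, value FROM events_staging
    ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value
    """
)
```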
In the Kedro world (and Spark, pandas, etc.), it seems like a write mode is the most common way to express this behavior (see https://github.com/kedro-org/kedro-plugins/issues/542#issuecomment-1981483776 and https://github.com/kedro-org/kedro-plugins/blob/kedro-datasets-4.1.0/kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py#L34 for some Kedro-Datasets examples). As such, I would propose we wait for #834 to establish this functionality; exposing the future upsert implementation from Ibis should then be straightforward. 🤞
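To make that concrete, here is a hypothetical sketch of what such a write mode might look like on `TableDataset` once #834 lands; the `mode: "upsert"` save argument does not exist today and is purely illustrative:

```python
from kedro_datasets.ibis import TableDataset

dataset = TableDataset(
    table_name="events",  # placeholder table
    connection={"backend": "postgres", "database": "warehouse"},
    save_args={"mode": "upsert"},  # assumed future option, not yet supported
)
```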
If somebody needs this functionality before Ibis implements it, we can probably find a way to hack it in with `raw_sql`, as mentioned above, or at least share instructions for creating a custom version of this dataset to do so.
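A rough sketch of what such a custom dataset could look like, assuming a Postgres backend; the conflict key and column names are placeholders, the staging table is assumed to match the target schema, and this leans on `TableDataset` internals (`self._table_name`, `self.connection`) that could change:

```python
import ibis
from kedro_datasets.ibis import TableDataset


class UpsertTableDataset(TableDataset):
    """Hacky interim dataset that emulates upsert via raw_sql."""

    def _save(self, data: ibis.Table) -> None:
        # Stage the incoming rows, then merge them into the target table
        # with Postgres's INSERT ... ON CONFLICT.
        staging = f"{self._table_name}_staging"
        self.connection.create_table(staging, data, overwrite=True)
        self.connection.raw_sql(
            f"""
            INSERT INTO {self._table_name}
            SELECT * FROM {staging}
            ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value
            """
        )
```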
Description
In ETL pipelines, updating existing records in data warehouses is a critical requirement. Currently, the `ibis.TableDataset` connector in Kedro does not support `upsert()` into Ibis backends such as Postgres (our backend). After discussions in the community, I found that Ibis doesn't offer `upsert()` natively for any of its backends.
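As a sketch of the gap, the Ibis backend API today offers append and overwrite writes but nothing upsert-shaped (connection details and the `events` table below are assumed for illustration):

```python
import ibis

con = ibis.postgres.connect(database="warehouse")  # placeholder connection
new_rows = ibis.memtable({"id": [1, 2], "value": ["a", "b"]})

con.insert("events", new_rows)                        # append-only
con.create_table("events", new_rows, overwrite=True)  # full overwrite
# There is no con.upsert(...); that is what ibis-project/ibis#5391 tracks.
```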
Context
Why is this change important to me?
We are developing ETL pipelines in our organization, and updating existing records in a SQL backend like Postgres is an essential requirement. At present, without support for `upsert()`, we must bypass the Kedro `DataCatalog` and rely on external ORM tools, such as `SQLAlchemy` or `dataset`, to handle data storage operations natively, as in the sketch below.
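For illustration, a minimal sketch of the kind of node-level workaround this forces on us, using SQLAlchemy's Postgres dialect; the connection string, `events` table, and column names are assumptions:

```python
import pandas as pd
from sqlalchemy import MetaData, Table, create_engine
from sqlalchemy.dialects.postgresql import insert

engine = create_engine("postgresql://user:pass@host/warehouse")  # placeholder
events = Table("events", MetaData(), autoload_with=engine)


def upsert_events(df: pd.DataFrame) -> None:
    # Emulate upsert with Postgres's INSERT ... ON CONFLICT, keyed on `id`.
    stmt = insert(events).values(df.to_dict(orient="records"))
    stmt = stmt.on_conflict_do_update(
        index_elements=["id"],
        set_={"value": stmt.excluded.value},
    )
    with engine.begin() as conn:
        conn.execute(stmt)
```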
How would I use it?
Supporting `upsert()` in `ibis.TableDataset` would allow us to maintain a clean and consistent pipeline, avoiding the need for custom load operations within nodes. This would simplify the workflow and allow Kedro to manage the complete I/O process.
How can it benefit other users?
By enabling this feature, users could avoid writing custom loading logic for update operations, thereby keeping their pipelines cleaner and more efficient. This would enhance Kedro's usability in scenarios where heavy I/O operations are involved, particularly for teams working with data warehouses or similar storage backends.