kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0
91 stars 84 forks source link

Support for Upsert functionality in ibis.TableDataset #835

Open vishu1994 opened 1 week ago

vishu1994 commented 1 week ago

Description

In ETL pipelines, updating the existing records in data warehouses is a critical requirement. Currently, the ibis.TableDataset connector in Kedro does not support Upsert() into Ibis backends like Postgres in our case. After discussions in the community , I found that ibis doesn't offers Upsert()natively for any of the backends .

Context

Why is this change important to me?

We are developing ETL pipelines in our organization, and updating existing records in sql backend like Postgres is an essential requirement. At present, without support for upsert(), we must bypass the Kedro DataCatalog and rely on external ORM tools to handle native data storage operations, such as SQLAlchemy , dataset etc .

How would I use it?

Supporting data upsert() in ibis.TableDataset would allow us to maintain a clean and consistent pipeline, avoiding the need for custom load operations within nodes. This would simplify the workflow and allow Kedro to manage the complete I/O process.

How can it benefit other users?

By enabling this feature, users could avoid writing custom loading logic for update operations, thereby keeping their pipelines cleaner and more efficient. This would enhance Kedro's usability in scenarios where heavy I/O operations are involved, particularly for teams working with data warehouses or similar storage backends.

deepyaman commented 1 week ago

Makes sense! Cleanly supporting this will definitely require https://github.com/ibis-project/ibis/issues/5391 (will see if this can be prioritized on the Ibis side in the coming quarter, as there has definitely been a number of requests for this functionality now); otherwise, will need to do something hacky with raw_sql in the interim.

In the Kedro world (and Spark, pandas, etc.), it seems like a write mode is the most common way to express this behavior (see https://github.com/kedro-org/kedro-plugins/issues/542#issuecomment-1981483776, https://github.com/kedro-org/kedro-plugins/blob/kedro-datasets-4.1.0/kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py#L34 for some Kedro-Datasets examples). As such, I would propose we wait for #834 to establish this functionality, and then exposing the future upsert implementation from Ibis should be straightforward. 🤞

If somebody needs this functionality before Ibis implements it, we can probably find a way to hack it in with raw_sql, as mentioned above, or at least share instructions for creating a custom version of this dataset to do so.