kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

Support for inserting data in ibis.TableDataset #834

Open vishu1994 opened 1 month ago

vishu1994 commented 1 month ago

Description

In ETL pipelines, loading transformed data into various data warehouses is a critical requirement. Currently, the ibis.TableDataset connector in Kedro does not support data insertion into Ibis backends.

Context

Why is this change important to me?

We are developing ETL pipelines in our organization, and inserting records into data warehouses is an essential requirement. At present, without support for data insertion, we must bypass the Kedro DataCatalog and rely on external ORM tools, such as SQLAlchemy or dataset, to handle native data storage operations.

How would I use it?

Supporting data insertion in ibis.TableDataset would allow us to maintain a clean and consistent pipeline, avoiding the need for custom load operations within nodes. This would simplify the workflow and allow Kedro to manage the complete I/O process.
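For illustration, a catalog entry for such a dataset might look like the following. Note that the `mode` save argument is hypothetical; no such parameter exists in `ibis.TableDataset` today, and the backend/table names are made up:

```yaml
# Hypothetical catalog.yml entry -- `save_args.mode` is the proposed
# (not yet existing) option; everything else follows the current
# ibis.TableDataset configuration shape.
warehouse_orders:
  type: ibis.TableDataset
  table_name: orders
  connection:
    backend: duckdb
    database: warehouse.db
  save_args:
    mode: append  # insert rows instead of recreating the table
```

With something like this, a node could simply return the transformed table and let the DataCatalog handle the insert, instead of calling the warehouse client inside the node.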

How can it benefit other users?

By enabling this feature, users could avoid writing custom loading logic, thereby keeping their pipelines cleaner and more efficient. This would enhance Kedro's usability in scenarios where heavy I/O operations are involved, particularly for teams working with data warehouses or similar storage backends.

deepyaman commented 1 month ago

Sounds good! I'm going to assign you, since you've expressed interest in contributing to Kedro, and I think this is a great starting point. Happy to help provide guidance (and I think anybody on the Kedro team can also help answer questions, as this should be fairly standard to add).

ibis.TableDataset currently works by calling create_table or create_view here: https://github.com/kedro-org/kedro-plugins/blob/kedro-datasets-4.1.0/kedro-datasets/kedro_datasets/ibis/table_dataset.py#L181

You will need to figure out an ergonomic way to specify that it's going to be an "insert" operation. One possible way is to define a `mode` argument, similar to PySpark's https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.mode.html#pyspark.sql.DataFrameWriter.mode or the `mode` parameter of pandas' https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html. I feel like this would be pretty familiar to Kedro users, but I also haven't given much thought to alternatives so far. :)
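As one possible sketch of that dispatch (all names below are assumptions for discussion, not the existing API — today `ibis.TableDataset._save` only calls `create_table`/`create_view`), the save path could map a `mode` value to the backend method to invoke, e.g. routing `append` to the `insert` method that many Ibis SQL backends expose:

```python
# Hypothetical mode-dispatch helper for ibis.TableDataset._save.
# VALID_MODES and resolve_writer are illustrative names, not part of
# kedro-datasets; the mapping mirrors pyspark's DataFrameWriter.mode.

VALID_MODES = {"create", "overwrite", "append"}


def resolve_writer(mode: str) -> str:
    """Map a save mode to the backend call the dataset would make.

    - "append"            -> con.insert(name, data)  (insert rows)
    - "create"/"overwrite" -> con.create_table(name, data, overwrite=...)
    """
    if mode not in VALID_MODES:
        raise ValueError(
            f"Unknown save mode: {mode!r}; expected one of {sorted(VALID_MODES)}"
        )
    if mode == "append":
        return "insert"
    return "create_table"
```

Validating the mode eagerly (e.g. in `__init__`) would surface typos at catalog-load time rather than mid-pipeline, which is the behavior Kedro users tend to expect from dataset save args.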

Please feel free to further discuss how you want to implement it here, or raise a PR with an initial stab that we can discuss—whatever works best for you!