Open vishu1994 opened 1 month ago
Sounds good! I'm going to assign you, since you've expressed interest in contributing to Kedro, and I think this is a great starting point. Happy to help provide guidance (and I think anybody on the Kedro team can also help answer questions, as this should be fairly standard to add).
`ibis.TableDataset` currently works by calling `create_table` or `create_view` here: https://github.com/kedro-org/kedro-plugins/blob/kedro-datasets-4.1.0/kedro-datasets/kedro_datasets/ibis/table_dataset.py#L181
You will need to figure out an ergonomic way to specify that it's going to be an "insert" operation. One possible way is to define a `mode` argument, similar to https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameWriter.mode.html#pyspark.sql.DataFrameWriter.mode or the `mode` parameter of https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html. I feel like this would be pretty familiar to Kedro users, but I also haven't given much thought to alternatives so far. :)
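To make the `mode` idea concrete, here is a rough sketch of how a save step might dispatch on it. This is not how `kedro-datasets` implements anything today: `save_table`, `VALID_MODES`, and the mode names (`"append"`, `"overwrite"`, `"error"`) are hypothetical, chosen only for illustration. It also assumes the connection object exposes `create_table(name, obj, overwrite=...)` and `insert(name, obj)`, which is how Ibis SQL backends appear to behave; worth double-checking per backend.

```python
from typing import Any

# Hypothetical mode names, loosely modelled on Spark's DataFrameWriter.mode.
VALID_MODES = ("append", "overwrite", "error")


def save_table(connection: Any, table_name: str, data: Any, mode: str = "overwrite") -> None:
    """Write `data` to `table_name`, choosing the backend call from `mode`.

    Assumes `connection` provides `create_table(name, obj, overwrite=...)`
    and `insert(name, obj)`; this is an assumption, not a confirmed API
    contract for every backend.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"Invalid mode {mode!r}; expected one of {VALID_MODES}")

    if mode == "append":
        # Add rows to an existing table instead of recreating it.
        connection.insert(table_name, data)
    elif mode == "overwrite":
        # Replace the table if it already exists.
        connection.create_table(table_name, data, overwrite=True)
    else:
        # mode == "error": let the backend fail if the table already exists.
        connection.create_table(table_name, data, overwrite=False)
```

One open question with this shape is what `"append"` should do when the table does not exist yet: Spark creates it, whereas the sketch above would simply surface whatever error the backend raises.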
Please feel free to further discuss how you want to implement it here, or raise a PR with an initial stab that we can discuss—whatever works best for you!
Description
In ETL pipelines, loading transformed data into various data warehouses is a critical requirement. Currently, the `ibis.TableDataset` connector in Kedro does not support data insertion into Ibis backends.
Context
Why is this change important to me?
We are developing ETL pipelines in our organization, and inserting records into data warehouses is an essential requirement. At present, without support for data insertion, we must bypass the Kedro `DataCatalog` and rely on external tools such as `SQLAlchemy` or `dataset` to handle native data storage operations.
How would I use it?
Supporting data insertion in `ibis.TableDataset` would allow us to maintain a clean and consistent pipeline, avoiding the need for custom load operations within nodes. This would simplify the workflow and allow Kedro to manage the complete I/O process.
How can it benefit other users?
With this feature, users could avoid writing custom loading logic inside nodes, keeping their pipelines cleaner. This would make Kedro more usable in I/O-heavy scenarios, particularly for teams working with data warehouses or similar storage backends.