kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.47k stars 875 forks

[DataCatalog]: Convert between dataset formats at the catalog level #3942

Open ElenaKhaustova opened 3 weeks ago

ElenaKhaustova commented 3 weeks ago

Description

Users have asked for a way to convert between dataset formats at the catalog level. Integrating Kedro with existing standard dataset formats like dlthub and Ibis would also give users a convenient way to work with diverse datasets and to convert between formats.

We propose to:

  1. Explore the feasibility of adding methods to the framework's API that convert between dataset formats at the catalog level. These methods should support common formats such as CSV, JSON, Parquet, and others, so that users can work flexibly with diverse datasets.
  2. Explore the feasibility of integrating Kedro with existing standard dataset formats such as dlthub and Ibis, allowing users to leverage these formats directly within the framework.
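
To make the first point concrete, here is a minimal sketch of what catalog-level conversion could look like. It uses only the Python standard library and does not use Kedro's API: `CSVRecords`, `JSONRecords`, `MiniCatalog`, and the `convert` method are all hypothetical names invented for illustration, and the actual design is still open.

```python
import csv
import json
import tempfile
from pathlib import Path


class CSVRecords:
    """Hypothetical dataset: stores a list of dict rows as a CSV file."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)

    def load(self):
        with self.filepath.open(newline="") as f:
            return list(csv.DictReader(f))

    def save(self, rows):
        # Assumes at least one row; column names come from the first row.
        with self.filepath.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)


class JSONRecords:
    """Hypothetical dataset: stores a list of dict rows as a JSON file."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)

    def load(self):
        return json.loads(self.filepath.read_text())

    def save(self, rows):
        self.filepath.write_text(json.dumps(rows, indent=2))


class MiniCatalog:
    """Hypothetical stand-in for a data catalog: maps names to datasets."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def load(self, name):
        return self._datasets[name].load()

    def save(self, name, data):
        self._datasets[name].save(data)

    def convert(self, source, target):
        # Conversion is a load through one dataset and a save through another.
        self.save(target, self.load(source))


workdir = Path(tempfile.mkdtemp())
catalog = MiniCatalog(
    {
        "cars_csv": CSVRecords(workdir / "cars.csv"),
        "cars_json": JSONRecords(workdir / "cars.json"),
    }
)

catalog.save("cars_csv", [{"make": "Saab", "year": "1999"}])
catalog.convert("cars_csv", "cars_json")
converted = catalog.load("cars_json")
```

In this sketch both datasets exchange the same in-memory type (a list of dicts), which is why `convert` can stay a one-liner. A real implementation would have to decide what happens when the source and target datasets expect different in-memory types.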

Context

merelcht commented 3 weeks ago

How does this relate to the existing transcoding functionality (https://docs.kedro.org/en/stable/data/data_catalog_yaml_examples.html#read-the-same-file-using-different-datasets-with-transcoding)? And for the Ibis part, would that go beyond the IbisDataset that we've added recently?
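
For reference, transcoding registers the same file under several dataset types using a `name@format` naming convention. The sketch below writes two such entries as a plain Python dict shaped like catalog configuration (the entry names and file path are illustrative). The `group_transcoded` helper is hypothetical and is not how Kedro resolves these names internally; it only shows that both entries refer to one logical dataset.

```python
from collections import defaultdict

# Two catalog entries pointing at the same file, each with its own dataset type.
catalog_config = {
    "my_dataframe@spark": {
        "type": "spark.SparkDataset",
        "filepath": "data/02_intermediate/data.parquet",
        "file_format": "parquet",
    },
    "my_dataframe@pandas": {
        "type": "pandas.ParquetDataset",
        "filepath": "data/02_intermediate/data.parquet",
    },
}


def group_transcoded(config):
    """Group `name@format` entries by base name (hypothetical helper)."""
    groups = defaultdict(dict)
    for entry_name, entry in config.items():
        base, sep, fmt = entry_name.partition("@")
        if sep:  # only entries that use the `@` convention
            groups[base][fmt] = entry["type"]
    return dict(groups)


transcoded = group_transcoded(catalog_config)
```

Here `transcoded` maps `my_dataframe` to its `spark` and `pandas` dataset types. Transcoding reads and writes one underlying file through different dataset types, whereas the proposal above talks about converting between file formats, so the overlap between the two is worth clarifying.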