kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[Versioning]: Explore Kedro + Delta Lake for versioning #4240

Open ElenaKhaustova opened 1 month ago

ElenaKhaustova commented 1 month ago

Description

At the current stage, by versioning we mean mapping a single version number to the corresponding versions of parameters, I/O data, and code, so that one can retrieve the full project state, including data, at any point in time.

The goal is to check whether we can use Delta Lake to map a single version number to code, parameters, and I/O data within Kedro, and how this aligns with Kedro’s workflow.
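For reference, here is a minimal sketch of how Delta Lake's time travel maps a single integer version number to a data snapshot. It assumes the `deltalake` (delta-rs) Python package and an illustrative local path, and is not necessarily the approach we would settle on:

```python
# Minimal sketch of Delta Lake time travel with the `deltalake` (delta-rs) package.
# The path and column names are illustrative only.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "data/01_raw/companies_delta"

# Each write creates a new table version (0, 1, 2, ...).
write_deltalake(path, pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}))
write_deltalake(path, pd.DataFrame({"id": [3], "value": ["c"]}), mode="append")

# A single integer version pins the exact data snapshot...
dt = DeltaTable(path, version=0)
print(dt.to_pandas())  # state of the table as of version 0

# ...and the table history records when each version was created,
# which is the hook for aligning data versions with code and parameter versions.
print(DeltaTable(path).history())
```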

As a result, we expect a working example of a Kedro project used with Delta Lake for versioning and some assumptions on:

Context

https://github.com/kedro-org/kedro/issues/4199

@astrojuanlu:

Kedro + Delta Lake is not only possible, but works really well. This is the demo I showed at the August 22nd Coffee Chat: https://github.com/astrojuanlu/kedro-deltalake-demo

Market research

lrcouto commented 2 days ago

Still exploring the implementation, but here are some insights from what I've found so far.

For the setup: if you're already using Spark in your project, you don't need to add it as a dependency for Delta Lake. If you're going to add Spark just for versioning, the Java setup can be a bit of a headache and the dependencies are pretty chunky in file size.
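To make the setup difference concrete, this is roughly what the Spark route (delta-spark) looks like, based on the Delta Lake quickstart docs; the exact config keys are worth double-checking. It requires a working Java installation, whereas the delta-rs route only needs `pip install deltalake`:

```python
# Rough sketch of the Spark + Delta Lake setup (delta-spark), which needs the JVM
# and the Delta JARs; config keys follow the Delta Lake quickstart.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("kedro-delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Reading a specific table version ("time travel") from Spark:
df = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("data/01_raw/companies_delta")
)
```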

For data formats, Delta Lake uses the Parquet format under the hood, so it works best with data that follows a similar tabular structure. Making it work with unstructured data might be more trouble than it's worth. Looking at the PolarsDeltaDataset class that @astrojuanlu made for his demo, I'm wondering if it would be possible to make a generic class that lets several different formats work with Delta Lake tables.

astrojuanlu commented 2 days ago

I'm wondering if it would be possible to make a generic class to make several different formats work with the DeltaLake tables.

Good question... I guess this is equivalent to avoiding making both a PandasCSVDataset and a PolarsCSVDataset.
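As a rough illustration of what such a generic class could look like (purely a sketch: the class name, the `backend` parameter, and the conversion logic are hypothetical, not an agreed design), assuming the `deltalake` package and Kedro's `AbstractDataset` interface:

```python
# Hypothetical sketch: a backend-agnostic Delta dataset for Kedro.
# The class name and the `backend` parameter are illustrative, not an agreed design.
from typing import Any, Optional

from deltalake import DeltaTable, write_deltalake
from kedro.io import AbstractDataset


class GenericDeltaDataset(AbstractDataset):
    """Loads/saves a Delta table, converting to the requested dataframe backend."""

    def __init__(self, filepath: str, backend: str = "pandas", version: Optional[int] = None):
        self._filepath = filepath
        self._backend = backend
        self._version = version

    def _load(self) -> Any:
        table = DeltaTable(self._filepath, version=self._version)
        if self._backend == "polars":
            import polars as pl

            return pl.from_arrow(table.to_pyarrow_table())
        return table.to_pandas()

    def _save(self, data: Any) -> None:
        # Polars frames are converted to Arrow; pandas frames are written directly.
        if self._backend == "polars":
            data = data.to_arrow()
        write_deltalake(self._filepath, data, mode="overwrite")

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath, "backend": self._backend, "version": self._version}
```

With something like this, the dataframe library becomes a catalog configuration option rather than a reason to maintain a separate dataset class per format.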