kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[Versioning]: Explore Kedro + Delta Lake for versioning #4240

Open ElenaKhaustova opened 1 month ago

ElenaKhaustova commented 1 month ago

Description

At the current stage, by versioning we mean mapping a single version number to the corresponding versions of parameters, I/O data, and code, so that one can retrieve the full project state, including data, at any point in time.

The goal is to check whether we can use Delta Lake to map a single version number to code, parameters, and I/O data within Kedro, and how this aligns with Kedro’s workflow.
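For reference, here is a minimal sketch of how Delta Lake's time travel maps a single integer version number to a data snapshot. It assumes the `deltalake` (delta-rs) Python package and an illustrative local path, and is not necessarily the approach we would settle on:

```python
# Minimal sketch of Delta Lake time travel with the `deltalake` (delta-rs) package.
# The path and column names are illustrative only.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "data/01_raw/companies_delta"

# Each write creates a new table version (0, 1, 2, ...).
write_deltalake(path, pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}))
write_deltalake(path, pd.DataFrame({"id": [3], "value": ["c"]}), mode="append")

# A single integer version pins the exact data snapshot...
dt = DeltaTable(path, version=0)
print(dt.to_pandas())  # state of the table as of version 0

# ...and the table history records when each version was created,
# which is the hook for aligning data versions with code and parameter versions.
print(DeltaTable(path).history())
```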

As a result, we expect a working example of a Kedro project used with Delta Lake for versioning and some assumptions on:

Context

https://github.com/kedro-org/kedro/issues/4199

@astrojuanlu:

Kedro + Delta Lake is not only possible, but works really well. This is the demo I showed at the August 22nd Coffee Chat: https://github.com/astrojuanlu/kedro-deltalake-demo

Market research

lrcouto commented 2 days ago

Still exploring the implementation, but here are some insights from what I've found so far.

For the setup: if you're already using Spark in your project, you don't need to add it as a dependency for Delta Lake. If you're going to add Spark just for versioning, the Java setup can be a bit of a headache and the dependencies are pretty chunky in file size.
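To make the setup difference concrete, this is roughly what the Spark route (delta-spark) looks like, based on the Delta Lake quickstart docs; the exact config keys are worth double-checking. It requires a working Java installation, whereas the delta-rs route only needs `pip install deltalake`:

```python
# Rough sketch of the Spark + Delta Lake setup (delta-spark), which needs the JVM
# and the Delta JARs; config keys follow the Delta Lake quickstart.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("kedro-delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Reading a specific table version ("time travel") from Spark:
df = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("data/01_raw/companies_delta")
)
```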

For data formats, Delta Lake uses the Parquet format under the hood, so it works best with data that follows a similar tabular structure. Making it work with unstructured data might be more trouble than it's worth. Looking at the PolarsDeltaDataset class that @astrojuanlu made for his demo, I'm wondering if it would be possible to make a generic class that lets several different formats work with Delta Lake tables.

astrojuanlu commented 2 days ago

I'm wondering if it would be possible to make a generic class to make several different formats work with the DeltaLake tables.

Good question... I guess this is equivalent to avoiding making both a PandasCSVDataset and a PolarsCSVDataset.
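As a rough illustration of what such a generic class could look like (purely a sketch: the class name, the `backend` parameter, and the conversion logic are hypothetical, not an agreed design), assuming the `deltalake` package and Kedro's `AbstractDataset` interface:

```python
# Hypothetical sketch: a backend-agnostic Delta dataset for Kedro.
# The class name and the `backend` parameter are illustrative, not an agreed design.
from typing import Any, Optional

from deltalake import DeltaTable, write_deltalake
from kedro.io import AbstractDataset


class GenericDeltaDataset(AbstractDataset):
    """Loads/saves a Delta table, converting to the requested dataframe backend."""

    def __init__(self, filepath: str, backend: str = "pandas", version: Optional[int] = None):
        self._filepath = filepath
        self._backend = backend
        self._version = version

    def _load(self) -> Any:
        table = DeltaTable(self._filepath, version=self._version)
        if self._backend == "polars":
            import polars as pl

            return pl.from_arrow(table.to_pyarrow_table())
        return table.to_pandas()

    def _save(self, data: Any) -> None:
        # Polars frames are converted to Arrow; pandas frames are written directly.
        if self._backend == "polars":
            data = data.to_arrow()
        write_deltalake(self._filepath, data, mode="overwrite")

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath, "backend": self._backend, "version": self._version}
```

With something like this, the dataframe library becomes a catalog configuration option rather than a reason to maintain a separate dataset class per format.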