ElenaKhaustova opened 1 month ago
Still exploring the implementation, but here are some insights from what I've found so far.
For the setup: if you're already using Spark on your project, you don't need to add it as a separate dependency for Delta Lake. If you're adding Spark just for versioning, though, the Java setup can be a bit of a headache and the dependencies are pretty heavy.
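For reference, a minimal sketch of what the Spark-based setup looks like, assuming the `delta-spark` pip package (the builder config keys are from the Delta Lake docs; the app name and path are placeholders):

```python
# Minimal sketch: configuring a SparkSession for Delta Lake.
# Assumes `pip install delta-spark` (which pulls in a compatible pyspark)
# and a working Java installation -- the "headache" part.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-versioning-demo")
    # Register Delta's SQL extensions and catalog with Spark.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog",
    )
)
# Adds the Delta Lake JARs to the session (the heavy dependencies).
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.range(5)
df.write.format("delta").mode("overwrite").save("/tmp/delta-demo")
```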
For data formats, Delta Lake stores Parquet files under the hood, so it works best with data that already has a tabular structure; making it work with unstructured data might be more work than it's worth. Looking at the PolarsDeltaDataset class that @astrojuanlu made for his demo, I'm wondering if it would be possible to make a generic class that lets several different formats work with Delta Lake tables.
Good question... I guess this is equivalent to avoiding separate PandasCSVDataset and PolarsCSVDataset classes.
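To make that idea concrete, here is a rough sketch of what a frame-agnostic Delta dataset could look like, assuming the Spark-free `deltalake` (delta-rs) package and Kedro's `AbstractDataset`. The `GenericDeltaDataset` name and its `frame_type` parameter are hypothetical, not an existing Kedro API:

```python
# Hypothetical sketch of a frame-agnostic Delta dataset for Kedro.
# Assumes `pip install deltalake polars pandas` and Kedro >= 0.19.
from typing import Any

import polars as pl
from deltalake import DeltaTable, write_deltalake
from kedro.io import AbstractDataset


class GenericDeltaDataset(AbstractDataset):
    """Load/save pandas or Polars frames from a single Delta table."""

    def __init__(self, filepath: str, frame_type: str = "pandas"):
        self._filepath = filepath
        self._frame_type = frame_type  # "pandas" or "polars"

    def _load(self) -> Any:
        if self._frame_type == "polars":
            return pl.read_delta(self._filepath)
        # delta-rs reads into Arrow, which converts cheaply to pandas.
        return DeltaTable(self._filepath).to_pandas()

    def _save(self, data: Any) -> None:
        if isinstance(data, pl.DataFrame):
            data.write_delta(self._filepath, mode="overwrite")
        else:
            write_deltalake(self._filepath, data, mode="overwrite")

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "frame_type": self._frame_type}
```

Both branches write to the same Delta table on disk, so the storage layer stays uniform regardless of which dataframe library a node uses.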
Description
At the current stage, by versioning we mean mapping a single version number to the corresponding versions of parameters, I/O data, and code, so that one can retrieve the full project state, including data, at any point in time.
The goal is to check whether we can use Delta Lake to map a single version number to code, parameters, and I/O data within Kedro, and how this aligns with Kedro's workflow.
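The mechanism that makes this mapping plausible is Delta Lake's time travel: every write produces a new table version that can be read back later. A minimal sketch, assuming the `deltalake` package and a placeholder table path:

```python
# Sketch of Delta Lake time travel, which is what would let a single
# project version number map back to a concrete snapshot of the I/O data.
# Assumes `pip install deltalake pandas`; the path is a placeholder.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "/tmp/delta-versioning-demo"

# Each write creates a new table version (0, 1, ...).
write_deltalake(path, pd.DataFrame({"x": [1, 2]}), mode="overwrite")
write_deltalake(path, pd.DataFrame({"x": [3, 4]}), mode="overwrite")

# Read the latest snapshot, then travel back to version 0.
latest = DeltaTable(path).to_pandas()
v0 = DeltaTable(path, version=0).to_pandas()

# The commit history carries timestamps and operation metadata that a
# project-level version number could be pinned against.
print(DeltaTable(path).history())
```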
As a result, we expect a working example of a Kedro project using Delta Lake for versioning, along with some assumptions on:
Context
https://github.com/kedro-org/kedro/issues/4199
@astrojuanlu:
Market research