kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[DataCatalog]: Spike - Catalog serialization and deserialization support #3932

Open ElenaKhaustova opened 5 months ago

ElenaKhaustova commented 5 months ago

Description

  1. Users report a lack of persistence in the add workflow: there is no built-in way to save a modified catalog.
  2. Users express the need for an API to save and load catalogs after compilation or modification by converting catalogs to YAML format and back.
  3. Users have trouble loading pickled DataCatalog objects when the Kedro version used for loading differs from the one used for saving. They need a way to serialize and deserialize the DataCatalog object that does not depend on the Kedro version.

We propose to explore the feasibility of implementing to_yaml() and from_yaml() methods for the DataCatalog object to facilitate serialization and deserialization without dependency on Kedro versions.
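
A minimal sketch of the round trip this proposal is after. Everything here is an assumption made for illustration: `MiniCatalog` is a hypothetical stand-in, not Kedro's `DataCatalog`, and the `to_json` / `from_json` names and the JSON encoding are used only so the example runs with the standard library. A YAML version would have the same shape.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MiniCatalog:
    """Hypothetical stand-in for a catalog; not Kedro's DataCatalog."""

    # Dataset name -> plain configuration (type, filepath, ...).
    datasets: dict = field(default_factory=dict)

    def add(self, name: str, config: dict) -> None:
        self.datasets[name] = dict(config)

    def to_json(self) -> str:
        # Only plain data is written, so nothing depends on class internals.
        return json.dumps(self.datasets, indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "MiniCatalog":
        return cls(datasets=json.loads(text))


catalog = MiniCatalog()
catalog.add("cars", {"type": "pandas.CSVDataset", "filepath": "data/cars.csv"})

# Round trip: the restored catalog is rebuilt from configuration alone.
restored = MiniCatalog.from_json(catalog.to_json())
```

Because the stored form is plain configuration, a reader only needs to understand the configuration schema, not the internal attribute layout of the class that wrote it.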

Context

User feedback:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/mlflow/kedro_pipeline_model.py#L143

# pseudo code
import pickle

payload = pickle.dumps(catalog)
catalog = pickle.loads(payload)  # this will fail if I reload with a newer Kedro version and any attribute (even a private one) has changed. This breaks much more often than we should expect.

"It would be much more robust to be able to do this":

# pseudo code
catalog.serialize("path/catalog.yml")  # name TBD: serialize? to_config? to_yaml? to_json? to_dict?
catalog.deserialize("path/catalog.yml")  # much more robust since it is not stored as a Python object -> maybe catalog.from_config?
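
The fragility described in this feedback can be reproduced without Kedro. The sketch below uses a hypothetical `Catalog` class and simulates a library upgrade by redefining it with a renamed private attribute. In this variant unpickling succeeds and the object breaks on first use; if the class had been moved or removed, `pickle.loads` itself would raise instead.

```python
import pickle


class Catalog:
    """Hypothetical library class, 'version 1'."""

    def __init__(self):
        self._datasets = {"cars": "data/cars.csv"}

    def list(self):
        return sorted(self._datasets)


# Pickle stores a reference to the class plus the instance attributes.
payload = pickle.dumps(Catalog())


# Simulate an upgrade: same class name, private attribute renamed.
class Catalog:
    """Hypothetical library class, 'version 2'."""

    def __init__(self):
        self._entries = {}

    def list(self):
        return sorted(self._entries)


# Unpickling does not call __init__; it restores the old attributes
# onto an instance of the new class.
stale = pickle.loads(payload)

try:
    stale.list()
    error = None
except AttributeError as exc:
    # The new code expects `_entries`, which the old payload never had.
    error = exc
```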

Extra context: https://github.com/kedro-org/kedro/issues/3995#issuecomment-2419884227

astrojuanlu commented 5 months ago

Very similar to the DataCatalog.from_file proposal discussed in #2967

datajoely commented 1 month ago

I like to_yaml() and from_yaml() personally.

ElenaKhaustova commented 1 week ago

From the user feedback, we can define three main pain points to address:

  1. Compiling the catalog into a format that is easy to inspect, for example to check that all dataset factories are resolved as expected
  2. Saving and loading the catalog configuration only, without pickling
  3. Saving and loading a modified catalog, including both configuration and data
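
To make the first pain point concrete, here is a rough sketch of "compiling" a catalog: expanding factory patterns into concrete entries and dumping the result as text that can be reviewed. The `FACTORIES` mapping, the `resolve` helper, and the regex-based matching are all assumptions made for this example and are not how Kedro resolves patterns internally.

```python
import json
import re
from typing import Optional

# Hypothetical factory entries: name pattern -> configuration template.
FACTORIES = {
    "{kind}_csv": {"type": "pandas.CSVDataset", "filepath": "data/{kind}.csv"},
}


def resolve(name: str, factories: dict) -> Optional[dict]:
    """Return the concrete config for `name`, or None if no pattern matches."""
    for pattern, template in factories.items():
        # "{kind}_csv" becomes the regex "(?P<kind>.+)_csv".
        regex = re.sub(r"\\\{(\w+)\\\}", r"(?P<\1>.+)", re.escape(pattern))
        match = re.fullmatch(regex, name)
        if match:
            fields = match.groupdict()
            # Fill the captured fields into every string value of the template.
            return {
                key: value.format(**fields) if isinstance(value, str) else value
                for key, value in template.items()
            }
    return None


# Dataset names a pipeline would ask for (illustrative).
names = ["cars_csv", "boats_csv"]
compiled = {name: resolve(name, FACTORIES) for name in names}

# A plain-text dump that can be reviewed or diffed.
print(json.dumps(compiled, indent=2, sort_keys=True))
```

The compiled mapping contains no placeholders, so checking that every factory resolved as expected reduces to reading or diffing the dumped text.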

The first two pain points can be addressed by:

The third requires 1 and 2 to be solved first, and also needs a solution for saving the data itself.

The plan for now is to address 1 and 2 first.