
[DataCatalog]: Catalog serialization and deserialization support #3932

Open ElenaKhaustova opened 3 weeks ago

ElenaKhaustova commented 3 weeks ago

Description

  1. Users note the lack of persistence in the add workflow: there is no built-in functionality to save a modified catalog.
  2. Users want an API to save and load catalogs after compilation or modification, by converting catalogs to YAML and back.
  3. Users encounter difficulties loading pickled DataCatalog objects when the Kedro version has changed since the object was pickled, leading to compatibility issues. They need a way to serialize and deserialize the DataCatalog object that does not depend on the Kedro version.

We propose exploring the feasibility of implementing to_yaml() and from_yaml() methods for the DataCatalog object, so that catalogs can be serialized and deserialized without depending on the Kedro version.
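As a rough illustration only, a minimal sketch of what such a round trip could look like today, built on the existing DataCatalog.from_config constructor. The helper names catalog_to_yaml and catalog_from_yaml are hypothetical, not part of the Kedro API, and the sketch still requires the original config dict to be available, which is exactly the gap a to_yaml() method on the catalog itself would close:

# sketch only -- helper names are hypothetical, not an existing Kedro API
import yaml
from kedro.io import DataCatalog

def catalog_to_yaml(catalog_config: dict, path: str) -> None:
    # Persist the plain dataset-config dict (same shape as catalog.yml) to disk
    with open(path, "w") as f:
        yaml.safe_dump(catalog_config, f)

def catalog_from_yaml(path: str) -> DataCatalog:
    # Rebuild the catalog from the saved config; this depends only on the
    # YAML schema, not on the internal attribute layout of DataCatalog
    with open(path) as f:
        return DataCatalog.from_config(yaml.safe_load(f))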

Context

User feedback:

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/mlflow/kedro_pipeline_model.py#L143

# pseudo code
serialized = pickle.dumps(catalog)
catalog = pickle.loads(serialized)  # this will fail if reloaded with a newer Kedro version and any attribute (even a private one) has changed. This breaks much more often than we would expect.

"It would be much more robust to be able to do this":

# pseudo code
catalog.serialize("path/catalog.yml")  # name TBD: serialize? to_config? to_yaml? to_json? to_dict?
catalog = DataCatalog.deserialize("path/catalog.yml")  # much more robust since it is not stored as a Python object -> maybe DataCatalog.from_config?
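The reason a config-based round trip is more robust than pickling: it stores only dataset types and constructor arguments, and deserialization rebuilds the objects through the running Kedro version's own constructors, so changes to private attributes between versions no longer break loading.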
astrojuanlu commented 3 weeks ago

Very similar to the DataCatalog.from_file proposal discussed in #2967.