[DataCatalog]: Spike - Catalog serialization and deserialization support

ElenaKhaustova commented 5 months ago

Description

Users report a lack of persistence in the add workflow: there is no built-in functionality to save a modified catalog.
Users express the need for an API to save and load catalogs after compilation or modification by converting catalogs to YAML format and back.
Users have difficulty loading pickled DataCatalog objects when the Kedro version at load time differs from the one used for pickling, which leads to compatibility issues. They need a way to serialize and deserialize the DataCatalog object that does not depend on the Kedro version.

We propose to explore the feasibility of implementing to_yaml() and from_yaml() methods for the DataCatalog object to facilitate serialization and deserialization without dependency on Kedro versions.

Context

User feedback:

The add workflow lacks persistence, so you cannot save a modified catalog: "You have a catalog and then you start adding extra stuff to it, currently we just throw away those added things when they close a notebook."
A catalog-to-YAML function is needed to save a modified catalog: "People have always asked for it. Could I have a catalog to YAML function so that you could actually spit out the YAML files that are needed to do this again later on?"
Competitors provide functionality to compile the catalog and show the result: "I would point to the DPC compile workflow. And actually, if you do DBT run it does DBT compile first and then runs the compiled outputs. Whereas in Kedro, you have your very concise complicated YAML and it will all that compilation happens at run time and there's no way for the user to see it."
Users who pickle a DataCatalog object have difficulty loading it back if the Kedro version is different: "Serialization is an issue because I often pickle a catalog (mostly as part of a mlflow model). Pickling the catalog is really something that leads to a lot of problems because if I don't have the exact same Kedro version when I want to load the catalog, if the object has any change inside - private method or attribute it will lead to error."

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/mlflow/kedro_pipeline_model.py#L143

# pseudo code
import pickle
pickled = pickle.dumps(catalog)
catalog = pickle.loads(pickled)  # this will fail if I reload with a newer Kedro version and any attribute (even a private one) has changed. This breaks much more often than we should expect.

"It would be much more robust to be able to do this":

# pseudo code
catalog.serialize("path/catalog.yml") # name TBD: serialize? to_config? to_yaml? to_json? to_dict? 
catalog.deserialize("path/catalog.yml") # much more robust since it is not stored as a Python object -> maybe catalog.from_config?

Extra context: https://github.com/kedro-org/kedro/issues/3995#issuecomment-2419884227

astrojuanlu commented 5 months ago

Very similar to the DataCatalog.from_file proposal discussed in #2967

datajoely commented 1 month ago

I like to_yaml() and from_yaml() personally.

It would be nice if we preserved comments and the way the user organised their files before. I appreciate this increases complexity, but it does match the mental model of how the user thinks about their project.
I'm working with Pydantic a lot at the moment; I wonder if it makes sense to use it, or at least take some inspiration from it.

ElenaKhaustova commented 1 week ago

From the user feedback, we can define three main pain points to address:

1. Compiling the catalog into a format that is easy to inspect, for example to make sure all factories are resolved as expected
2. Saving/loading the catalog configuration only, without pickling
3. Saving/loading a modified catalog, including configuration and data
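To illustrate the first pain point, here is a minimal sketch of what "compiling" a catalog could mean: expanding dataset factory patterns into concrete entries that a user can inspect before a run. This is a simplified illustration, not Kedro's implementation; `resolve_pattern` and `compile_catalog` are hypothetical helper names, and the regex-based matching is an assumption made to keep the example dependency-free.

```python
import re


def resolve_pattern(pattern, config, dataset_name):
    """Return a concrete config if dataset_name matches pattern, else None.

    Hypothetical helper: a simplified stand-in for factory resolution.
    """
    # Turn a pattern such as "{name}_csv" into the regex "(?P<name>.+)_csv".
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = "".join(
        f"(?P<{part}>.+)" if index % 2 else re.escape(part)
        for index, part in enumerate(parts)
    )
    match = re.fullmatch(regex, dataset_name)
    if match is None:
        return None
    # Substitute the captured placeholders into every string value.
    return {
        key: value.format(**match.groupdict()) if isinstance(value, str) else value
        for key, value in config.items()
    }


def compile_catalog(patterns, dataset_names):
    """Expand every dataset name against the first pattern it matches."""
    compiled = {}
    for name in dataset_names:
        for pattern, config in patterns.items():
            resolved = resolve_pattern(pattern, config, name)
            if resolved is not None:
                compiled[name] = resolved
                break
    return compiled


patterns = {
    "{name}_csv": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/{name}.csv",
    }
}
compiled = compile_catalog(patterns, ["companies_csv", "reviews_csv"])
```

Here `compiled` holds one fully resolved entry per dataset name, which is the kind of output a user could review or write to a file.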

The first two pain points can be addressed by:

1. Implementing a catalog.to_config() method (since we already have catalog.from_config()) - https://github.com/kedro-org/kedro/issues/4329
2. Implementing a method to save and load the catalog configuration obtained from catalog.to_config() - https://github.com/kedro-org/kedro/issues/4330

The third one requires 1 and 2 to be solved first, and also needs a solution for saving the data itself.

The plan for now is to address 1 and 2 first.
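A rough sketch of how steps 1 and 2 could fit together, under loud assumptions: `MiniCatalog` is a hypothetical stand-in for DataCatalog that only tracks dataset configurations, and `save_catalog` / `load_catalog` are illustrative names, not an existing Kedro API. JSON is used only to keep the example standard-library-only; a YAML version would look the same with PyYAML's `yaml.safe_dump` and `yaml.safe_load`.

```python
import json
import tempfile
from pathlib import Path


class MiniCatalog:
    """Hypothetical stand-in for DataCatalog; stores dataset configs only."""

    def __init__(self, configs=None):
        self._configs = {name: dict(cfg) for name, cfg in (configs or {}).items()}

    def add(self, name, config):
        # Record the configuration of a dataset added after construction.
        self._configs[name] = dict(config)

    @classmethod
    def from_config(cls, config):
        return cls(config)

    def to_config(self):
        # Return plain dicts so the result does not depend on class internals.
        return {name: dict(cfg) for name, cfg in self._configs.items()}


def save_catalog(catalog, path):
    """Write the catalog configuration to a text file (no pickling)."""
    Path(path).write_text(json.dumps(catalog.to_config(), indent=2))


def load_catalog(path):
    """Rebuild a catalog from a previously saved configuration file."""
    return MiniCatalog.from_config(json.loads(Path(path).read_text()))


catalog = MiniCatalog.from_config(
    {"companies": {"type": "pandas.CSVDataset", "filepath": "data/companies.csv"}}
)
# A dataset added interactively, e.g. from a notebook session.
catalog.add("reviews", {"type": "pandas.CSVDataset", "filepath": "data/reviews.csv"})

with tempfile.TemporaryDirectory() as tmp_dir:
    catalog_path = Path(tmp_dir) / "catalog.json"
    save_catalog(catalog, catalog_path)
    restored = load_catalog(catalog_path)
```

Because only plain configuration is written to disk, reloading does not depend on the private attributes of the catalog class, which is the fragility reported with pickling. The sketch deliberately leaves out the data-saving part that the third pain point requires.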

kedro-org / kedro

[DataCatalog]: Spike - Catalog serialization and deserialization support #3932
