kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.47k stars 875 forks source link

[DataCatalog]: Refactor dataset factory resolution logic #3925

Open ElenaKhaustova opened 3 weeks ago

ElenaKhaustova commented 3 weeks ago

Description

We propose:

  1. Move the resolution logic out of the _get_dataset() and make it standard across all the modules and available for users via public API.
  2. Explore the feasibility of implementing simpler resolution logic for dataset factories to ensure that datasets are resolved when needed without iterating through all of them.
  3. Enhance documentation for advanced users to clearly explain the dataset resolution process and the usage of dataset factories.

This issue also relates to a more global question raised by @astrojuanlu: "The most important philosophical question here is "opening up" the DataCatalog abstraction and make datasets first-class citizens, and not an implementation detail. This was mentioned as far back as 2022 https://github.com/kedro-org/kedro/issues/1778#issuecomment-1292054256"

Context

Kedro-Viz case

After obtaining the catalog, the next step is to populate the catalog repositories. At this point, there's an encounter with a limitation - the DataCatalog does not include datasets resolved from factory patterns. To overcome the limitation, methods like pipeline.data_sets() and pipeline.datasets() are employed to access datasets, followed by the usage of _get_dataset(). The need arises from the inability of the public API to lazily load datasets and resolve factory patterns, which is necessary for Kedro-Viz's operations, especially before starting the server.

https://github.com/kedro-org/kedro-viz/blob/8fe5fa4810bb639013222d4bf1da3d9d337fb6d3/package/kedro_viz/data_access/managers.py#L72

Screenshot 2024-06-04 at 16 53 50

Screenshot 2024-06-04 at 16 33 59

MLFlow case

They use DataCatlog.exists() method to resolve factory patterns in after_pipeline_run hook to log pipeline artifacts. They find it unintuitive and admit that people often forget about that which leads to bugs that are hard to find.

https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/framework/hooks/mlflow_hook.py#L365

Screenshot 2024-06-04 at 16 45 47

Logic duplication

Currently, dataset factory resolution logic resides in two places: the DataCatalog._get_dataset() method and the list_datasets() CLI.

https://github.com/kedro-org/kedro/blob/27f5405cefd6701ffac4c6243030486fb7d3c942/kedro/io/data_catalog.py#L385

https://github.com/kedro-org/kedro/blob/27f5405cefd6701ffac4c6243030486fb7d3c942/kedro/framework/cli/catalog.py#L85

It makes it hard to maintain and keep consistent as every time we need to modify the logic we have to make it in two places and, including tests.

An example of such PR: https://github.com/kedro-org/kedro/pull/3859