kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[DataCatalog]: Simplify the way to access catalog #3923

Open ElenaKhaustova opened 3 weeks ago

ElenaKhaustova commented 3 weeks ago

Description

Currently, there are two ways to access the catalog: call the DataCatalog.from_config() method, or instantiate a KedroSession, load the context, and access the catalog from there.

Users point out that:

We propose to explore the feasibility of developing a clear and intuitive API for accessing the catalog directly from a Kedro project, eliminating the need for a session / hiding session creation.

Context

The current way of obtaining the Data Catalog is cumbersome: users who simply want to access the catalog must first create a Kedro session and load a context. One user reported that getting the catalog from a Kedro project typically means searching the Kedro documentation for the right code snippet to copy and paste. To work around this, the user wrote a custom function, catalog_from_project(), and suggested that a similar utility could be included directly in Kedro to improve accessibility and user experience.
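For illustration, a helper of this kind might look like the sketch below. The user's actual catalog_from_project() is not reproduced in this issue, so the body and the env parameter are assumptions; it relies on KedroSession.create() and bootstrap_project() as available in recent Kedro releases.

```python
from pathlib import Path


def catalog_from_project(project_path, env=None):
    """Return the DataCatalog of the Kedro project at `project_path`.

    Assumed body: the user's original helper is not shown in this issue.
    """
    # Deferred imports, so that importing this helper does not load Kedro.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    project_path = Path(project_path).resolve()
    # Read the project metadata and configure the project settings.
    bootstrap_project(project_path)
    # A session is still created, so hooks that modify the catalog are applied.
    with KedroSession.create(project_path=project_path, env=env) as session:
        return session.load_context().catalog
```

Calling catalog_from_project("path/to/project") from a script or notebook would then return the catalog without the caller handling the session directly.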


Frequent changes to the method for acquiring a Kedro catalog across versions (for example, between Kedro 0.16 and 0.17) make compatibility hard to maintain. Plugins such as Kedro-Viz have to implement complex logic to adapt to these version differences.

Some users suggest a read-only DataCatalog instance: a way to create a data catalog, at least for read-only use cases, that does not rely on creating a full-blown Kedro session.

Implementation Notes

The session creation step is needed to apply hooks that can change the catalog upon loading, so it may be hard to eliminate session creation completely. We can consider encapsulating the session creation logic and providing an interface such as from kedro.framework.project.session.context import catalog and/or from kedro.framework.project import catalog, with or without session creation.
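One possible way to back an import such as "from ... import catalog" while hiding session creation is a module-level __getattr__ (PEP 562) that builds the catalog on first access. The sketch below only illustrates that mechanism and is not an existing Kedro API: the module name project_catalog_demo, make_lazy_catalog_module() and the stand-in builder are all hypothetical.

```python
import sys
import types


def make_lazy_catalog_module(name, build_catalog):
    """Return a module whose `catalog` attribute is built on first access.

    `build_catalog` is a zero-argument callable. In a real project it could
    create a session and return `session.load_context().catalog`.
    """
    module = types.ModuleType(name)

    def __getattr__(attr):
        # PEP 562: called only when normal attribute lookup fails.
        if attr == "catalog":
            value = build_catalog()
            module.catalog = value  # cache, so later lookups skip this hook
            return value
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    module.__getattr__ = __getattr__
    return module


# Stand-in builder so the sketch runs without a Kedro project.
calls = []


def fake_build_catalog():
    calls.append("built")
    return {"example_dataset": "placeholder"}


sys.modules["project_catalog_demo"] = make_lazy_catalog_module(
    "project_catalog_demo", fake_build_catalog
)

# The import triggers the builder; nothing is created before this line.
from project_catalog_demo import catalog
```

With this design the cost of creating a session is paid only by code that actually imports the catalog, and hooks would still run because the builder goes through the session.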

astrojuanlu commented 3 weeks ago

The boilerplate required to extract the catalog from the session is clear.

Do we have any insight on what's difficult about

from kedro.config import OmegaConfigLoader
from kedro.io import DataCatalog

conf_loader = OmegaConfigLoader(conf_source="conf", base_env="base", default_run_env="local")
conf_catalog = conf_loader["catalog"]

catalog = DataCatalog.from_config(conf_catalog)

?

(Asking because this was discussed in https://github.com/kedro-org/kedro/issues/2967)

astrojuanlu commented 3 weeks ago

If we focus this issue on how to access the catalog for an existing project or session though, this is more of a Kedro Framework issue and not a DataCatalog API issue (which should stand on its own, unaware of the Framework).

merelcht commented 3 weeks ago

From reading this issue, it sounds to me like these users aren't aware that they can get the catalog via the config loader, as @astrojuanlu shows in the snippet above. We worked on improving that massively for the 0.19.0 release, so I would personally leave this for now and do nothing other than maybe going back to the people who mentioned this and sending them the docs.