kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Research summary of insights for redesigning Kedro's data catalog API #3934

Open iamelijahko opened 3 weeks ago

iamelijahko commented 3 weeks ago

Why are we doing this research?

Problem

The DataCatalog API, one of Kedro's older components, needs a refactor to better align with user needs. GitHub threads reveal confusion with the current API and with the workings of io.core and io.data_catalog, suggesting a general rethink is necessary.

This research aims to reimagine the DataCatalog as a flexible, intuitive tool for managing datasets, addressing the limitations of the current "frozen" datasets view, and reducing reliance on undocumented private APIs.

Hypothesis

We believe that by enhancing the flexibility and visibility of the DataCatalog, including official support for dynamic interactions with datasets, advanced Kedro users and plugin developers will achieve a more intuitive and efficient data management experience, reducing reliance on undocumented private APIs and improving overall project outcomes.

What do we want to learn?

Objectives

Identify the specific limitations and challenges faced by advanced Kedro users and plugin developers when interacting with the current DataCatalog, particularly its "frozen" datasets view, and their reliance on undocumented private APIs.

Understand how these identified limitations impact the workflow, efficiency, and outcomes of projects using Kedro, particularly focusing on the user experience and the technical constraints they encounter.

Research Questions

  1. Who uses the DataCatalog API or private APIs in Kedro (what is their role)?
  2. What do users use the DataCatalog API or private APIs for, and how does this support their specific workflows or project needs?
  3. When: at which stages of a project do users typically interact with the DataCatalog API or private APIs?
  4. Why do users opt for the DataCatalog API or private APIs, and what objectives are they aiming to fulfil?
  5. Where: in which environments, platforms, and tools do users primarily use the DataCatalog API or private APIs, and how does this affect their usage patterns?

Value Prop

[Figure: DataCatalog value proposition]

Research Methodology

| | Phase 1: Questionnaires | Phase 2: Interviews |
| --- | --- | --- |
| Objective | To gain a general understanding of our advanced data catalog users: their goals, the types of APIs they use, their interaction stages, and their environment. | To uncover the reasons for using private DataCatalog APIs by walking through user workflows, discussing pain/pleasure points, and following up on questionnaire specifics. |
| Research format | A questionnaire sent via email, taking approximately 15-20 minutes to complete. | 1:1 remote Zoom interviews, lasting 45 minutes per session. |
| Dates conducted | 11 March 2024 - 20 May 2024 | 18 April 2024 - 9 May 2024 |
| Number of participants | 10 groups (13 participants) | 6 groups (7 participants) |
| Links | Questionnaire results | Interview results / interview recordings (Dovetail) |

Who are our Advanced Users?

We define advanced users as those with experience in managing and accessing Kedro DataCatalog, using the "frozen" datasets view, and seeking undocumented private APIs for specific project needs.

These include 6 internal users/groups and 4 external users/groups.

Persona Archetypes

| | Data Analysts | Software Developers / Data Scientists | Plug-in Developers |
| --- | --- | --- | --- |
| Technical expertise | Medium | Medium-high | High |
| Goals | May use the DataCatalog only outside of the Kedro pipeline.<br>Workflows include configuring the catalog, loading data, performing data analysis, and saving intermediate and final results. | Use Kedro to develop solutions based on Kedro pipelines.<br>Workflows include configuring the catalog, validating it, ensuring it contains the correct data, accessing metadata, loading datasets, performing analysis, running the pipeline, and obtaining pipeline outputs. | Use Kedro when developing plugins related to the Kedro framework (Kedro-Viz, AzureML, Vizro, etc.).<br>Workflows include accessing the catalog and datasets' metadata and modifying the catalog on the fly (dataset attributes) after it is created. |
| DataCatalog API usage | Focus mainly on exploration and development in notebooks rather than production; familiar only with public APIs, mostly load() and save(). | Mainly use public and private APIs to access dataset objects by name instead of via property access. | Extensively use private APIs, mainly because frozen datasets do not allow modifications. |
| Tools | Kedro DataCatalog, Jupyter Notebooks, Pandas, SQL, Excel. | Kedro framework, Kedro DataCatalog, Jupyter Notebooks, Pandas, SQL, Git, IDEs. | Kedro framework, Kedro DataCatalog, Python, ML frameworks, Git, IDEs. |
| Pain points | Need autocomplete functionality when accessing datasets in the catalog.<br>Struggle with understanding debug errors.<br>Are required to install all dependencies, even for unused datasets.<br>Struggle to find datasets within the catalog, particularly when dealing with a large number of datasets. | Confused by the _FrozenDatasets public API due to unclear documentation and limitations: getting a dataset by name, iterating through datasets, and getting metadata.<br>Need an improved visual representation of the catalog when printing.<br>Need autocomplete functionality.<br>Challenges with accessing and managing dataset filepaths. | No public methods to dynamically modify catalog datasets or parameters during pipeline execution.<br>Challenges with accessing and managing dataset filepaths.<br>Confusion with the _FrozenDatasets public API.<br>Complications in dataset pattern resolution.<br>Complexity of accessing the catalog from a Kedro session. |

Overall Observations

[Figure: Overall observation 1 — user journey]

1. Ease of Multi-Source Configuration

Synthesis

Each insight below is followed by the proposed action and the supporting pain points, feature requests, and tags gathered from the research.

1. Catalog serialization and deserialization support

Insight:

- Users note the lack of persistency in the add workflow, as there is no built-in functionality to save modified catalogs.
- Users express the need for an API to save and load catalogs after compilation or modification by converting catalogs to YAML format and back.
- Users encounter difficulties loading pickled DataCatalog objects when the Kedro version changes, leading to compatibility issues. They need a way to serialize and deserialize the DataCatalog object without depending on the Kedro version.

Action:

Explore the feasibility of implementing to_yaml() and from_yaml() methods for the DataCatalog object, to support serialization and deserialization independent of the Kedro version.

Pain points:

- The add workflow is missing persistency: you cannot save a modified catalog.

Feature requests:

- A catalog-to-YAML function to save a modified catalog.
- to_yaml() and from_yaml() methods to avoid issues with pickling catalog objects.
- A function to compile the catalog and showcase the result.

Tags: Exploration needed, New functionality
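A minimal sketch of the round trip users are asking for. The proposed to_yaml()/from_yaml() methods do not exist today, so this shows the config round trip they would wrap; it assumes kedro and kedro-datasets (pandas extras) are installed, and the dataset type string and filepath are illustrative:

```python
# Sketch of the workaround users describe: persist the catalog *config* (a plain
# dict) as YAML rather than pickling the DataCatalog object, so the result does
# not depend on the Kedro version.
import yaml
from kedro.io import DataCatalog

config = {
    "reviews": {
        "type": "pandas.CSVDataset",  # type string depends on kedro-datasets version
        "filepath": "data/01_raw/reviews.csv",
    },
}

catalog = DataCatalog.from_config(config)  # existing public API

yaml_text = yaml.safe_dump(config)         # serialise the config, not the object
restored = DataCatalog.from_config(yaml.safe_load(yaml_text))
```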
2. Simplify the way to access the catalog

Insight:

Currently there are two ways of accessing the catalog: use the DataCatalog.from_config() method, or instantiate a KedroSession, load the context, and access the catalog from there.

- Users report that accessing the catalog from a Kedro session is complex and requires an understanding of framework details, such as project creation and environment setup.
- Users report that acquiring the catalog involves writing a lot of code and navigating parameters that are outside the context of their work.
- Users find creating a Kedro session too heavy for simple catalog-reading tasks.

Action:

Explore the feasibility of developing a clear and intuitive API for accessing the catalog directly from a Kedro project, eliminating the need for a session or hiding session creation.

Pain points:

- Acquiring the catalog in the first place is very clunky and requires a lot of code.
- When creating a session, users have to care about the path to a Kedro project, the environment, and other parameters that may be irrelevant to their use case (for example Vizro).
- Creating a Kedro session seems too heavy for just reading the catalog.

Feature requests:

- A clear API to get the catalog from a Kedro project.
- A way to create a catalog without instantiating a Kedro session, for read-only purposes.
- An easier method to access the catalog directly, without the need for a session or the complications of hooks, would significantly improve usability.
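For reference, the "heavy" route described above looks roughly like this today (the project path is illustrative, and exact signatures vary across Kedro versions):

```python
# Today's route to the catalog: bootstrap the project, create a session,
# load the context, and only then reach the catalog.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_path = Path("/path/to/kedro-project")  # illustrative
bootstrap_project(project_path)

with KedroSession.create(project_path=project_path, env="local") as session:
    catalog = session.load_context().catalog
```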
3. Refactor dataset factory resolution logic

Insight:

- The current design complicates dataset pattern resolution, leading to confusion.
- Because the resolution logic resides in the private _get_dataset() method, users are forced to stick to the private API; using the public exists() method instead is not straightforward.
- Developers often forget that dataset factory resolution requires _get_dataset(), leading to further bugs.
- The resolution logic is duplicated between the DataCatalog class and the CLI, making it harder to maintain.

Action:

- Move the resolution logic out of _get_dataset(), make it standard across all modules, and expose it to users via a public API.
- Explore the feasibility of simpler resolution logic for dataset factories, so that datasets are resolved when needed without iterating through all of them.
- Enhance documentation for advanced users to clearly explain the dataset resolution process and the usage of dataset factories.

Pain points:

- Pattern resolution lives in the _get_dataset() method, so the public property does not work.
- Developers often forget that dataset factory resolution needs _get_dataset(), which leads to bugs.
- Dataset factory resolution needs the _get_dataset() method, which is why users call exists() when it is logically not required.
- If there were a way to register factory datasets, using _get_dataset() would not be necessary.
- Dataset factories are resolved lazily: a design choice on Kedro's side.
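The gap is easiest to see with an illustrative factory pattern (the catalog entry and dataset name below are assumptions, not from the research):

```python
# catalog.yml (illustrative) declares a dataset factory:
#
#   "{name}#csv":
#     type: pandas.CSVDataset
#     filepath: data/01_raw/{name}.csv
#
# The pattern is only resolved inside the private method, so users call it directly:
dataset = catalog._get_dataset("companies#csv")  # private API: triggers pattern resolution

# The public datasets view only contains already-materialised entries, so this
# fails until the private call above has resolved the pattern:
# catalog.datasets.companies__csv
```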
4. Improve the way to access namespaced datasets with the _FrozenDatasets API

Insight:

Users struggle with the _FrozenDatasets API when accessing namespaced datasets because it uses double underscores instead of dots, which they find unintuitive and cumbersome. Some prefer referring to a dataset by its original name, so they use the private _get_dataset() method instead.

Action:

- Explore the feasibility of modifying the _FrozenDatasets API to use dots instead of double underscores for namespaces, aligning with users' expectations.
- Provide a way to call datasets by their exact names: a get-dataset-by-name function.

Pain points:

- The _FrozenDatasets API is not convenient for accessing namespaced datasets; the double underscore is not intuitive.
- The use of double underscores instead of dots for namespaces in the catalog is unintuitive for users.
- Attribute replacement: C1 finds the replacement of characters like "." or "@" with "__" in dataset names unclean and prefers calling datasets by their exact names.
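For example (the dataset name is illustrative):

```python
# A dataset declared under a namespace as "data_science.model_input":
catalog.datasets.data_science__model_input        # public view: "." becomes "__"
catalog._get_dataset("data_science.model_input")  # private workaround: exact name
```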
5. Exploring DataCatalog as a standalone component for broader adoption and integration

Insight:

Several teams already use the catalog as a standalone component, demonstrating significant demand for this functionality. Use cases include collaboration between teams, sharing catalogs without the framework, and integration with other pipeline systems such as Metaflow. This existing adoption highlights the catalog's potential value outside the context of the framework.

Action:

Explore the possibility of making DataCatalog a standalone component (moving it outside of the framework).
6. Enhance the _FrozenDatasets public API

Insight:

Users face challenges understanding and effectively using the _FrozenDatasets public API due to unclear documentation and limitations. They struggle to get a dataset by name, iterate through datasets, and get metadata. They express uncertainty about the advantages of using _FrozenDatasets, and find it unintuitive to work with due to its underscore prefix and limited functionality compared to the private API.

Action:

- Enhance the _FrozenDatasets public API to provide more comprehensive functionality, including the ability to iterate over datasets, access detailed metadata, and use methods like get_by_name() for flexible dataset retrieval.
- Increase awareness of the _FrozenDatasets API among users through tutorials and documentation updates; highlight the capabilities of the public API and provide guidance on using it effectively for dataset management and retrieval.
- Consider allowing DataCatalog modifications and getting rid of _FrozenDatasets; this is a broader question related to another issue that will be linked later.

Pain points:

- It is unclear how to use _FrozenDatasets: the class name starts with an underscore, so looping over catalog.datasets does not feel safe.
- It is not easy to iterate over all datasets: the public API does not allow it, so you have to iterate over names and use the private _get_dataset() method.
- With _FrozenDatasets you can only access datasets as attributes, not via a get_by_name() method.
- The public API is limited to searching by name, save, and load, while access to more detailed metadata is not available.
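The iteration workaround users describe looks like this (both calls exist in current Kedro; _get_dataset() is the private method the research refers to):

```python
# The public API exposes only names; getting the objects needs the private method:
for name in catalog.list():               # public: dataset names
    dataset = catalog._get_dataset(name)  # private: the dataset object itself
    print(name, type(dataset).__name__)
```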
7. Pretty printing

Insight:

Compiling the catalog at runtime hinders users' ability to assess its structure and contents effectively. They express the need for an improved visual representation of the catalog when printing.

Action:

- Explore the feasibility of a dedicated function to compile the catalog.
- Implement a "pretty printing" function specifically tailored to improve the visual representation of the catalog when printed or displayed.

Pain points:

- Cannot compile the catalog and showcase the result, as compilation happens at runtime.

Feature requests:

- A nicer representation when printing the catalog.
- A catalog pretty-printing function.
- A function to compile the catalog and showcase the result.
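A minimal sketch of the kind of output users are after, built on what exists today (the helper name is hypothetical, and note it still leans on the private method from insight 3):

```python
def describe_catalog(catalog):
    """Hypothetical helper: print one line per dataset with its name and type."""
    for name in sorted(catalog.list()):
        dataset = catalog._get_dataset(name)  # still the private API (see insight 3)
        print(f"{name:<40} {type(dataset).__name__}")
```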
8. Autocompletion support for accessing datasets

Insight:

Users express the need for autocomplete functionality when accessing datasets in the catalog.

Feature requests:

- Implement autocompletion support for accessing datasets in the catalog, enabling users to receive suggestions for dataset names as they type.
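One way such autocompletion could be supported is by implementing __dir__ on the datasets view, which is what notebook shells consult for suggestions. This is a hypothetical sketch, not Kedro code:

```python
class DatasetsView:
    """Hypothetical attribute-style view over a mapping of dataset name -> object."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def __getattr__(self, name):
        try:
            return self._datasets[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Listing dataset names here is what lets IPython/Jupyter tab-complete
        # expressions like `view.rev<TAB>` into `view.reviews`.
        return sorted(self._datasets)
```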


ElenaKhaustova commented 3 weeks ago

Suggested prioritisation

The prioritisation below is based on a priority matrix, where we aim for tickets in the top-left quadrant: high user impact and low implementation effort.

Additionally, we suggest splitting all the tickets into three categories (within each category, tickets are sorted by high importance and low effort):

  1. Tickets that require low or medium effort and can be implemented for the current DataCatalog version without introducing any breaking changes.
  2. Tickets at all effort/impact levels that should be implemented for the new catalog version, DataCatalog 2.0. These tickets require significant catalog redesign steps before implementation.
  3. Tickets, mostly high-effort, that did not fall into the previous category because they touch on significant conceptual changes or new features requiring a decision on whether we want to implement them.
datajoely commented 3 weeks ago

In section 3, I have two ambitious wishes inspired by Dagster: