kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Design `DataCatalog2.0` #3995

Open merelcht opened 2 weeks ago

merelcht commented 2 weeks ago

Description

The current DataCatalog in Kedro has served its purpose well but has limitations and areas for improvement identified through user research: https://github.com/kedro-org/kedro/issues/3934

As a result of the DataCatalog user research interviews, we have created a list of tickets and split them into 3 categories:

  1. Tickets that require low or medium effort and can be implemented for the current DataCatalog version without introducing any breaking changes. (partially addressed)
  2. Tickets with all effort/impact levels that should be implemented for the new catalog version, DataCatalog2.0. These tickets require significant catalog redesign steps before implementation.
  3. Tickets, mostly with a high effort level, that didn't fall into the previous category because they involve significant conceptual changes or new features that require a decision on whether we want to implement them.

Addressing the issues in categories 2 and 3 requires significant changes and the introduction of new features and concepts that go beyond the scope of incremental updates.

The objective is to design a new, robust, and modular DataCatalog2.0 (a better name is welcome) that incorporates feedback from the community, follows best practices, and integrates new features seamlessly.

While redesigning, we plan for a smooth migration from the current DataCatalog to DataCatalog2.0, minimizing disruption for existing users.

Context

Suggested prioritisation and tickets opened: https://github.com/kedro-org/kedro/issues/3934#issuecomment-2153342972

Related topics

https://github.com/kedro-org/kedro-starters/tree/main/standalone-datacatalog
https://github.com/kedro-org/kedro/issues/2901
https://github.com/kedro-org/kedro/issues/2741

Next steps

  1. Create a detailed diagram of the current DataCatalog architecture.
  2. Identify components, their interactions, and existing data flows.
  3. Highlight pain points and limitations identified during user research.
  1. Research and document how similar projects (such as Unity Catalog, Polaris, dlthub, and others?) address similar tasks and challenges.
  2. Identify features, implementations, and best practices from these projects and map them to the features requested and insights obtained during user research.
  3. Analyze their architectural approaches and note down pros and cons.

datajoely commented 2 weeks ago

One push here, and it's already addressed somewhat in your excellent write-up @merelcht, is that we should look to delegate / integrate as much as we can.

The data catalog was the first part of Kedro ever built. It was a leap forward for us in 2017, but the industry has matured so much in that time. We should always provide an accessible, novice user UX... but I think now is the time for interoperability.

ElenaKhaustova commented 1 week ago

DataCatalog2.0 design board: https://miro.com/app/board/uXjVKy77OMo=/?share_link_id=278933784087

datajoely commented 1 week ago

Love the board @ElenaKhaustova! There is an argument that we should have all of those in PUML on the docs too 🤔 a bit like #4013

ElenaKhaustova commented 5 days ago

Value Proposition and Goals for DataCatalog Redesign

The current DataCatalog implementation effectively addresses basic tasks such as preventing users from hardcoding data sources and standardising I/O operations. However, our goal is to enhance and extend its capabilities to reduce maintenance burdens, integrate seamlessly with enterprise solutions, and leverage the strengths of existing tools like DeltaLake and UnityCatalog. By keeping the datasets-based solution while integrating with UnityCatalog, we aim to cover a broader range of use cases and improve flexibility and functionality. With this redesign work, we aim to:

Strategy and Implementation Plan

Develop a flexible API and specific data catalogs, allowing switching between different catalog implementations and utilising their features while keeping the current dataset-based solution with adjustments.

  1. Design AbstractDataCatalog. Implement an AbstractDataCatalog with common methods exposed to the framework: load(), save(), confirm(), release(), exists(), shallow_copy(), and __contains__() (subject to change?).
  2. Create specific implementations for KedroDataCatalog and UnityDataCatalog. Common methods will be exposed to the framework. Methods not supported by a specific catalog will raise a "Not supported for this type of catalog" exception. KedroDataCatalog will keep the current dataset-based solution with the updates needed to address the pain points identified in the user research interviews.
  3. Define key features for integration. Learn the UnityCatalog API and how it aligns with its open-sourced part. Identify the top UnityCatalog features that will provide the most value for integration in a POC.
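
To make the plan above more concrete, here is a rough Python sketch of what the abstract interface and the "not supported" behaviour could look like. It is only an illustration under assumptions, not the agreed design: the method names come from the list above, but the signatures, the `CatalogOperationNotSupportedError` exception, and the two toy classes (`ToyKedroDataCatalog`, `ToyReadOnlyCatalog`) are hypothetical, and nothing here calls the real Kedro or Unity Catalog APIs.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class CatalogOperationNotSupportedError(NotImplementedError):
    """Hypothetical exception for methods a given catalog does not support."""


class AbstractDataCatalog(ABC):
    """Common methods exposed to the framework (signatures are assumptions)."""

    @abstractmethod
    def load(self, name: str) -> Any: ...

    @abstractmethod
    def save(self, name: str, data: Any) -> None: ...

    @abstractmethod
    def exists(self, name: str) -> bool: ...

    @abstractmethod
    def __contains__(self, name: str) -> bool: ...

    # Optional methods: the default is to report that they are unsupported.
    def confirm(self, name: str) -> None:
        raise self._not_supported("confirm")

    def release(self, name: str) -> None:
        raise self._not_supported("release")

    def shallow_copy(self) -> "AbstractDataCatalog":
        raise self._not_supported("shallow_copy")

    def _not_supported(self, method: str) -> CatalogOperationNotSupportedError:
        return CatalogOperationNotSupportedError(
            f"{method}(): Not supported for this type of catalog "
            f"({type(self).__name__})"
        )


class ToyKedroDataCatalog(AbstractDataCatalog):
    """In-memory stand-in for the dataset-based catalog."""

    def __init__(self, data: Optional[Dict[str, Any]] = None) -> None:
        self._data: Dict[str, Any] = dict(data or {})

    def load(self, name: str) -> Any:
        return self._data[name]

    def save(self, name: str, data: Any) -> None:
        self._data[name] = data

    def exists(self, name: str) -> bool:
        return name in self._data

    def __contains__(self, name: str) -> bool:
        return name in self._data

    def release(self, name: str) -> None:
        pass  # nothing is cached in this toy version

    def shallow_copy(self) -> "ToyKedroDataCatalog":
        return ToyKedroDataCatalog(self._data)


class ToyReadOnlyCatalog(AbstractDataCatalog):
    """Read-only stand-in for an external catalog; save() is unsupported."""

    def __init__(self, tables: Dict[str, Any]) -> None:
        self._tables = dict(tables)

    def load(self, name: str) -> Any:
        return self._tables[name]

    def save(self, name: str, data: Any) -> None:
        raise self._not_supported("save")

    def exists(self, name: str) -> bool:
        return name in self._tables

    def __contains__(self, name: str) -> bool:
        return name in self._tables
```

With something like this, framework code would depend only on `AbstractDataCatalog`, so a runner could be handed either implementation and would only need to handle the "not supported" exception for the optional methods.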