Open merelcht opened 2 weeks ago
One push here, and it's already addressed somewhat in your excellent write-up @merelcht, is that we should look to delegate / integrate as much as we can.
The data catalog was the first part of Kedro ever built. It was a leap forward for us in 2017, but the industry has matured so much in that time. We should always provide an accessible, novice user UX... but I think now is the time for interoperability.
`DataCatalog2.0` design board: https://miro.com/app/board/uXjVKy77OMo=/?share_link_id=278933784087
Love the board @ElenaKhaustova there is an argument we should have all of those in PUML on the docs too 🤔 bit like #4013
The current `DataCatalog` implementation effectively addresses basic tasks such as preventing users from hardcoding data sources and standardising I/O operations. However, our goal is to enhance and extend its capabilities to reduce maintenance burdens, integrate seamlessly with enterprise solutions, and leverage the strengths of existing tools like DeltaLake and UnityCatalog. By keeping the datasets-based solution while integrating with UnityCatalog, we aim to cover a broader range of use cases and improve flexibility and functionality. With this redesign work, we aim to:
Develop a flexible API and specific data catalogs, allowing switching between different catalog implementations and utilising their features while keeping the current dataset-based solution with adjustments.
AbstractDataCatalog
Implement an `AbstractDataCatalog` with common methods exposed to the framework: `load()`, `save()`, `confirm()`, `release()`, `exists()`, `shallow_copy()`, and `__contains__()` (subject to change?).
Specific catalogs: `KedroDataCatalog` and `UnityDataCatalog`.
Common methods will be exposed to the framework. Methods not supported by a specific catalog will raise a "Not supported for this type of catalog" exception.
`KedroDataCatalog` will keep the current dataset-based solution with the updates needed to address the pain points identified in the user research interviews.
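A minimal sketch of how the shared interface could look, assuming Python's `abc` module. Only `AbstractDataCatalog`, the method names, and the exception message come from the proposal above. The signatures, the `CatalogOperationNotSupportedError` class, and the `InMemoryCatalog` stand-in are hypothetical and are not part of Kedro.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Optional


class CatalogOperationNotSupportedError(NotImplementedError):
    """Hypothetical error type; the proposal only fixes the message text."""


class AbstractDataCatalog(ABC):
    """Common methods exposed to the framework (signatures are assumptions)."""

    def _not_supported(self, operation: str) -> None:
        # Message text taken from the proposal; the suffix is illustrative.
        raise CatalogOperationNotSupportedError(
            f"Not supported for this type of catalog: "
            f"{type(self).__name__}.{operation}()"
        )

    @abstractmethod
    def load(self, name: str) -> Any:
        """Return the data registered under `name`."""

    @abstractmethod
    def save(self, name: str, data: Any) -> None:
        """Store `data` under `name`."""

    @abstractmethod
    def exists(self, name: str) -> bool:
        """Report whether data for `name` is available."""

    @abstractmethod
    def __contains__(self, name: str) -> bool:
        """Support `name in catalog`."""

    # Optional operations: a specific catalog overrides what it supports.
    def confirm(self, name: str) -> None:
        self._not_supported("confirm")

    def release(self, name: str) -> None:
        self._not_supported("release")

    def shallow_copy(self) -> "AbstractDataCatalog":
        self._not_supported("shallow_copy")


class InMemoryCatalog(AbstractDataCatalog):
    """Hypothetical stand-in for a specific catalog, backed by a dict."""

    def __init__(self, data: Optional[Dict[str, Any]] = None) -> None:
        self._data: Dict[str, Any] = dict(data or {})

    def load(self, name: str) -> Any:
        return self._data[name]

    def save(self, name: str, data: Any) -> None:
        self._data[name] = data

    def exists(self, name: str) -> bool:
        return name in self._data

    def __contains__(self, name: str) -> bool:
        return name in self._data

    def shallow_copy(self) -> "InMemoryCatalog":
        # Copies the name-to-data mapping, not the stored objects.
        return InMemoryCatalog(self._data)
```

With this shape, framework code depends only on `AbstractDataCatalog`, so swapping one catalog implementation for another does not change the calling code. An unsupported call such as `InMemoryCatalog().confirm("cars")` fails with the "Not supported for this type of catalog" message rather than silently doing nothing.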
Description
The current `DataCatalog` in Kedro has served its purpose well but has limitations and areas for improvement identified through user research: https://github.com/kedro-org/kedro/issues/3934
As a result of the `DataCatalog` user research interviews, we have created a list of tickets and split them into 3 categories:
Addressing issues from 2. and 3. requires significant changes and the introduction of new features and concepts that go beyond the scope of incremental updates.
The objective is to design a new, robust, and modular `DataCatalog2.0` (a better name is welcome) that incorporates feedback from the community, follows best practices, and integrates new features seamlessly.
While redesigning, we plan for a smooth migration from the current `DataCatalog` to `DataCatalog2.0`, minimizing disruption for existing users.
Context
Suggested prioritisation and tickets opened: https://github.com/kedro-org/kedro/issues/3934#issuecomment-2153342972
Related topics
https://github.com/kedro-org/kedro-starters/tree/main/standalone-datacatalog
https://github.com/kedro-org/kedro/issues/2901
https://github.com/kedro-org/kedro/issues/2741
Next steps
… `DataCatalog` architecture.
… (`Unity Catalog`, `Polaris`, `dlthub`, other?) address similar tasks and challenges.
… `DataCatalog2.0`.