merelcht opened this issue 4 months ago
One push here - and it's already somewhat addressed in your excellent write-up @merelcht - is that we should look to delegate / integrate as much as we can.
The data catalog was the first part of Kedro ever built; it was a leap forward for us in 2017, but the industry has matured so much in that time. We should always provide an accessible, novice-user UX... but I think now is the time for interoperability.
DataCatalog2.0
design board: https://miro.com/app/board/uXjVKy77OMo=/?share_link_id=278933784087
Love the board @ElenaKhaustova! There is an argument we should have all of those in PUML in the docs too 🤔 a bit like #4013
The current DataCatalog implementation effectively addresses basic tasks such as preventing users from hardcoding data sources and standardising I/O operations. However, our goal is to enhance and extend its capabilities to reduce maintenance burdens, integrate seamlessly with enterprise solutions, and leverage the strengths of existing tools like DeltaLake and UnityCatalog. By keeping the datasets-based solution while integrating with UnityCatalog, we aim to cover a broader range of use cases and improve flexibility and functionality. With this redesign work, we aim to:
- Develop a flexible API and specific data catalogs, allowing switching between different catalog implementations and utilising their features, while keeping the current dataset-based solution with adjustments.
AbstractDataCatalog
- Implement an AbstractDataCatalog with common methods exposed to the framework: load(), save(), confirm(), release(), exists(), shallow_copy(), and __contains__() (subject to change?).
- Implement KedroDataCatalog and UnityDataCatalog.
Common methods will be exposed to the framework. Methods not supported by a specific catalog will raise a "Not supported for this type of catalog" exception. KedroDataCatalog will keep the current dataset-based solution, with the updates needed to address the pain points identified in the user research interviews.
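To make the shape of this interface concrete, here is a rough sketch of what such a base class could look like. The method names come from the list above; everything else, including the exception type, is illustrative and not a final API (shallow_copy() is left out on purpose, since it is questioned further down the thread):

```python
from abc import ABC, abstractmethod
from typing import Any


class CatalogOperationNotSupportedError(Exception):
    """Raised when a concrete catalog does not support an operation."""


class AbstractDataCatalog(ABC):
    """Common interface exposed to the framework; concrete catalogs
    (e.g. a Kedro- or Unity-backed one) implement what they can and
    raise for the rest."""

    @abstractmethod
    def load(self, name: str) -> Any:
        """Load and return the data for the named dataset."""

    @abstractmethod
    def save(self, name: str, data: Any) -> None:
        """Save data under the named dataset."""

    @abstractmethod
    def __contains__(self, name: str) -> bool:
        """Return True if the catalog knows about the named dataset."""

    def confirm(self, name: str) -> None:
        raise CatalogOperationNotSupportedError(
            f"'confirm' is not supported for {type(self).__name__}"
        )

    def release(self, name: str) -> None:
        raise CatalogOperationNotSupportedError(
            f"'release' is not supported for {type(self).__name__}"
        )

    def exists(self, name: str) -> bool:
        raise CatalogOperationNotSupportedError(
            f"'exists' is not supported for {type(self).__name__}"
        )
```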
Unity Catalog Integration Options:
- Local workflow after integration: UnityDataCatalog, which encapsulates the open-source Unity Catalog API hosted locally.
- Remote workflow after integration: UnityDataCatalog, encapsulating the Unity Catalog API hosted on Databricks.
After evaluating Unity Catalog (open source) and Unity Catalog (Databricks) and their APIs, we recommend starting the integration using the Databricks Python SDK via Databricks notebooks.
Reasons for Recommendation:
- UnityDataCatalog can manage schemas, tables, volumes, and models. It will be used as a wrapper to align with the Kedro API and the "datasets" concept.

Providing some context for "What is UnityCatalog?", as I personally find their docs very confusing. I think the main differentiation is more of an enterprise focus on governance/access control.
Unity Catalog is a unified governance solution for data and AI assets on Databricks. It is not just a metastore or data connector, but rather a comprehensive system that includes several key components:
- Metastore: Unity Catalog uses a metastore as the top-level container for metadata about data assets and permissions. This metastore is similar to but more advanced than traditional Hive metastores.
That's a large part of the RESTful API described above: it stores metadata, and connectors consume the metadata and act almost like a dblink.
- Three-level namespace: Unity Catalog organizes data assets using a three-level namespace hierarchy: catalog > schema (database) > table/view/volume. This allows for better organization and governance of data assets.
- Access control: It provides centralized access control across Databricks workspaces, allowing administrators to define data access policies in one place.
Main feature for access management.
- Auditing and lineage: Unity Catalog automatically captures user-level audit logs and data lineage information.
I haven't seen too much about this
- Data discovery: It includes features for tagging, documenting, and searching data assets to help users find the data they need.
Metastore + UI as a shop for data sources.
- Support for multiple object types: Unity Catalog can manage metadata for tables, views, volumes, models, and functions.
- Flexible storage options: It allows configuring storage locations at the metastore, catalog, or schema level to meet various data storage requirements.
A bit similar to what fsspec does for Kedro, but it expands beyond remote storage (Volumes). They also have models and tables.
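To make the "wrapper" idea concrete: a UnityDataCatalog-backed dataset could simply point at a three-level table name and delegate the actual I/O to Spark, which already knows how to talk to Unity Catalog. A minimal sketch under that assumption (the class and its parameters are made up for illustration):

```python
from pyspark.sql import DataFrame, SparkSession


class UnityTableDataset:
    """Illustrative dataset reading/writing a Unity Catalog table
    through Spark's three-level namespace: catalog.schema.table."""

    def __init__(self, table: str) -> None:
        self._table = table  # e.g. "main.sales.orders"

    def load(self) -> DataFrame:
        spark = SparkSession.builder.getOrCreate()
        # Unity Catalog resolves *where* the table lives; Spark does the reading.
        return spark.table(self._table)

    def save(self, data: DataFrame) -> None:
        data.write.mode("overwrite").saveAsTable(self._table)
```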
With that in mind, my questions are:
- What are the benefits of integrating with UnityCatalog? Do Kedro users need to interact with UnityCatalog? Databricks users today can already use UnityCatalog without any integration. As Spark communicates with the catalog in the background, end users just declare the table name.
- In the case of using pandas with UnityCatalog, what does it do exactly, or is it merely reading data from some kind of remote storage? I can't find any meaningful example beyond Spark/Delta in their docs.
- The UnityCatalog and DataCatalog abstractions are on different levels: Kedro's DataCatalog is mainly a data connector (i.e. how to load a parquet file), while UnityCatalog answers questions like "where is table a.b.c?" (-> s3/some_directory/abc.parquet); it doesn't have information about how a data source should be consumed. With that in mind, does it make sense to focus on a subset / Databricks-native workflow (Spark/Delta/pandas)? (A rough illustration of this difference follows below.)

I also wonder: should we put all the focus on UnityCatalog/other catalogs, or should it be more around API changes for a better interactive use case (i.e. using the DataCatalog in a notebook)?
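To illustrate the difference in abstraction levels with the current API (assuming kedro-datasets provides pandas.ParquetDataset): a DataCatalog entry carries the "how" (dataset type, load/save arguments), whereas a Unity Catalog name only answers the "where":

```python
from kedro.io import DataCatalog

# Kedro's DataCatalog: the entry itself says *how* to read the data.
catalog = DataCatalog.from_config(
    {
        "orders": {
            "type": "pandas.ParquetDataset",  # connector/engine to use
            "filepath": "s3://bucket/some_directory/abc.parquet",
        }
    }
)
# df = catalog.load("orders")  # would use the declared connector

# Unity Catalog: only *where* the data is registered; how to consume it is
# left to the engine, e.g. Spark resolving the three-level name:
#   spark.table("a.b.c")
```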
For me, the value is not particularly in integrating with Unity Catalog but in exploring ways to extend the current DataCatalog concept with new approaches. Currently, maintaining custom data connectors introduces additional overhead and complexity. By leveraging existing mechanisms like Unity Catalog, we can potentially simplify data access and reduce maintenance burdens. This doesn't mean we plan to completely move to this new approach; rather, we want to test its feasibility and value.
The Unity Catalog and DataCatalog abstractions indeed operate at different levels. However, Unity Catalog includes abstractions such as tables, volumes, and models, which can be aligned with Kedro's DataCatalog abstractions. Our goal with the PoC is to verify this alignment. Unity Catalog, particularly when accessed via the Python SDK in Databricks notebooks (see the above summary), provides a comprehensive set of features (tables, models, volumes) and a UI, making it a suitable candidate for our experiment. Given that many clients use Databricks for development and deployment, it makes sense to start here. Based on the PoC results, we can decide whether to pursue further integration with Unity Catalog, consider other catalogs, or stick with our current solution.
Our main focus remains on improving Kedro's DataCatalog based on insights from user research and interviews.
Thanks a lot for the extensive research on the Unity Catalog @ElenaKhaustova 🙏🏼
My only point is that we should not make anything that's specific to Databricks Unity Catalog (since it's a commercial system), and it's a bit too early to understand how the different dataframe libraries and compute engines will interact with such metastores: https://github.com/unitycatalog/unitycatalog/discussions/208#discussioncomment-10208766. At least, now that Polaris has just been open sourced (https://github.com/polaris-catalog/polaris/pull/2), we know that the Apache Iceberg REST API "won", so if anything we should take that REST API as the reference.
More questions from my side:
- Have you thought about a CatalogProtocol rather than an AbstractCatalog?
- I see a _datasets property in your summary https://github.com/kedro-org/kedro/issues/3995#issuecomment-2258346648, but don't we want to offer a public, documented way of iterating through the datasets?
- What is shallow_copy for?

Thank you, @astrojuanlu!
I fully agree with your points about not tying ourselves to specific catalogs. The truth is that we are still determining whether we want to integrate with UnityCatalog/Polaris or something else. The answer might change depending on how they develop in the near future. That's why we suggest focusing on improving Kedro's DataCatalog, a solution shaped by insights from user research and interviews, and treating the integration part as research, with a PoC as the target result.
In order to work on those two goals in parallel, we plan to start by moving shared logic to the AbstractDataCatalog and implementing KedroDataCatalog with the following improvements: https://github.com/kedro-org/kedro/issues/3925, https://github.com/kedro-org/kedro/issues/3916, https://github.com/kedro-org/kedro/issues/3926, https://github.com/kedro-org/kedro/issues/3931. Once that's done, we'll be able to work on UnityDataCatalog and add more complex features from https://github.com/kedro-org/kedro/issues/3934, such as serialization/deserialization, to KedroDataCatalog.
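As a side note on what "serialization/deserialization" could mean in practice, here is a toy, purely illustrative sketch (none of these names are a committed API): the catalog keeps the plain config it was built from so it can be dumped back out and rebuilt.

```python
from typing import Any


class SerializableCatalogSketch:
    """Toy example: round-trip a catalog between object form and plain config."""

    def __init__(self, config: dict[str, dict[str, Any]]) -> None:
        # e.g. {"orders": {"type": "pandas.ParquetDataset", "filepath": "..."}}
        self._config = dict(config)

    def to_config(self) -> dict[str, dict[str, Any]]:
        return dict(self._config)

    @classmethod
    def from_config(cls, config: dict[str, dict[str, Any]]) -> "SerializableCatalogSketch":
        return cls(config)


catalog = SerializableCatalogSketch(
    {"orders": {"type": "pandas.ParquetDataset", "filepath": "orders.parquet"}}
)
assert SerializableCatalogSketch.from_config(catalog.to_config()).to_config() == catalog.to_config()
```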
Answering other questions:
- Regarding CatalogProtocol - we will keep it in mind, but I suggested AbstractCatalog as it seems there will be some shared logic, such as pattern resolution, which falls into the ABC concept. The PoC should show whether it's a good idea and what constraints we might get with the suggested approach.
- Regarding _datasets, shallow_copy and the other properties/methods I mentioned above - that's not the suggested interface, but the DataCatalog interface used within the session, runner and pipeline, meaning it will most likely remain in some form, as we do not plan to rewrite the whole framework. That doesn't mean we won't move from _datasets to datasets and so on, so it can change, but its primary purpose will probably remain the same.

I just want to say I love the direction this is going ❤️ , great work folks
We picked the following tickets: https://github.com/kedro-org/kedro/issues/3925, https://github.com/kedro-org/kedro/issues/3926, https://github.com/kedro-org/kedro/issues/3916 and https://github.com/kedro-org/kedro/issues/3931 as a starting point for the implementation of AbstractDataCatalog and KedroDataCatalog.
The following PRs include the drafts of AbstractDataCatalog, KedroDataCatalog and the updated CLI logic:
- AbstractDataCatalog and KedroDataCatalog, refactoring factory resolution logic and dataset access: https://github.com/kedro-org/kedro/pull/4070

The mentioned PRs include a draft of the following:
- AbstractDataCatalog and KedroDataCatalog(AbstractDataCatalog)
- AbstractDataCatalog now supports instantiation from configuration and/or datasets via the constructor
- AbstractDataCatalog stores the configuration provided
- Pattern resolution logic moved from _get_dataset() to resolve_patterns()
- _dataset_patterns and _default_patterns are now obtained from the config at __init__
- resolved_ds_configs property added to store resolved datasets' configurations at the catalog level
- add() method adds or replaces the dataset and its configuration
- add_feed_dict() renamed to add_from_dict()
- _runtime_patterns catalog field added to keep the logic of processing dataset/default/runtime patterns clear
- shallow_copy() method, used to add extra_dataset_patterns at runtime, was replaced with a dedicated add_runtime_patterns() method
- Removed _FrozenDatasets and access to datasets as properties; datasets can be replaced via add(replace=True)
- KedroDataCatalog is kept mutable:
  - the underlying datasets dictionary is not exposed publicly, so as not to encourage behaviour where users configure the catalog by modifying the datasets dictionary
  - the _datasets property remained protected, but a public datasets property was added, returning a deep copy of _datasets while the setter is still not allowed; the same applies to the _resolved_ds_configs property (see the sketch below)
  - datasets are added to _datasets via the catalog.add() method
- To keep AbstractDataCatalog compatible with the current runners' implementation, several methods - release(), confirm() and exists() - were kept as part of the interface, but they only have a meaningful implementation for KedroDataCatalog
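A minimal sketch of the mutability/access rules described above (names mirror the draft; the body is illustrative rather than the actual PR code):

```python
import copy
from typing import Any


class KedroDataCatalogSketch:
    """Datasets are only added via add(); the public `datasets` view is a
    deep copy, so mutating it cannot reconfigure the catalog."""

    def __init__(self) -> None:
        self._datasets: dict[str, Any] = {}

    def add(self, name: str, dataset: Any, replace: bool = False) -> None:
        if name in self._datasets and not replace:
            raise ValueError(f"Dataset '{name}' already exists; pass replace=True to overwrite it")
        self._datasets[name] = dataset

    @property
    def datasets(self) -> dict[str, Any]:
        # Read-only view: no setter, and changes to the copy don't touch the catalog.
        return copy.deepcopy(self._datasets)
```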
Some explanations behind the decisions made:
- The from_config() method was removed in favour of instantiation from config and/or datasets via the constructor

After a brief discussion of the changes made with @idanov and these features (https://github.com/kedro-org/kedro/issues/3935 and https://github.com/kedro-org/kedro/issues/3932), we would like to focus on the following topics:
catalog["dataset_name"]
gives the Dataset
object, but it can return the data instead as if we do catalog["dataset_name"].load()
. The last may simplify an interface by removing load()
and save()
methods.load
or save data
. We want to explore what should be the default behaviour. We consider adding a flat to enable/disable lazy loading and keep it disabled by default to avoid the case when the pipeline fails at the very end because some package is missing. However, we can consider automatically enabling it based on some events, such as pipeline slicing.DataCatalog
remains the same. With this approach, we stack one PR on top of another and will have to merge all of them at the end. Another approach is moving the changes proposed incrementally to the existing DataCatalog
, trying to make them non-breaking so users can try new features while the rest are in development.The following PR https://github.com/kedro-org/kedro/pull/4084 includes updates required to use both DataCatalog
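A toy illustration of the first two points, just to make the discussion concrete (hypothetical code, not a proposal for the final API):

```python
from typing import Any, Callable


class DictAccessCatalogSketch:
    """catalog["name"] returns the loaded data (i.e. it calls .load() for you),
    and a flag controls eager vs lazy materialisation of the dataset objects."""

    def __init__(self, factories: dict[str, Callable[[], Any]], lazy: bool = False) -> None:
        self._factories = factories
        self._lazy = lazy
        # Eager by default: instantiate everything up front so a missing
        # dependency fails at catalog creation, not at the end of a run.
        self._datasets: dict[str, Any] = (
            {} if lazy else {name: make() for name, make in factories.items()}
        )

    def __getitem__(self, name: str) -> Any:
        if name not in self._datasets:
            self._datasets[name] = self._factories[name]()  # lazy materialisation
        return self._datasets[name].load()  # behaves like catalog[name].load()
```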
The following PR, https://github.com/kedro-org/kedro/pull/4084, includes the updates required to use both DataCatalog and AbstractDataCatalog to run a Kedro project. It roughly shows the changes needed on the framework side to keep both versions working together.
We also tried the approach of moving the proposed changes incrementally into the existing DataCatalog, trying to make them non-breaking so users can try new features while the rest are in development. This involved moving the existing DataCatalog onto the AbstractDataCatalog so we could use catalogs inherited from AbstractDataCatalog in the framework. But since the catalog initialisation has changed significantly in the AbstractDataCatalog, it led to overloading AbstractDataCatalog with fields, parameters and methods needed only for DataCatalog, mixing both implementations. So it was decided not to go with this approach.
There's also an approach where we use the develop branch to replace DataCatalog with AbstractDataCatalog straight away and only merge to main when it's finished and we make a breaking release. The biggest drawback is that it will require more time until full completion, and users won't be able to use the new catalog during development.
Based on the above, the suggested approach going forward is:
- Implement KedroDataCatalog(AbstractDataCatalog) that covers the current DataCatalog functionality
- Add the new features to KedroDataCatalog
- Implement UnityDataCatalog (the expected result is a PoC with working spaceflights pipelines using UnityDataCatalog on Databricks) and, based on the results, decide on the AbstractDataCatalog interface
- Replace DataCatalog with KedroDataCatalog after some time, when we decide to do the breaking release

Other things to take into consideration:
- ParallelRunner expects datasets to be mutable; do we want to keep them mutable for ourselves but not for users?

Some reflections after the tech design, and thoughts on @deepyaman's concerns and suggestions from here:
Questions/suggestions raised from tech design, for posterity:
Bit of a philosophical question—if this is a data catalog redesign/2.0, why shy away from the significant changes, instead of taking this opportunity to make them?
Feels like, if not now, then when?
It feels a bit backwards to start with AbstractDataCatalog abstraction without having a fuller understanding of what a second or third data catalog really looks like; maybe it makes sense to PoC these other data catalogs, then create the unifying abstraction?
You don’t necessarily need to create the AbstractDataCatalog in order to address the other changes.
What I think would be extremely helpful is to start with the PoC Unity Catalog to make the value more concrete/clear to the rest of us, who don’t understand.🙂
Like you say, there are already users who are creating their own catalog implementations of sorts, or extending the catalog; would love to see a hack of the Unity Catalog with Kedro, and then can see (1) the value and (2) how this abstraction can be best designed to support it.
If anything, having a hacked in Iceberg Catalog + Polaris Catalog too will help show that the abstraction is really correct and solving the goal.
There’s also some challenges with starting this early on the abstraction, like already deciding we will have AbstractDataCatalog. In another thread, @astrojuanlu raises question, have thought about having DataCatalogProtocol? This would help allow for standalone data catalog (which is part of goals in group 3). AbstractDataCatalog will cause an issue, same as AbstractDataSet currently forces dependency on Kedro.
First of all, thank you for this summary - we appreciate that people are interested in this workstream and care about the results.
- For now we suggested AbstractDataCatalog rather than DataCatalogProtocol. We will double-check how we can benefit from the second one but, as noted before, the Unity Catalog PoC should answer this question as well.
- We don't see the choice between AbstractDataCatalog and DataCatalogProtocol as final. Having one of them does not mean that we won't switch to the other if we find it makes sense during the PoC. At the same time, we don't want to block the main workstream with integration experiments. And of course, we are not going to make any breaking changes until we are sure about the necessity of the abstraction and its API. We need some time to showcase more concrete suggestions on other catalogs' implementations, but we already know that they have to expose some specific interface to be compatible with the rest of the framework, as we are not going to rewrite the whole framework. From here we make an assumption about how the interface should look, but we admit it can change.
- Not sure I got the point about shying away from significant changes; we are suggesting them. If you mean some specific feature - "speak now or forever hold your peace" 🙂
https://github.com/kedro-org/kedro/issues/3941
As you mention, "There is sufficient user interest to justify making DataCatalog standalone."
At the very least, I would like to see the ability to create and use a DataCatalog that does not depend on Kedro; right now, this is not possible because of the DataCatalog subclass validator; AbstractDataCatalog further couples this.
Some follow-ups after the discussion with @astrojuanlu, @merelcht and @deepyaman:
At the very least, would like to see the ability to create and use a DataCatalog that does not depend on Kedro; right now, this is not possible, because of the DataCatalog subclass validator; AbstractDataCatalog further couples this.
1. We don't think the abstraction itself is a blocker for making DataCatalog a separate component; whether or not we have it, we will still be able to move it together with the implementation. The real problem is the dependency on kedro.io.core - https://github.com/kedro-org/kedro/blob/6c7a1cca9629d09a9051b0fd0a74c7c22ebd2f01/kedro/io/data_catalog.py#L18
There is also a different opinion on the idea of splitting kedro into a smaller set of libs here: https://github.com/kedro-org/kedro/issues/3659#issuecomment-2054167697
To sum up, we would like to keep this topic out of the discussion for now, as the decision about the abstraction doesn't directly relate to the problem and can be made later.
2. It only makes sense to use a Protocol instead of an Abstract base class if we move the pattern resolution logic out of the DataCatalog. Otherwise, to reuse the pattern logic we would have to explicitly declare that a certain class implements the protocol as a regular base class (https://peps.python.org/pep-0544/#explicitly-declaring-implementation) and we would lose the advantage of a Protocol (see the sketch below).
3. Moving the pattern resolution logic out of the DataCatalog will simplify the overall logic and implementation of DataCatalog. At the same time, it allows untying the catalog from the dataset configuration logic related to the framework. As a side effect, we can get a catalog with a more loosely coupled architecture that does not share extensive mandatory logic, so it will be easier to follow the Protocol concept and proceed with potential integrations.
4. We could also consider a Protocol for datasets to deal with the catalog dependency on kedro.io.core.
Given points 1, 2 and 3, we are going to:
- Keep the existing DataCatalog for now;
- Postpone the decision between an abstract base class and a Protocol;
- Focus on moving pattern resolution logic outside of DataCatalog.
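For reference, this is roughly what the Protocol option from PEP 544 looks like (illustrative signatures only): any class that happens to implement these methods satisfies the protocol structurally, without inheriting from anything in Kedro - which is also why it only pays off once shared logic such as pattern resolution no longer has to live in a base class.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CatalogProtocol(Protocol):
    """Structural interface: no Kedro base class required."""

    def load(self, name: str) -> Any: ...
    def save(self, name: str, data: Any) -> None: ...
    def exists(self, name: str) -> bool: ...
    def __contains__(self, name: str) -> bool: ...


class MyStandaloneCatalog:  # note: no base class at all
    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def load(self, name: str) -> Any:
        return self._data[name]

    def save(self, name: str, data: Any) -> None:
        self._data[name] = data

    def exists(self, name: str) -> bool:
        return name in self._data

    def __contains__(self, name: str) -> bool:
        return name in self._data


assert isinstance(MyStandaloneCatalog(), CatalogProtocol)  # structural check only
```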
The following ticket and PRs address this point from the above discussion:
- KedroDataCatalog and DataCatalogConfigResolver (see the sketch below for the general idea of the resolver)
- context, session, runners and project cli update
- catalog cli update
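As a rough illustration of what pulling pattern resolution into a separate component buys (a simplified sketch, not the actual DataCatalogConfigResolver code): resolving a dataset factory pattern is essentially string matching plus config templating, which has no reason to live inside the catalog class itself.

```python
from typing import Any, Optional

from parse import parse  # the parse library, which Kedro uses for dataset factory patterns


def resolve_pattern(name: str, patterns: dict[str, dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return a concrete dataset config for `name` from the first matching pattern
    (top-level string values only, for brevity)."""
    for pattern, template in patterns.items():
        result = parse(pattern, name)
        if result is None:
            continue
        return {
            key: value.format(**result.named) if isinstance(value, str) else value
            for key, value in template.items()
        }
    return None


patterns = {"{name}_data": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/{name}.csv"}}
print(resolve_pattern("reviews_data", patterns))
# {'type': 'pandas.CSVDataset', 'filepath': 'data/01_raw/reviews.csv'}
```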
Further steps suggested:
- Protocol abstraction for KedroDataCatalog;
- Merge 3995-data-catalog-2.0 to main to have two versions of the catalog, a non-breaking change;
- Further work on KedroDataCatalog.

After discussing the above with @merelcht and @idanov, it was decided to split the above work into a set of incremental changes, modifying the existing catalog class or extending the functionality by introducing an abstraction without breaking changes where possible. Then, plan the set of breaking changes and discuss them separately.
Some motivations behind the decision:
Long story short: we prefer a longer path of incremental changes to the existing catalog over dropping in a brand-new catalog.
"A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system." —John Gall (1975) Systemantics: How Systems Really Work and How They Fail p. 71
Since I believe the DataCatalog touches on so many ongoing conversations, I took a stab at implementing some ideas and publishing them here: https://github.com/astrojuanlu/kedro-catalog hoping that they serve as inspiration.
This is a prototype hacked in a rush, so it's not meant to be a full replacement of the current KedroCatalog. It tries to tackle several pain points highlighted in https://github.com/kedro-org/kedro/issues/3934 by starting from scratch. Some niceties:
The codebase is lean and makes heavy use of @dataclass and Pydantic models. I'm no software engineer so I'm not claiming it's well designed, but hopefully it's easy to understand (and therefore criticise).
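Not the actual kedro-catalog code, but to give a flavour of the dataclass/Pydantic style being referred to, a catalog entry can be just a small validated model with no behaviour attached:

```python
from pydantic import BaseModel


class DatasetSpec(BaseModel):
    """A declarative catalog entry: validated data, no I/O logic."""

    name: str
    type: str        # e.g. "pandas.ParquetDataset"
    filepath: str
    load_args: dict = {}
    save_args: dict = {}


spec = DatasetSpec(name="orders", type="pandas.ParquetDataset", filepath="s3://bucket/orders.parquet")
print(spec)
```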
Of course, it's tiny because it leaves lots of things out. It critically does not support:
So I guess my real question is:
Are we confident that the incremental strategy allows us to tackle all these user pain points in a timely fashion, while also maintaining backwards compatibility, including for features that we aren't sure we want to keep around?
I think the quote you shared is clearly hinting towards an answer - we already have a complex system, so unless we want to dismantle the whole complex functionality that Kedro offers, we'd be better off with incremental changes. Unless you are suggesting to redesign the whole of Kedro and go with Kedro 2.0, but I'd rather try to reach 1.0 first 😅
Nevertheless, the sketched out solution you've created definitely serves as nice inspiration and highlights some of the ideas already in circulation, namely employing protocols and dataclasses, which we should definitely drift towards. We should bear in mind that a lot can be achieved in non-breaking changes with a bit of creativity.
In fact, the path might actually end up being much shorter if we go the non-breaking road; it might just involve more frequent, smaller steps rather than a big jump, which would inevitably end up being followed by patch fixes, bug fixes and corner cases that we hadn't foreseen.
Are we confident that the incremental strategy allows us to tackle all these user pain points in a timely fashion, while also maintaining backwards compatibility, including for features that we aren't sure we want to keep around?
The short answer to this: yes.
The long answer: the incremental approach isn't a change in the implementation or the user pain points it will tackle, but in how we will deliver it. The current POC PRs tackle a lot all at once, which makes it hard to review and test properly. This will ultimately mean a delay in shipping and lower confidence that it works as expected. So, like @idanov says, this iterative approach will likely end up being shorter and allow us to deliver improvements bit by bit.
@ElenaKhaustova and I had another chat and the concrete next steps are:
- Refactor the pattern resolution logic in the existing DataCatalog #3925. This can then already be shipped if the time is right for a release.
- Introduce KedroDataCatalog (or whatever name we decide on), which also uses the resolution logic + addresses https://github.com/kedro-org/kedro/issues/3926, https://github.com/kedro-org/kedro/issues/3916 and https://github.com/kedro-org/kedro/issues/3931

Thank you, @astrojuanlu, for sharing your ideas and vision on the target for DataCatalog. I agree with most of them, and that's similar to what is planned. But we will try to do it incrementally, since there's an explicit push for that.
Remaining work on KedroDataCatalog:
- KedroDataCatalog documentation explaining how to use it: https://github.com/kedro-org/kedro/issues/4237
- develop branch

The next two should be done together, but we decided to postpone them for now:
- CatalogConfigResolver - move the credentials resolver out to the config component
- Currently, runners depend on both datasets and the catalog; we want all framework components to use the catalog abstraction to work with datasets, so the following refactoring is needed:
  - MemoryDataset and SharedMemoryDataset handling in the runners
  - release(), exists(), confirm() - whether they should be part of CatalogProtocol, and whether they will change with the runners refactoring

Now we have two ways to configure the catalog: from dataset configurations and from dataset objects. Since datasets do not store their configurations, there is no way to retrieve them from dataset objects at the catalog level. This blocks features like https://github.com/kedro-org/kedro/issues/3932. In future we will need to:
- Define a DatasetProtocol (see the sketch below);
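A sketch of what such a DatasetProtocol might require so that the catalog can recover configuration from dataset objects (illustrative only; in particular, to_config() is a hypothetical method, not an existing Kedro API):

```python
from typing import Any, Protocol


class DatasetProtocol(Protocol):
    """Minimal structural contract a dataset would need to offer the catalog."""

    def load(self) -> Any: ...
    def save(self, data: Any) -> None: ...
    def exists(self) -> bool: ...

    def to_config(self) -> dict[str, Any]:
        """Return the configuration this dataset was created from, so the
        catalog can serialise it (the missing piece blocking #3932 above)."""
        ...
```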
Description
The current DataCatalog in Kedro has served its purpose well, but it has limitations and areas for improvement identified through user research: https://github.com/kedro-org/kedro/issues/3934
As a result of the DataCatalog user research interviews, we have created a list of tickets and split them into 3 categories:
Addressing issues from 2. and 3. requires significant changes and the introduction of new features and concepts that go beyond the scope of incremental updates.
The objective is to design a new, robust, and modular DataCatalog2.0 (a better name is welcomed) that incorporates feedback from the community, follows best practices, and integrates new features seamlessly. While redesigning, we plan for a smooth migration from the current DataCatalog to DataCatalog2.0, minimizing disruption for existing users.
Context
Suggested prioritisation and tickets opened: https://github.com/kedro-org/kedro/issues/3934#issuecomment-2153342972
Related topics
https://github.com/kedro-org/kedro-starters/tree/main/standalone-datacatalog https://github.com/kedro-org/kedro/issues/2901 https://github.com/kedro-org/kedro/issues/2741
Next steps
- Review the current DataCatalog architecture.
- Research how other catalog solutions (Unity Catalog, Polaris, dlthub, other?) address similar tasks and challenges.
- Design DataCatalog2.0.