kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Design `DataCatalog2.0` #3995

Open merelcht opened 4 months ago

merelcht commented 4 months ago

Description

The current DataCatalog in Kedro has served its purpose well but has limitations and areas for improvement identified through user research: https://github.com/kedro-org/kedro/issues/3934

As a result of the DataCatalog user research interviews, we created a list of tickets and split them into three categories:

  1. Tickets that require low or medium effort and can be implemented for the current DataCatalog version without introducing any breaking changes (partially addressed).
  2. Tickets of all effort/impact levels that should be implemented for the new catalog version, DataCatalog2.0. These tickets require significant catalog redesign steps before implementation.
  3. Tickets, mostly with a high effort level, that didn't fall into the previous category because they touch on significant conceptual changes or new features that require a decision on whether we want to implement them.

Addressing issues from 2. and 3. requires significant changes and the introduction of new features and concepts that go beyond the scope of incremental updates.

The objective is to design a new, robust, and modular DataCatalog2.0 (a better name is welcome) that incorporates feedback from the community, follows best practices, and integrates new features seamlessly.

While redesigning, we plan for a smooth migration from the current DataCatalog to DataCatalog2.0, minimising disruption for existing users.

Context

Suggested prioritisation and tickets opened: https://github.com/kedro-org/kedro/issues/3934#issuecomment-2153342972

Related topics

https://github.com/kedro-org/kedro-starters/tree/main/standalone-datacatalog
https://github.com/kedro-org/kedro/issues/2901
https://github.com/kedro-org/kedro/issues/2741

Next steps

  1. Create a detailed diagram of the current DataCatalog architecture.
  2. Identify components, their interactions, and existing data flows.
  3. Highlight pain points and limitations identified during user research.
  4. Research and document how similar projects (such as Unity Catalog, Polaris, dlthub, and others) address similar tasks and challenges.
  5. Identify features, implementations, and best practices from these projects and map them to the features requested and insights obtained during user research.
  6. Analyze their architectural approaches and note down pros and cons.
datajoely commented 4 months ago

One push here - and it's already addressed somewhat in your excellent write-up @merelcht - is the fact that we should look to delegate / integrate as much as we can.

The data catalog was the first part of Kedro ever built; it was a leap forward for us in 2017, but the industry has matured so much since then. We should always provide an accessible, novice-friendly UX... but I think now is the time for interoperability.

ElenaKhaustova commented 4 months ago

DataCatalog2.0 design board: https://miro.com/app/board/uXjVKy77OMo=/?share_link_id=278933784087

datajoely commented 4 months ago

Love the board @ElenaKhaustova! There is an argument we should have all of those in PUML in the docs too 🤔 a bit like #4013

ElenaKhaustova commented 3 months ago

Value Proposition and Goals for DataCatalog Redesign

The current DataCatalog implementation effectively addresses basic tasks such as preventing users from hardcoding data sources and standardising I/O operations. However, our goal is to enhance and extend its capabilities to reduce maintenance burdens, integrate seamlessly with enterprise solutions, and leverage the strengths of existing tools like DeltaLake and UnityCatalog. By keeping the datasets-based solution while integrating with UnityCatalog, we aim to cover a broader range of use cases and improve flexibility and functionality. With this redesign work, we aim to:

Strategy and Implementation Plan

Develop a flexible API and specific data catalogs, allowing switching between different catalog implementations and utilising their feature while keeping the current dataset-based solution with adjustments.

  1. Design AbstractDataCatalog: implement an AbstractDataCatalog with the common methods exposed to the framework: load(), save(), confirm(), release(), exists(), shallow_copy(), and __contains__() (subject to change); a sketch of such an interface follows this list.
  2. Create specific implementations KedroDataCatalog and UnityDataCatalog. The common methods will be exposed to the framework; methods not supported by a specific catalog will raise a "Not supported for this type of catalog" exception. KedroDataCatalog will keep the current dataset-based solution, with the updates needed to address the pain points identified in the user research interviews.
  3. Define key features for integration: learn the UnityCatalog API and how it aligns with its open-source part, and identify the top UnityCatalog features that would provide the most value for a PoC integration.
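For illustration only, here is a minimal sketch of what such an AbstractDataCatalog interface could look like, based on the method list above. The names, signatures and default behaviour are assumptions at this stage, not a final API:

```python
from abc import ABC, abstractmethod
from typing import Any


class AbstractDataCatalog(ABC):
    """Hypothetical common interface exposed to the framework (subject to change)."""

    @abstractmethod
    def load(self, name: str) -> Any:
        """Load and return the data registered under ``name``."""

    @abstractmethod
    def save(self, name: str, data: Any) -> None:
        """Save ``data`` under ``name``."""

    @abstractmethod
    def exists(self, name: str) -> bool:
        """Check whether data for ``name`` already exists."""

    @abstractmethod
    def __contains__(self, name: str) -> bool:
        """Check whether the catalog knows about ``name``."""

    def confirm(self, name: str) -> None:
        """Confirm a dataset; catalogs without this notion raise."""
        raise NotImplementedError("Not supported for this type of catalog")

    def release(self, name: str) -> None:
        """Release any cached data held for ``name``; optional for some catalogs."""
        raise NotImplementedError("Not supported for this type of catalog")

    def shallow_copy(self) -> "AbstractDataCatalog":
        """Return a shallow copy of the catalog (used by runners)."""
        raise NotImplementedError("Not supported for this type of catalog")
```

A KedroDataCatalog or UnityDataCatalog would then subclass this and override only the methods it can meaningfully support.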
ElenaKhaustova commented 3 months ago

Summary of Integration Proposal: Kedro DataCatalog with Unity Data Catalog

How We Plan to Integrate

Unity Catalog Integration Options:

  1. Unity Catalog (Open Source): An open-source version of Unity Catalog accessible via its API.
  2. Unity Catalog (Databricks): The Databricks-hosted version of Unity Catalog, offering enhanced features and accessible via multiple APIs:


Local Workflow After Integration:

Remote Workflow After Integration:

Recommendation to Start with Databricks Python SDK via Databricks Notebooks

After evaluating the Unity Catalog (open source) and Unity Catalog (Databricks) and their APIs, we recommend starting the integration using the Databricks Python SDK via Databricks notebooks.

Reasons for Recommendation:

  1. Limited Functionality of REST API and Databricks CLI: These interfaces provide table metadata but do not support content retrieval.
  2. Limited Functionality of Open Source Unity Catalog: The REST API for the open-source version is less capable compared to Databricks' REST API, with most examples available in Scala, SQL, or CLI. Elementary operations such as uploading/downloading data or model tracking lack clarity and support.
  3. Flexibility and Maturity of Python SDK:
    • The Python SDK running in Databricks notebooks provides comprehensive functionality for working with tables, volumes, and models.
    • The Python SDK used locally does not offer the same level of functionality (https://github.com/databricks/databricks-sdk-py).
    • It is more mature, reducing the risk of breaking changes, and covers most of the methods exposed by the current Kedro DataCatalog (a short notebook sketch follows this list).
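As a rough illustration only (not part of the proposal above): listing Unity Catalog table metadata with the Databricks Python SDK inside a notebook might look like the snippet below. It assumes the databricks-sdk package is available and credentials are picked up from the notebook context; the catalog and schema names are placeholders.

```python
# Sketch only: assumes this runs in a Databricks notebook where authentication
# is resolved automatically and the `databricks-sdk` package is installed.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List Unity Catalog table metadata for a (placeholder) catalog and schema.
for table in w.tables.list(catalog_name="main", schema_name="default"):
    print(table.full_name, table.table_type)
```

Note this only retrieves metadata; reading table contents would still go through Spark or another engine, which is exactly the limitation discussed below.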

Challenges with Integration

  1. Limitations of the REST API and Databricks CLI: these interfaces return metadata but not the actual content of tables.
    • Unity Catalog's REST API provides table information but lacks a mechanism to query table data directly.
    • There are ongoing discussions about adding such capabilities, but they are not yet available (source 1, source 2).
    • While the REST API allows uploading/downloading unstructured data as catalog volumes or via DBFS, loading tables requires additional mechanisms similar to the current data connectors, so the value of replacing the existing solution with Unity Catalog in this case is unclear.


  2. Limited functionality of the open-source Unity Catalog:


Integration via Platform SDK and Databricks Notebook


General concerns regarding integration

noklam commented 3 months ago

Providing some context for "What is Unity Catalog?", as I personally find their docs very confusing. I think the main differentiator is a stronger enterprise focus on governance/access control.

Unity Catalog is a unified governance solution for data and AI assets on Databricks. It is not just a metastore or data connector, but rather a comprehensive system that includes several key components:

  • Metastore: Unity Catalog uses a metastore as the top-level container for metadata about data assets and permissions. This metastore is similar to but more advanced than traditional Hive metastores.

That's a large part of the RESTful API described above: it stores metadata, and connectors consume that metadata and act almost like a dblink.

  • Three-level namespace: Unity Catalog organizes data assets using a three-level namespace hierarchy: catalog > schema (database) > table/view/volume. This allows for better organization and governance of data assets (see the short sketch after this list).
  • Access control: It provides centralized access control across Databricks workspaces, allowing administrators to define data access policies in one place.

Main feature for access management.

  • Auditing and lineage: Unity Catalog automatically captures user-level audit logs and data lineage information.

I haven't seen too much about this

  • Data discovery: It includes features for tagging, documenting, and searching data assets to help users find the data they need.

Metastore + UI as a shop for data source.

  • Support for multiple object types: Unity Catalog can manage metadata for tables, views, volumes, models, and functions.
  • Flexible storage options: It allows configuring storage locations at the metastore, catalog, or schema level to meet various data storage requirements.

A bit similar to what fsspec does for Kedro, but it expands beyond remote storage (volumes); they also have models and tables.
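To make the three-level namespace concrete, a tiny illustration (assuming a Databricks notebook where the spark session is provided by the runtime; the catalog/schema/table names are placeholders):

```python
# <catalog>.<schema>.<table> - Unity Catalog's three-level namespace.
# `spark` is predefined in a Databricks notebook; names below are placeholders.
orders = spark.table("main.sales.orders")                       # read a governed table
orders.write.mode("overwrite").saveAsTable("main.sales.orders_backup")  # write under the same namespace
```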

noklam commented 3 months ago

With that in mind, my questions are:

With that in mind, does it make sense to focus on a subset / Databricks-native workflow (Spark/Delta/pandas)?

I also wonder: should we put all the focus on UnityCatalog/other catalogs, or should it be more around API changes for a better interactive use case (i.e. using the DataCatalog in a notebook)?

ElenaKhaustova commented 3 months ago

With that in mind, my questions are:

  • What are the benefits of integrating with UnityCatalog?
  • Do Kedro users need to interact with UnityCatalog? Databricks users today can already use UnityCatalog without any integration: since Spark communicates with the catalog in the background, end users just declare the table name.

    • Question: when using pandas with UnityCatalog, what does it do exactly, or is it merely reading data from some kind of remote storage? I can't find any meaningful example beyond Spark/Delta in their docs.
    • The UnityCatalog and DataCatalog abstractions are on different levels: Kedro's DataCatalog is mainly a data connector (i.e. how to load a parquet file), while UnityCatalog answers questions like where table a.b.c lives (e.g. s3/some_directory/abc.parquet); it doesn't carry information about how a data source should be consumed.

With that in mind, does it make sense to focus on a subset / Databricks-native workflow (Spark/Delta/pandas)?

I also wonder: should we put all the focus on UnityCatalog/other catalogs, or should it be more around API changes for a better interactive use case (i.e. using the DataCatalog in a notebook)?

For me, the value is not particularly in integrating with Unity Catalog but in exploring ways to extend the current DataCatalog concept with new approaches. Currently, maintaining custom data connectors introduces additional overhead and complexity. By leveraging existing mechanisms like Unity Catalog, we can potentially simplify data access and reduce maintenance burdens. This doesn't mean we plan to completely move to this new approach; rather, we want to test its feasibility and value.

The Unity Catalog and DataCatalog abstractions indeed operate at different levels. However, Unity Catalog includes abstractions such as tables, volumes, and models, which can be aligned with Kedro's DataCatalog abstractions. Our goal with the PoC is to verify this alignment. Unity Catalog, particularly when accessed via the Python SDK in Databricks notebooks (see the above summary), provides a comprehensive set of features (tables, models, volumes) and a UI, making it a suitable candidate for our experiment. Given that many clients use Databricks for development and deployment, it makes sense to start here. Based on the PoC results, we can decide whether to pursue further integration with Unity Catalog, consider other catalogs, or stick with our current solution.

Our main focus remains on improving Kedro's DataCatalog based on insights from user research and interviews.

astrojuanlu commented 3 months ago

Thanks a lot for the extensive research on the Unity Catalog @ElenaKhaustova 🙏🏼

My only point is that we should not build anything that's specific to Databricks Unity Catalog (since it's a commercial system), and it's a bit too early to understand how the different dataframe libraries and compute engines will interact with such metastores (https://github.com/unitycatalog/unitycatalog/discussions/208#discussioncomment-10208766). At least, now that Polaris has just been open sourced (https://github.com/polaris-catalog/polaris/pull/2), we know that the Apache Iceberg REST API "won", so if anything we should take that REST API as the reference.

More questions from my side:

ElenaKhaustova commented 3 months ago

Thank you, @astrojuanlu!

I fully agree with your points about not tying ourselves to specific catalogs. The truth is that we are still determining whether we want to integrate with UnityCatalog/Polaris or something else. The answer might change depending on how they develop in the near future. That's why we suggest focusing on improving Kedro's DataCatalog, a solution shaped by insights from user research and interviews, and treating the integration part as research with a PoC as the target result.

In order to work on those two goals in parallel, we plan to start by moving shared logic to the AbstractDataCatalog and implementing KedroDataCatalog with the following improvements: https://github.com/kedro-org/kedro/issues/3925, https://github.com/kedro-org/kedro/issues/3916, https://github.com/kedro-org/kedro/issues/3926, https://github.com/kedro-org/kedro/issues/3931. Once that's done, we'll be able to work on UnityDataCatalog and add more complex features from https://github.com/kedro-org/kedro/issues/3934, such as serialization/deserialization, to KedroDataCatalog.

Answering other questions:

datajoely commented 3 months ago

I just want to say I love the direction this is going ❤️ , great work folks

ElenaKhaustova commented 3 months ago

We picked the following tickets: https://github.com/kedro-org/kedro/issues/3925, https://github.com/kedro-org/kedro/issues/3926, https://github.com/kedro-org/kedro/issues/3916 and https://github.com/kedro-org/kedro/issues/3931 as a starting point for the implementation of AbstractDataCatalog and KedroDataCatalog.

The following PRs include the drafts of AbstractDataCatalog, KedroDataCatalog and updated CLI logic.

The mentioned PRs include drafts of the following:

  1. Implement a draft of AbstractDataCatalog and KedroDataCatalog(AbstractDataCatalog)
  2. Rework the dataset pattern resolution logic:
  3. Rework the dataset access logic
  4. Make KedroDataCatalog mutable:
  5. To make AbstractDataCatalog compatible with the current runners' implementation, several methods (release(), confirm() and exists()) were kept as part of the interface, but they only have a meaningful implementation for KedroDataCatalog

Some explanations behind the decisions made:

  1. Splitting the resolution logic from dataset initialisation gives the flexibility to resolve configurations without creating datasets and is required for the lazy-loading feature (see the sketch after this list)
  2. We now store datasets' configurations for the to_dict()/from_dict() feature
  3. We removed the from_config() method to allow instantiation from config and/or datasets via the constructor
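A purely illustrative sketch of point 1 (all names here are hypothetical, not the actual implementation): keeping the resolved configuration separate from the dataset objects lets the catalog materialise datasets only on first use, while still exposing the configuration for a to_dict()-style feature.

```python
from typing import Any, Callable


class LazyDatasetStore:
    """Hypothetical illustration: keep resolved configs, instantiate datasets on demand."""

    def __init__(
        self,
        resolved_configs: dict[str, dict[str, Any]],
        factory: Callable[[dict[str, Any]], Any],
    ) -> None:
        self._configs = resolved_configs  # retained, so to_dict() remains possible
        self._factory = factory           # e.g. a dataset factory akin to AbstractDataset.from_config
        self._materialised: dict[str, Any] = {}

    def get(self, name: str) -> Any:
        # Instantiation (and any dataset-specific imports) happens only here,
        # so a missing optional dependency surfaces at first load/save,
        # not when the catalog is created.
        if name not in self._materialised:
            self._materialised[name] = self._factory(self._configs[name])
        return self._materialised[name]

    def to_config(self) -> dict[str, dict[str, Any]]:
        return self._configs
```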

After a brief discussion with @idanov about the changes made and these features: https://github.com/kedro-org/kedro/issues/3935 and https://github.com/kedro-org/kedro/issues/3932, we would like to focus on the following topics:

  1. Do we need the datasets property? After the changes, the value of the datasets property became unclear, so we want to consider removing it. The alternative discussed is to change the key-access interface to work with data rather than datasets: currently catalog["dataset_name"] returns the Dataset object, but it could return the data instead, as if we called catalog["dataset_name"].load(). The latter may simplify the interface by removing the load() and save() methods (see the sketch after this list).
  2. Datasets require corresponding changes to store their configuration, or the ability to recover it from the object, for a proper implementation of the to_dict()/from_dict() feature. Currently, if the catalog was instantiated from dataset objects, we do not know their configuration.
  3. Lazy loading. Since the resolution logic is now decoupled from the instantiation logic, we can postpone actual dataset instantiation and only do it when loading or saving data. We want to explore what the default behaviour should be. We are considering adding a flag to enable/disable lazy loading and keeping it disabled by default, to avoid the case where the pipeline fails at the very end because some package is missing. However, we could also consider automatically enabling it based on certain events, such as pipeline slicing.
  4. How do we want to apply the proposed changes? In the opened PRs, all the changes are implemented as separate components so that the old DataCatalog remains the same. With this approach, we stack one PR on top of another and will have to merge all of them at the end. Another approach is to move the proposed changes incrementally into the existing DataCatalog, trying to make them non-breaking so users can try new features while the rest are in development.
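As mentioned in point 1, here is a toy sketch of the alternative key-access interface (illustrative only; the class name and behaviour are assumptions rather than the proposed final API):

```python
from typing import Any


class DataAccessCatalog:
    """Toy example: __getitem__/__setitem__ work with data, not dataset objects."""

    def __init__(self, datasets: dict[str, Any]) -> None:
        self._datasets = datasets  # objects exposing load()/save()

    def __getitem__(self, name: str) -> Any:
        # Equivalent to catalog["name"].load() in the current interface, which
        # would make explicit load() calls unnecessary for most users.
        return self._datasets[name].load()

    def __setitem__(self, name: str, data: Any) -> None:
        self._datasets[name].save(data)


# Usage comparison:
#   option A (current, dataset object): df = catalog["reviews"].load()
#   option B (discussed, data directly): df = catalog["reviews"]
```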
ElenaKhaustova commented 3 months ago

The following PR, https://github.com/kedro-org/kedro/pull/4084, includes the updates required to use either DataCatalog or AbstractDataCatalog to run a Kedro project. It roughly shows the changes needed on the framework side to keep both versions working together.

We also tried the approach of moving the proposed changes incrementally into the existing DataCatalog, trying to make them non-breaking so users can try new features while the rest are in development. This involved moving the existing DataCatalog under AbstractDataCatalog so that catalogs inherited from AbstractDataCatalog could be used in the framework. But since catalog initialisation has changed significantly in AbstractDataCatalog, this led to overloading AbstractDataCatalog with fields, parameters and methods needed only for DataCatalog, mixing both implementations. So it was decided not to go with this approach.

There's also an approach where we use the develop branch to replace DataCatalog with AbstractDataCatalog straight away and only merge to main when it's finished, making a breaking release. The biggest drawback is that it will require more time until full completion, and users won't be able to use the new catalog during development.

Based on the above, the further suggested approach is:

  1. Keep both implementations together so that users can switch between them and use both; this requires some extra implementation effort but allows testing the new catalog while keeping it separate from the old one;
  2. Implement basic functionality for KedroDataCatalog(AbstractDataCatalog) that covers DataCatalog;
  3. Keep adding features to KedroDataCatalog;
  4. Run experiments with UnityDataCatalog (the expected result is a PoC with working spaceflights pipelines using UnityDataCatalog on Databricks) and, based on the results, decide on the AbstractDataCatalog interface;
  5. Completely switch to KedroDataCatalog after some time, when we decide to do the breaking release.

Other things to take into consideration:

  1. ParallelRunner expects datasets to be mutable; do we want to keep them mutable for ourselves but not for users?
  2. If we store dataset names on the dataset side, the data catalog interface becomes simpler when iterating through datasets and their names; one more point in favour of making datasets aware of their configuration and providing an API to retrieve it.
  3. This insight - https://github.com/kedro-org/kedro/issues/3929 - also shows the necessity of modifying datasets.
ElenaKhaustova commented 3 months ago

Some reflections after the tech design, and thoughts on @deepyaman's concerns and suggestions from here:

Questions/suggestions raised from tech design, for posterity:

Bit of a philosophical question—if this is a data catalog redesign/2.0, why shy away from the significant changes, instead of taking this opportunity to make them?

Feels like, if not now, then when?

It feels a bit backwards to start with AbstractDataCatalog abstraction without having a fuller understanding of what a second or third data catalog really looks like; maybe it makes sense to PoC these other data catalogs, then create the unifying abstraction?

You don’t necessarily need to create the AbstractDataCatalog in order to address the other changes.

What I think would be extremely helpful is to start with the Unity Catalog PoC to make the value more concrete/clear to the rest of us who don't yet understand it. 🙂

Like you say, there are already users who are creating their own catalog implementations of sorts, or extending the catalog; would love to see a hack of Unity Catalog with Kedro, and then we can see (1) the value and (2) how this abstraction can best be designed to support it.

If anything, having a hacked-in Iceberg Catalog + Polaris Catalog too would help show that the abstraction is really correct and solves the goal.

There are also some challenges with starting this early on the abstraction, like already deciding we will have AbstractDataCatalog. In another thread, @astrojuanlu raises the question: have we thought about having a DataCatalogProtocol? This would help allow for a standalone data catalog (which is part of the goals in group 3). AbstractDataCatalog will cause an issue, the same way AbstractDataSet currently forces a dependency on Kedro.

First of all, thank you for this summary - we appreciate that people are interested in this workstream and care about the results.

  1. We see the concern about the decision to use AbstractDataCatalog rather than DataCatalogProtocol. We will double-check how we can benefit from the latter, but as noted before, the Unity Catalog PoC should answer this question as well.
  2. The Unity Catalog PoC is one of the further steps for us, but we are not starting with it because the main goal is to address issues identified in the user research interviews, and other catalog integrations go beyond this goal. In order to work on those workstreams in parallel, it's convenient to have some abstraction that helps separate the different implementations, no matter whether it is AbstractDataCatalog or DataCatalogProtocol. Having one of them does not mean we won't switch to the other if we find that it makes sense during the PoC. At the same time, we don't want to block the main workstream with integration experiments. And of course, we are not going to make any breaking changes until we are sure about the abstraction's necessity and API. We need some time to showcase more concrete suggestions on other catalogs' implementations, but we already know that they have to expose a specific interface to be compatible with the rest of the framework, as we are not going to rewrite the whole framework. From there we make an assumption about how the interface should look, but we accept that it may change.
  3. Not sure I got the point about shying away from significant changes - we are suggesting them. If you mean some specific feature - "speak now or forever hold your peace" 🙂
deepyaman commented 3 months ago
  1. Not sure I got the point about shying away from significant changes - we are suggesting them. If you mean some specific feature - "speak now or forever hold your peace" 🙂

https://github.com/kedro-org/kedro/issues/3941

As you mention, "There is sufficient user interest to justify making DataCatalog standalone."

At the very least, would like to see the ability to create and use a DataCatalog that does not depend on Kedro; right now, this is not possible, because of the DataCatalog subclass validator; AbstractDataCatalog further couples this.

ElenaKhaustova commented 2 months ago

Some follow-ups after the discussion with @astrojuanlu, @merelcht and @deepyaman:

  1. At the very least, would like to see the ability to create and use a DataCatalog that does not depend on Kedro; right now, this is not possible, because of the DataCatalog subclass validator; AbstractDataCatalog further couples this.

We don't think the abstraction itself is a blocker for making DataCatalog a separate component; whether or not we have it, we will still be able to move it together with the implementation. The real problem is the dependency on kedro.io.core - https://github.com/kedro-org/kedro/blob/6c7a1cca9629d09a9051b0fd0a74c7c22ebd2f01/kedro/io/data_catalog.py#L18

There is also a different opinion on the idea of splitting Kedro into a smaller set of libraries here: https://github.com/kedro-org/kedro/issues/3659#issuecomment-2054167697

To sum up, we would like to keep this topic out of the discussion for now, as the decision about the abstraction doesn't directly relate to the problem and can be made later.

  2. We see the value of using a Protocol instead of an abstract base class only if we move the pattern resolution logic out of the DataCatalog. Otherwise, to reuse the pattern logic we would have to explicitly declare that a certain class implements a given protocol as a regular base class (https://peps.python.org/pep-0544/#explicitly-declaring-implementation), and we would lose the advantage of a Protocol (see the sketch after this list).
  3. Moving the pattern resolution logic out of the DataCatalog will simplify the overall logic and implementation of DataCatalog. At the same time, it decouples the catalog from the dataset configuration logic related to the framework. As a side effect, we get a catalog with a more loosely coupled architecture that does not share extensive mandatory logic, so it will be easier to follow the Protocol concept and proceed with potential integrations.
  4. We also think it makes sense to consider implementing a Protocol for datasets to deal with the catalog's dependency on kedro.io.core.
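For illustration, a minimal sketch of what a structural DataCatalogProtocol could look like (the name and method set are assumptions taken from the discussion above, not a committed design): any object with the right shape satisfies the protocol without inheriting from a Kedro base class.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DataCatalogProtocol(Protocol):
    """Hypothetical structural interface - no inheritance from Kedro required."""

    def load(self, name: str) -> Any: ...
    def save(self, name: str, data: Any) -> None: ...
    def exists(self, name: str) -> bool: ...
    def release(self, name: str) -> None: ...
    def __contains__(self, name: str) -> bool: ...


class InMemoryCatalog:
    """Satisfies DataCatalogProtocol purely by its shape."""

    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def load(self, name: str) -> Any:
        return self._data[name]

    def save(self, name: str, data: Any) -> None:
        self._data[name] = data

    def exists(self, name: str) -> bool:
        return name in self._data

    def release(self, name: str) -> None:
        self._data.pop(name, None)

    def __contains__(self, name: str) -> bool:
        return name in self._data


# Structural check: no subclassing of a Kedro base class is needed.
assert isinstance(InMemoryCatalog(), DataCatalogProtocol)
```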

Given the points above, we are going to:

ElenaKhaustova commented 2 months ago

Focus on moving pattern resolution logic outside of DataCatalog;

The following ticket and PRs address this point from the above discussion:

Further steps suggested:

  1. Discuss changes in the above PRs and make adjustments if required;
  2. Merge the above PRs to the 3995-data-catalog-2.0 branch;
  3. Move to Protocol abstraction for KedroDataCatalog;
  4. Work on unit/e2e test for 3995-data-catalog-2.0 branch;
  5. Merge 3995-data-catalog-2.0 to main to have two versions of catalog, non-breaking change;
  6. Keep working on the rest of the features for KedroDataCatalog;
  7. Work on integration PoCs;
ElenaKhaustova commented 2 months ago

After discussing the above with @merelcht and @idanov, it was decided to split the above work into a set of incremental changes, modifying the existing catalog class or extending its functionality by introducing an abstraction without breaking changes where possible. Then, we will plan the set of breaking changes and discuss them separately.

ElenaKhaustova commented 2 months ago

Some motivations behind the decision:

  1. We don't want to keep two implementations, as it pollutes the code with multiple if/else branches
  2. As the catalog is a core component, it's important for us to apply changes to the existing implementation where possible
  3. We want to re-design the refactored catalog so that it's easier to discuss changes
  4. We want to plan and discuss each breaking change and improvement separately in the corresponding PRs - the previous strategy was too heavy for reviewers to provide feedback on
  5. The incremental approach allows testing features and decreases the chance of significant bugs, which we consider more important than solving everything at once
  6. We want to merge small changes to main
  7. We want to focus on changing the implementation first and then changing the API

Long story short: we prefer the longer path of incremental changes to the existing catalog over shipping a brand-new catalog.

astrojuanlu commented 2 months ago

"A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system." —John Gall (1975) Systemantics: How Systems Really Work and How They Fail p. 71

Since I believe the DataCatalog touches on so many ongoing conversations, I took a stab at implementing some ideas and publishing them here:

https://github.com/astrojuanlu/kedro-catalog

hoping that they serve as inspiration.

This is a prototype hacked together in a rush, so it's not meant to be a full replacement of the current KedroCatalog. It tries to tackle several pain points highlighted in https://github.com/kedro-org/kedro/issues/3934 by starting from scratch. Some niceties:

The codebase is lean and makes heavy use of @dataclass and Pydantic models. I'm no software engineer so I'm not claiming it's well designed, but hopefully it's easy to understand (and therefore criticise).

Of course, it's tiny because it leaves a lot of things out. It critically does not support:

astrojuanlu commented 2 months ago

So I guess my real question is:

Are we confident that the incremental strategy allows us to tackle all these user pain points in a timely fashion, while also maintaining backwards compatibility, including for features that we aren't sure we want to keep around?

idanov commented 2 months ago

I think the quote you shared is clearly hinting towards an answer - we already have a complex system, so unless we want to dismantle the whole complex functionality that Kedro offers, we'd be better off with incremental changes. Unless you are suggesting to redesign the whole of Kedro and go with Kedro 2.0, but I'd rather try to reach 1.0 first 😅

Nevertheless, the sketched-out solution you've created definitely serves as nice inspiration and highlights some of the ideas already in circulation, namely employing protocols and dataclasses, which we should definitely drift towards. We should bear in mind that a lot can be achieved with non-breaking changes and a bit of creativity.

In fact, the path might actually end up being much shorter if we take the non-breaking road; it might just involve more frequent, smaller steps rather than one big jump, which would inevitably be followed by patch fixes, bug fixes and corner cases that we hadn't foreseen.

merelcht commented 2 months ago

Are we confident that the incremental strategy allows us to tackle all these user pain points in a timely fashion, while also meeting backwards compatibility including features that we aren't sure we want to keep around?

The short answer to this: yes.

The long answer: the incremental approach isn't a change in the implementation or the user pain points it will tackle, but in how we deliver it. The current POC PRs tackle a lot all at once, which makes them hard to review and test properly. This would ultimately mean a delay in shipping and lower confidence that it works as expected. So, as @idanov says, this iterative approach will likely end up being shorter and allow us to deliver improvements bit by bit.

@ElenaKhaustova and I had another chat and the concrete next steps are:

ElenaKhaustova commented 2 months ago

Thank you, @astrojuanlu, for sharing your ideas and vision for the target DataCatalog. I agree with most of them, and they are similar to what is planned. But we will try to do it incrementally, since there's an explicit push for that.

ElenaKhaustova commented 4 weeks ago

Further plan for KedroDataCatalog:

  1. Merge #4175
  2. Add KedroDataCatalog documentation explaining how to use it: https://github.com/kedro-org/kedro/issues/4237
  3. Next features we plan to implement:
  4. Define the breaking changes that need to be done and make them in the develop branch
  5. Release new Kedro version with breaking changes

Further refactoring and redesign that require breaking changes

The next two items should be done together, but we decided to postpone them for now.

Currently, runners depend on both datasets and the catalog. We want all framework components to use the catalog abstraction to work with datasets, so the following refactoring is needed:

Currently, there are two ways to configure the catalog: from dataset configurations and from dataset objects. Since datasets do not store their configurations, there is no way to retrieve them from dataset objects at the catalog level. This blocks features like https://github.com/kedro-org/kedro/issues/3932. In future we will need to:
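For context, the two configuration paths mentioned above look roughly like this (a hedged sketch assuming a recent Kedro version with kedro-datasets installed; dataset names and file paths are placeholders):

```python
from kedro.io import DataCatalog, MemoryDataset

# Path 1: from configuration - the catalog sees each dataset's config,
# so it could in principle expose it again (e.g. for a future to_dict()).
catalog_from_config = DataCatalog.from_config(
    {"reviews": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/reviews.csv"}}
)

# Path 2: from already-instantiated dataset objects - the original configuration
# is not recoverable from the objects, which blocks serialising the catalog
# back to configuration.
catalog_from_objects = DataCatalog(datasets={"scratch": MemoryDataset()})
```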