merelcht opened this issue 4 months ago
One push here - and it's already somewhat addressed in your excellent write-up @merelcht - is that we should look to delegate / integrate as much as we can.
The data catalog was the first part of Kedro ever built; it was a leap forward for us in 2017, but the industry has matured so much in that time. We should always provide an accessible, novice-user UX... but I think now is the time for interoperability.
DataCatalog2.0
design board: https://miro.com/app/board/uXjVKy77OMo=/?share_link_id=278933784087
Love the board @ElenaKhaustova! There is an argument we should have all of those in PUML in the docs too 🤔 a bit like #4013
The current DataCatalog implementation effectively addresses basic tasks such as preventing users from hardcoding data sources and standardising I/O operations. However, our goal is to enhance and extend its capabilities to reduce maintenance burdens, integrate seamlessly with enterprise solutions, and leverage the strengths of existing tools like DeltaLake and UnityCatalog. By keeping the datasets-based solution while integrating with UnityCatalog, we aim to cover a broader range of use cases and improve flexibility and functionality. With this redesign work, we aim to:
- Develop a flexible API and specific data catalogs, allowing switching between different catalog implementations and utilising their features, while keeping the current dataset-based solution with adjustments.
AbstractDataCatalog
- Implement an AbstractDataCatalog with common methods exposed to the framework: load(), save(), confirm(), release(), exists(), shallow_copy(), and __contains__() (subject to change?).
- Implement KedroDataCatalog and UnityDataCatalog.
Common methods will be exposed to the framework. Methods not supported by a specific catalog will raise a "Not supported for this type of catalog" exception. KedroDataCatalog will keep the current dataset-based solution, with the updates needed to address the pain points identified in the user research interviews.
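To make the shape of this interface concrete, here is a rough sketch of what such a base class could look like. The method names come from the list above; everything else, including the exception type, is illustrative and not a final API (shallow_copy() is left out on purpose, since it is questioned further down the thread):

```python
from abc import ABC, abstractmethod
from typing import Any


class CatalogOperationNotSupportedError(Exception):
    """Raised when a concrete catalog does not support an operation."""


class AbstractDataCatalog(ABC):
    """Common interface exposed to the framework; concrete catalogs
    (e.g. a Kedro- or Unity-backed one) implement what they can and
    raise for the rest."""

    @abstractmethod
    def load(self, name: str) -> Any:
        """Load and return the data for the named dataset."""

    @abstractmethod
    def save(self, name: str, data: Any) -> None:
        """Save data under the named dataset."""

    @abstractmethod
    def __contains__(self, name: str) -> bool:
        """Return True if the catalog knows about the named dataset."""

    def confirm(self, name: str) -> None:
        raise CatalogOperationNotSupportedError(
            f"'confirm' is not supported for {type(self).__name__}"
        )

    def release(self, name: str) -> None:
        raise CatalogOperationNotSupportedError(
            f"'release' is not supported for {type(self).__name__}"
        )

    def exists(self, name: str) -> bool:
        raise CatalogOperationNotSupportedError(
            f"'exists' is not supported for {type(self).__name__}"
        )
```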
Unity Catalog Integration Options:
- Local workflow after integration: UnityDataCatalog, which encapsulates the open-source Unity Catalog API hosted locally.
- Remote workflow after integration: UnityDataCatalog, encapsulating the Unity Catalog API hosted on Databricks.
After evaluating Unity Catalog (open source) and Unity Catalog (Databricks) and their APIs, we recommend starting the integration using the Databricks Python SDK via Databricks notebooks.
Reasons for Recommendation:
- UnityDataCatalog can manage schemas, tables, volumes, and models. It will be used as a wrapper to align with the Kedro API and the "datasets" concept.

Providing some context for "What is UnityCatalog?", as I personally find their docs very confusing. I think the main differentiation is more of an enterprise focus on governance/access control.
Unity Catalog is a unified governance solution for data and AI assets on Databricks. It is not just a metastore or data connector, but rather a comprehensive system that includes several key components:
- Metastore: Unity Catalog uses a metastore as the top-level container for metadata about data assets and permissions. This metastore is similar to but more advanced than traditional Hive metastores.
That's a large part of the RESTful API described above: it stores metadata, and connectors consume the metadata and act almost like a dblink.
- Three-level namespace: Unity Catalog organizes data assets using a three-level namespace hierarchy: catalog > schema (database) > table/view/volume. This allows for better organization and governance of data assets.
- Access control: It provides centralized access control across Databricks workspaces, allowing administrators to define data access policies in one place.
Main feature for access management.
- Auditing and lineage: Unity Catalog automatically captures user-level audit logs and data lineage information.
I haven't seen too much about this
- Data discovery: It includes features for tagging, documenting, and searching data assets to help users find the data they need.
Metastore + UI as a shop for data sources.
- Support for multiple object types: Unity Catalog can manage metadata for tables, views, volumes, models, and functions.
- Flexible storage options: It allows configuring storage locations at the metastore, catalog, or schema level to meet various data storage requirements.
A bit similar to what fsspec does for Kedro, but it expands beyond remote storage (Volumes). They also have models and tables.
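To make the "wrapper" idea concrete: a UnityDataCatalog-backed dataset could simply point at a three-level table name and delegate the actual I/O to Spark, which already knows how to talk to Unity Catalog. A minimal sketch under that assumption (the class and its parameters are made up for illustration):

```python
from pyspark.sql import DataFrame, SparkSession


class UnityTableDataset:
    """Illustrative dataset reading/writing a Unity Catalog table
    through Spark's three-level namespace: catalog.schema.table."""

    def __init__(self, table: str) -> None:
        self._table = table  # e.g. "main.sales.orders"

    def load(self) -> DataFrame:
        spark = SparkSession.builder.getOrCreate()
        # Unity Catalog resolves *where* the table lives; Spark does the reading.
        return spark.table(self._table)

    def save(self, data: DataFrame) -> None:
        data.write.mode("overwrite").saveAsTable(self._table)
```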
With that in mind, my questions are:
- What are the benefits of integrating with UnityCatalog? Do Kedro users need to interact with UnityCatalog? Databricks users today can already use UnityCatalog without any integration. As Spark communicates with the catalog in the background, end users just declare the table name.
- In the case of using pandas with UnityCatalog, what does it do exactly, or is it merely reading data from some kind of remote storage? I can't find any meaningful example beyond Spark/Delta in their docs.
- The UnityCatalog and DataCatalog abstractions are on different levels: Kedro's DataCatalog is mainly a data connector (i.e. how to load a parquet file), while UnityCatalog answers questions like "where is table a.b.c?" (-> s3/some_directory/abc.parquet); it doesn't have information about how a data source should be consumed. With that in mind, does it make sense to focus on a subset / Databricks-native workflow (Spark/Delta/pandas)? (A rough illustration of this difference follows below.)

I also wonder: should we put all the focus on UnityCatalog/other catalogs, or should it be more around API changes for a better interactive use case (i.e. using the DataCatalog in a notebook)?
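To illustrate the difference in abstraction levels with the current API (assuming kedro-datasets provides pandas.ParquetDataset): a DataCatalog entry carries the "how" (dataset type, load/save arguments), whereas a Unity Catalog name only answers the "where":

```python
from kedro.io import DataCatalog

# Kedro's DataCatalog: the entry itself says *how* to read the data.
catalog = DataCatalog.from_config(
    {
        "orders": {
            "type": "pandas.ParquetDataset",  # connector/engine to use
            "filepath": "s3://bucket/some_directory/abc.parquet",
        }
    }
)
# df = catalog.load("orders")  # would use the declared connector

# Unity Catalog: only *where* the data is registered; how to consume it is
# left to the engine, e.g. Spark resolving the three-level name:
#   spark.table("a.b.c")
```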
For me, the value is not particularly in integrating with Unity Catalog but in exploring ways to extend the current DataCatalog concept with new approaches. Currently, maintaining custom data connectors introduces additional overhead and complexity. By leveraging existing mechanisms like Unity Catalog, we can potentially simplify data access and reduce maintenance burdens. This doesn't mean we plan to completely move to this new approach; rather, we want to test its feasibility and value.
The Unity Catalog and DataCatalog abstractions indeed operate at different levels. However, Unity Catalog includes abstractions such as tables, volumes, and models, which can be aligned with Kedro's DataCatalog abstractions. Our goal with the PoC is to verify this alignment. Unity Catalog, particularly when accessed via the Python SDK in Databricks notebooks (see the above summary), provides a comprehensive set of features (tables, models, volumes) and a UI, making it a suitable candidate for our experiment. Given that many clients use Databricks for development and deployment, it makes sense to start here. Based on the PoC results, we can decide whether to pursue further integration with Unity Catalog, consider other catalogs, or stick with our current solution.
Our main focus remains on improving Kedro's DataCatalog based on insights from user research and interviews.
Thanks a lot for the extensive research on the Unity Catalog @ElenaKhaustova 🙏🏼
My only point is that we should not make anything that's specific to Databricks Unity Catalog (since it's a commercial system), and it's a bit too early to understand how the different dataframe libraries and compute engines will interact with such metastores: https://github.com/unitycatalog/unitycatalog/discussions/208#discussioncomment-10208766. At least, now that Polaris has just been open sourced (https://github.com/polaris-catalog/polaris/pull/2), we know that the Apache Iceberg REST API "won", so if anything we should take that REST API as the reference.
More questions from my side:
- Have you thought about a CatalogProtocol rather than an AbstractCatalog?
- I see a _datasets property in your summary https://github.com/kedro-org/kedro/issues/3995#issuecomment-2258346648, but don't we want to offer a public, documented way of iterating through the datasets?
- What is shallow_copy for?

Thank you, @astrojuanlu!
I fully agree with your points about not tying ourselves to specific catalogs. The truth is that we are still determining whether we want to integrate with UnityCatalog/Polaris or something else. The answer might change depending on how they develop in the near future. That's why we suggest focusing on improving Kedro's DataCatalog, a solution shaped by insights from user research and interviews, and treating the integration part as research, with a PoC as the target result.
In order to work on those two goals in parallel, we plan to start by moving shared logic to the AbstractDataCatalog and implementing KedroDataCatalog with the following improvements: https://github.com/kedro-org/kedro/issues/3925, https://github.com/kedro-org/kedro/issues/3916, https://github.com/kedro-org/kedro/issues/3926, https://github.com/kedro-org/kedro/issues/3931. Once that's done, we'll be able to work on UnityDataCatalog and add more complex features from https://github.com/kedro-org/kedro/issues/3934, such as serialization/deserialization, to KedroDataCatalog.
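As a side note on what "serialization/deserialization" could mean in practice, here is a toy, purely illustrative sketch (none of these names are a committed API): the catalog keeps the plain config it was built from so it can be dumped back out and rebuilt.

```python
from typing import Any


class SerializableCatalogSketch:
    """Toy example: round-trip a catalog between object form and plain config."""

    def __init__(self, config: dict[str, dict[str, Any]]) -> None:
        # e.g. {"orders": {"type": "pandas.ParquetDataset", "filepath": "..."}}
        self._config = dict(config)

    def to_config(self) -> dict[str, dict[str, Any]]:
        return dict(self._config)

    @classmethod
    def from_config(cls, config: dict[str, dict[str, Any]]) -> "SerializableCatalogSketch":
        return cls(config)


catalog = SerializableCatalogSketch(
    {"orders": {"type": "pandas.ParquetDataset", "filepath": "orders.parquet"}}
)
assert SerializableCatalogSketch.from_config(catalog.to_config()).to_config() == catalog.to_config()
```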
Answering other questions:
- Regarding CatalogProtocol - we will keep it in mind, but I suggested AbstractCatalog as it seems there will be some shared logic, such as pattern resolution, which falls into the ABC concept. The PoC should show whether it's a good idea and what constraints we might get with the suggested approach.
- Regarding _datasets, shallow_copy and the other properties/methods I mentioned above - that's not the suggested interface, but the DataCatalog interface used within the session, runner and pipeline, meaning it will most likely remain in some form, as we do not plan to rewrite the whole framework. That doesn't mean we won't move from _datasets to datasets and so on, so it can change, but its primary purpose will probably remain the same.

I just want to say I love the direction this is going ❤️ , great work folks
We picked the following tickets: https://github.com/kedro-org/kedro/issues/3925, https://github.com/kedro-org/kedro/issues/3926, https://github.com/kedro-org/kedro/issues/3916 and https://github.com/kedro-org/kedro/issues/3931 as a starting point for the implementation of AbstractDataCatalog and KedroDataCatalog.
The following PRs include the drafts of AbstractDataCatalog, KedroDataCatalog and the updated CLI logic:
- AbstractDataCatalog and KedroDataCatalog, refactoring factory resolution logic and dataset access: https://github.com/kedro-org/kedro/pull/4070

The mentioned PRs include a draft of the following:
- AbstractDataCatalog and KedroDataCatalog(AbstractDataCatalog)
- AbstractDataCatalog now supports instantiation from configuration and/or datasets via the constructor
- AbstractDataCatalog stores the configuration provided
- Pattern resolution logic moved from _get_dataset() to resolve_patterns()
- _dataset_patterns and _default_patterns are now obtained from the config at __init__
- resolved_ds_configs property added to store resolved datasets' configurations at the catalog level
- add() method adds or replaces the dataset and its configuration
- add_feed_dict() renamed to add_from_dict()
- _runtime_patterns catalog field added to keep the logic of processing dataset/default/runtime patterns clear
- shallow_copy() method, used to add extra_dataset_patterns at runtime, was replaced with a dedicated add_runtime_patterns() method
- Removed _FrozenDatasets and access to datasets as properties; datasets can be replaced via add(replace=True)
- KedroDataCatalog is kept mutable:
  - the underlying datasets dictionary is not exposed publicly, so as not to encourage behaviour where users configure the catalog by modifying the datasets dictionary
  - the _datasets property remained protected, but a public datasets property was added, returning a deep copy of _datasets while the setter is still not allowed; the same applies to the _resolved_ds_configs property (see the sketch below)
  - datasets are added to _datasets via the catalog.add() method
- To keep AbstractDataCatalog compatible with the current runners' implementation, several methods - release(), confirm() and exists() - were kept as part of the interface, but they only have a meaningful implementation for KedroDataCatalog
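A minimal sketch of the mutability/access rules described above (names mirror the draft; the body is illustrative rather than the actual PR code):

```python
import copy
from typing import Any


class KedroDataCatalogSketch:
    """Datasets are only added via add(); the public `datasets` view is a
    deep copy, so mutating it cannot reconfigure the catalog."""

    def __init__(self) -> None:
        self._datasets: dict[str, Any] = {}

    def add(self, name: str, dataset: Any, replace: bool = False) -> None:
        if name in self._datasets and not replace:
            raise ValueError(f"Dataset '{name}' already exists; pass replace=True to overwrite it")
        self._datasets[name] = dataset

    @property
    def datasets(self) -> dict[str, Any]:
        # Read-only view: no setter, and changes to the copy don't touch the catalog.
        return copy.deepcopy(self._datasets)
```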
Some explanations behind the decisions made:
- The from_config() method was removed in favour of instantiation from config and/or datasets via the constructor

After a brief discussion of the changes made with @idanov and these features (https://github.com/kedro-org/kedro/issues/3935 and https://github.com/kedro-org/kedro/issues/3932), we would like to focus on the following topics:
catalog["dataset_name"]
gives the Dataset
object, but it can return the data instead as if we do catalog["dataset_name"].load()
. The last may simplify an interface by removing load()
and save()
methods.load
or save data
. We want to explore what should be the default behaviour. We consider adding a flat to enable/disable lazy loading and keep it disabled by default to avoid the case when the pipeline fails at the very end because some package is missing. However, we can consider automatically enabling it based on some events, such as pipeline slicing.DataCatalog
remains the same. With this approach, we stack one PR on top of another and will have to merge all of them at the end. Another approach is moving the changes proposed incrementally to the existing DataCatalog
, trying to make them non-breaking so users can try new features while the rest are in development.The following PR https://github.com/kedro-org/kedro/pull/4084 includes updates required to use both DataCatalog
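A toy illustration of the first two points, just to make the discussion concrete (hypothetical code, not a proposal for the final API):

```python
from typing import Any, Callable


class DictAccessCatalogSketch:
    """catalog["name"] returns the loaded data (i.e. it calls .load() for you),
    and a flag controls eager vs lazy materialisation of the dataset objects."""

    def __init__(self, factories: dict[str, Callable[[], Any]], lazy: bool = False) -> None:
        self._factories = factories
        self._lazy = lazy
        # Eager by default: instantiate everything up front so a missing
        # dependency fails at catalog creation, not at the end of a run.
        self._datasets: dict[str, Any] = (
            {} if lazy else {name: make() for name, make in factories.items()}
        )

    def __getitem__(self, name: str) -> Any:
        if name not in self._datasets:
            self._datasets[name] = self._factories[name]()  # lazy materialisation
        return self._datasets[name].load()  # behaves like catalog[name].load()
```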
The following PR, https://github.com/kedro-org/kedro/pull/4084, includes the updates required to use both DataCatalog and AbstractDataCatalog to run a Kedro project. It roughly shows the changes needed on the framework side to keep both versions working together.
We also tried the approach of moving the proposed changes incrementally into the existing DataCatalog, trying to make them non-breaking so users can try new features while the rest are in development. This involved moving the existing DataCatalog onto the AbstractDataCatalog so we could use catalogs inherited from AbstractDataCatalog in the framework. But since the catalog initialisation has changed significantly in the AbstractDataCatalog, it led to overloading AbstractDataCatalog with fields, parameters and methods needed only for DataCatalog, mixing both implementations. So it was decided not to go with this approach.
There's also an approach where we use the develop branch to replace DataCatalog with AbstractDataCatalog straight away and only merge to main when it's finished and we make a breaking release. The biggest drawback is that it will require more time until full completion, and users won't be able to use the new catalog during development.
Based on the above, the suggested approach going forward is:
- Implement KedroDataCatalog(AbstractDataCatalog) that covers the current DataCatalog functionality
- Add the new features to KedroDataCatalog
- Implement UnityDataCatalog (the expected result is a PoC with working spaceflights pipelines using UnityDataCatalog on Databricks) and, based on the results, decide on the AbstractDataCatalog interface
- Replace DataCatalog with KedroDataCatalog after some time, when we decide to do the breaking release

Other things to take into consideration:
- ParallelRunner expects datasets to be mutable; do we want to keep them mutable for ourselves but not for users?

Some reflections after the tech design, and thoughts on @deepyaman's concerns and suggestions from here:
Questions/suggestions raised from tech design, for posterity:
Bit of a philosophical question—if this is a data catalog redesign/2.0, why shy away from the significant changes, instead of taking this opportunity to make them?
Feels like, if not now, then when?
It feels a bit backwards to start with AbstractDataCatalog abstraction without having a fuller understanding of what a second or third data catalog really looks like; maybe it makes sense to PoC these other data catalogs, then create the unifying abstraction?
You don’t necessarily need to create the AbstractDataCatalog in order to address the other changes.
What I think would be extremely helpful is to start with the PoC Unity Catalog to make the value more concrete/clear to the rest of us, who don’t understand.🙂
Like you say, there are already users who are creating their own catalog implementations of sorts, or extending the catalog; would love to see a hack of the Unity Catalog with Kedro, and then can see (1) the value and (2) how this abstraction can be best designed to support it.
If anything, having a hacked in Iceberg Catalog + Polaris Catalog too will help show that the abstraction is really correct and solving the goal.
There’s also some challenges with starting this early on the abstraction, like already deciding we will have AbstractDataCatalog. In another thread, @astrojuanlu raises question, have thought about having DataCatalogProtocol? This would help allow for standalone data catalog (which is part of goals in group 3). AbstractDataCatalog will cause an issue, same as AbstractDataSet currently forces dependency on Kedro.
First of all, thank you for this summary - we appreciate that people are interested in this workstream and care about the results.
- For now we suggested AbstractDataCatalog rather than DataCatalogProtocol. We will double-check how we can benefit from the second one but, as noted before, the Unity Catalog PoC should answer this question as well.
- We don't see the choice between AbstractDataCatalog and DataCatalogProtocol as final. Having one of them does not mean that we won't switch to the other if we find it makes sense during the PoC. At the same time, we don't want to block the main workstream with integration experiments. And of course, we are not going to make any breaking changes until we are sure about the necessity of the abstraction and its API. We need some time to showcase more concrete suggestions on other catalogs' implementations, but we already know that they have to expose some specific interface to be compatible with the rest of the framework, as we are not going to rewrite the whole framework. From here we make an assumption about how the interface should look, but we admit it can change.
- Not sure I got the point about shying away from significant changes; we are suggesting them. If you mean some specific feature - "speak now or forever hold your peace" 🙂
https://github.com/kedro-org/kedro/issues/3941
As you mention, "There is sufficient user interest to justify making DataCatalog standalone."
At the very least, I would like to see the ability to create and use a DataCatalog that does not depend on Kedro; right now, this is not possible because of the DataCatalog subclass validator; AbstractDataCatalog further couples this.
Some follow-ups after the discussion with @astrojuanlu, @merelcht and @deepyaman:
At the very least, would like to see the ability to create and use a DataCatalog that does not depend on Kedro; right now, this is not possible, because of the DataCatalog subclass validator; AbstractDataCatalog further couples this.
1. We don't think the abstraction itself is a blocker for making DataCatalog a separate component; whether or not we have it, we will still be able to move it together with the implementation. The real problem is the dependency on kedro.io.core - https://github.com/kedro-org/kedro/blob/6c7a1cca9629d09a9051b0fd0a74c7c22ebd2f01/kedro/io/data_catalog.py#L18
There is also a different opinion on the idea of splitting kedro into a smaller set of libs here: https://github.com/kedro-org/kedro/issues/3659#issuecomment-2054167697
To sum up, we would like to keep this topic out of the discussion for now, as the decision about the abstraction doesn't directly relate to the problem and can be made later.
2. It only makes sense to use a Protocol instead of an Abstract base class if we move the pattern resolution logic out of the DataCatalog. Otherwise, to reuse the pattern logic we would have to explicitly declare that a certain class implements the protocol as a regular base class (https://peps.python.org/pep-0544/#explicitly-declaring-implementation) and we would lose the advantage of a Protocol (see the sketch below).
3. Moving the pattern resolution logic out of the DataCatalog will simplify the overall logic and implementation of DataCatalog. At the same time, it allows untying the catalog from the dataset configuration logic related to the framework. As a side effect, we can get a catalog with a more loosely coupled architecture that does not share extensive mandatory logic, so it will be easier to follow the Protocol concept and proceed with potential integrations.
4. We could also consider a Protocol for datasets to deal with the catalog dependency on kedro.io.core.
Given points 1, 2 and 3, we are going to:
- Keep the existing DataCatalog for now;
- Postpone the decision between an abstract base class and a Protocol;
- Focus on moving pattern resolution logic outside of DataCatalog.
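For reference, this is roughly what the Protocol option from PEP 544 looks like (illustrative signatures only): any class that happens to implement these methods satisfies the protocol structurally, without inheriting from anything in Kedro - which is also why it only pays off once shared logic such as pattern resolution no longer has to live in a base class.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CatalogProtocol(Protocol):
    """Structural interface: no Kedro base class required."""

    def load(self, name: str) -> Any: ...
    def save(self, name: str, data: Any) -> None: ...
    def exists(self, name: str) -> bool: ...
    def __contains__(self, name: str) -> bool: ...


class MyStandaloneCatalog:  # note: no base class at all
    def __init__(self) -> None:
        self._data: dict[str, Any] = {}

    def load(self, name: str) -> Any:
        return self._data[name]

    def save(self, name: str, data: Any) -> None:
        self._data[name] = data

    def exists(self, name: str) -> bool:
        return name in self._data

    def __contains__(self, name: str) -> bool:
        return name in self._data


assert isinstance(MyStandaloneCatalog(), CatalogProtocol)  # structural check only
```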
The following ticket and PRs address this point from the above discussion:
- KedroDataCatalog and DataCatalogConfigResolver (see the sketch below for the general idea of the resolver)
- context, session, runners and project cli update
- catalog cli update
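As a rough illustration of what pulling pattern resolution into a separate component buys (a simplified sketch, not the actual DataCatalogConfigResolver code): resolving a dataset factory pattern is essentially string matching plus config templating, which has no reason to live inside the catalog class itself.

```python
from typing import Any, Optional

from parse import parse  # the parse library, which Kedro uses for dataset factory patterns


def resolve_pattern(name: str, patterns: dict[str, dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return a concrete dataset config for `name` from the first matching pattern
    (top-level string values only, for brevity)."""
    for pattern, template in patterns.items():
        result = parse(pattern, name)
        if result is None:
            continue
        return {
            key: value.format(**result.named) if isinstance(value, str) else value
            for key, value in template.items()
        }
    return None


patterns = {"{name}_data": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/{name}.csv"}}
print(resolve_pattern("reviews_data", patterns))
# {'type': 'pandas.CSVDataset', 'filepath': 'data/01_raw/reviews.csv'}
```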
Further steps suggested:
- Protocol abstraction for KedroDataCatalog;
- Merge 3995-data-catalog-2.0 to main to have two versions of the catalog, a non-breaking change;
- Further work on KedroDataCatalog.

After discussing the above with @merelcht and @idanov, it was decided to split the above work into a set of incremental changes, modifying the existing catalog class or extending the functionality by introducing an abstraction without breaking changes where possible. Then, plan the set of breaking changes and discuss them separately.
Some motivations behind the decision:
Long story short: we prefer a longer path of incremental changes to the existing catalog over dropping in a brand-new catalog.
"A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system." —John Gall (1975) Systemantics: How Systems Really Work and How They Fail p. 71
Since I believe the DataCatalog touches on so many ongoing conversations, I took a stab at implementing some ideas and publishing them here: https://github.com/astrojuanlu/kedro-catalog hoping that they serve as inspiration.
This is a prototype hacked in a rush, so it's not meant to be a full replacement of the current KedroCatalog. It tries to tackle several pain points highlighted in https://github.com/kedro-org/kedro/issues/3934 by starting from scratch. Some niceties:
The codebase is lean and makes heavy use of @dataclass and Pydantic models. I'm no software engineer so I'm not claiming it's well designed, but hopefully it's easy to understand (and therefore criticise).
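Not the actual kedro-catalog code, but to give a flavour of the dataclass/Pydantic style being referred to, a catalog entry can be just a small validated model with no behaviour attached:

```python
from pydantic import BaseModel


class DatasetSpec(BaseModel):
    """A declarative catalog entry: validated data, no I/O logic."""

    name: str
    type: str        # e.g. "pandas.ParquetDataset"
    filepath: str
    load_args: dict = {}
    save_args: dict = {}


spec = DatasetSpec(name="orders", type="pandas.ParquetDataset", filepath="s3://bucket/orders.parquet")
print(spec)
```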
Of course, it's tiny because it leaves lots of things out. It critically does not support:
So I guess my real question is:
Are we confident that the incremental strategy allows us to tackle all these user pain points in a timely fashion, while also maintaining backwards compatibility, including for features that we aren't sure we want to keep around?
I think the quote you shared is clearly hinting towards an answer - we already have a complex system, so unless we want to dismantle the whole complex functionality that Kedro offers, we'd be better off with incremental changes. Unless you are suggesting to redesign the whole of Kedro and go with Kedro 2.0, but I'd rather try to reach 1.0 first 😅
Nevertheless, the sketched out solution you've created definitely serves as nice inspiration and highlights some of the ideas already in circulation, namely employing protocols and dataclasses, which we should definitely drift towards. We should bear in mind that a lot can be achieved in non-breaking changes with a bit of creativity.
In fact, the path might actually end up being much shorter if we go the non-breaking road; it might just involve more frequent, smaller steps rather than a big jump, which would inevitably end up being followed by patch fixes, bug fixes and corner cases that we hadn't foreseen.
Are we confident that the incremental strategy allows us to tackle all these user pain points in a timely fashion, while also maintaining backwards compatibility, including for features that we aren't sure we want to keep around?
The short answer to this: yes.
The long answer: the incremental approach isn't a change in the implementation or the user pain points it will tackle, but in how we will deliver it. The current POC PRs tackle a lot all at once, which makes it hard to review and test properly. This will ultimately mean a delay in shipping and lower confidence that it works as expected. So, like @idanov says, this iterative approach will likely end up being shorter and allow us to deliver improvements bit by bit.
@ElenaKhaustova and I had another chat and the concrete next steps are:
- Refactor the pattern resolution logic in the existing DataCatalog #3925. This can then already be shipped if the time is right for a release.
- Introduce KedroDataCatalog (or whatever name we decide on), which also uses the resolution logic + addresses https://github.com/kedro-org/kedro/issues/3926, https://github.com/kedro-org/kedro/issues/3916 and https://github.com/kedro-org/kedro/issues/3931

Thank you, @astrojuanlu, for sharing your ideas and vision on the target for DataCatalog. I agree with most of them, and that's similar to what is planned. But we will try to do it incrementally, since there's an explicit push for that.
Remaining work on KedroDataCatalog:
- KedroDataCatalog documentation explaining how to use it: https://github.com/kedro-org/kedro/issues/4237
- develop branch

The next two should be done together, but we decided to postpone them for now:
- CatalogConfigResolver - move the credentials resolver out to the config component
- Currently, runners depend on both datasets and the catalog; we want all framework components to use the catalog abstraction to work with datasets, so the following refactoring is needed:
  - MemoryDataset and SharedMemoryDataset handling in the runners
  - release(), exists(), confirm() - whether they should be part of CatalogProtocol, and whether they will change with the runners refactoring

Now we have two ways to configure the catalog: from dataset configurations and from dataset objects. Since datasets do not store their configurations, there is no way to retrieve them from dataset objects at the catalog level. This blocks features like https://github.com/kedro-org/kedro/issues/3932. In future we will need to:
- Define a DatasetProtocol (see the sketch below);
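A sketch of what such a DatasetProtocol might require so that the catalog can recover configuration from dataset objects (illustrative only; in particular, to_config() is a hypothetical method, not an existing Kedro API):

```python
from typing import Any, Protocol


class DatasetProtocol(Protocol):
    """Minimal structural contract a dataset would need to offer the catalog."""

    def load(self) -> Any: ...
    def save(self, data: Any) -> None: ...
    def exists(self) -> bool: ...

    def to_config(self) -> dict[str, Any]:
        """Return the configuration this dataset was created from, so the
        catalog can serialise it (the missing piece blocking #3932 above)."""
        ...
```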
Description
The current DataCatalog in Kedro has served its purpose well, but it has limitations and areas for improvement identified through user research: https://github.com/kedro-org/kedro/issues/3934
As a result of the DataCatalog user research interviews, we have created a list of tickets and split them into 3 categories:
Addressing issues from 2. and 3. requires significant changes and the introduction of new features and concepts that go beyond the scope of incremental updates.
The objective is to design a new, robust, and modular DataCatalog2.0 (a better name is welcomed) that incorporates feedback from the community, follows best practices, and integrates new features seamlessly. While redesigning, we plan for a smooth migration from the current DataCatalog to DataCatalog2.0, minimizing disruption for existing users.
Context
Suggested prioritisation and tickets opened: https://github.com/kedro-org/kedro/issues/3934#issuecomment-2153342972
Related topics
https://github.com/kedro-org/kedro-starters/tree/main/standalone-datacatalog https://github.com/kedro-org/kedro/issues/2901 https://github.com/kedro-org/kedro/issues/2741
Next steps
- Review the current DataCatalog architecture.
- Research how other catalog solutions (Unity Catalog, Polaris, dlthub, other?) address similar tasks and challenges.
- Design DataCatalog2.0.