Enable adding new attributes to datasets

merelcht commented 1 year ago

Description

Implement the feature to allow users to add new attributes to datasets. Use the syntax decided in #2439

Verify that this works for the datasets preview: https://github.com/kedro-org/kedro-viz/issues/907
The extra attributes will be consumed by plugins (kedro-viz, kedro-mlflow etc) and not on the Kedro backend

Context

Sub-task of https://github.com/kedro-org/kedro/issues/1076

To be done after: Decide on syntax to allow adding new attributes#2439

Open question

Where should the metadata (or other name) attribute go?

On the individual datasets
On the AbstractDataSet
On the DataCatalog like layers?

merelcht commented 1 year ago

I think the handling of the metadata attribute should happen on the individual datasets. Any processing of the info in the metadata will be done by external plugins and not the Kedro framework itself.

The only complication here is that layers are processed inside the DataCatalog. It used to be on the individual datasets as well, but was moved to the catalog later on: https://github.com/quantumblacklabs/private-kedro/pull/548/files

If layers needs to go under metadata like this:

my_dataset:
       ....
       metadata:
            kedro-viz:
                 layer: raw

We'll need to ensure they're still processed correctly and also make sure the implementation is backwards compatible with having layers outside the metadata key.

AhdraMeraliQB commented 1 year ago

After discussion with @merelcht and @AntonyMilneQB the implementation should be as follows:

metadata, being a top-level key, will be introduced as an attribute on the individual datasets
The layer attribute will move to within metadata -> kedro-viz and will no longer be handled by Kedro's DataCatalog - reversing the decision made as part of https://github.com/quantumblacklabs/private-kedro/pull/548. As a result the logic for resolving this will need to move to kedro-viz (as currently they are the consumers of layer, not Kedro itself)
Dropping layer as a top-level key will be a breaking change forcing the implementation to happen in two stages: the first handling the nested layer on Viz, adding deprecation warnings to the top-level layer (non-breaking), and the second being the removal of the top-level layer entirely

These require further subtasks that I will define in separate issues, but I would like to get @idanov's and @rashidakanchwala's opinions here first.

The order of tasks would look like this:

Add metadata attribute to datasets
Add logic to Viz to check for layer from metadata->kedro-viz->layer before catalog.layer & move over this check
Add deprecation warnings to the top-level layer attribute on Kedro framework
- should the deprecation warnings be triggered just whenever a top-level layer is defined, or upon both definition and access?
Remove the top-level layer attribute (breaking) and the logic introduced to Viz in step 2

Questions to consider:

Why was layer moved to the DataCatlog originally (in the ticket above)? Is there something that hasn't been addressed?
Are there any oversights on moving the handling of the layer attribute to Viz?
Are there any other users of the top-levellayer attribute?

An alternative approach:

Alternatively, we could just keep layer where it is and not process it if defined within the metadata. This doesn't address the DataCatalog processing an attribute that isn't used within Kedro, but allows for the introduction of the metadata attribute to be made without any additional implementation on the Kedro framework.

rashidakanchwala commented 1 year ago

Questions to consider: - Why was layer moved to the DataCatlog originally (in the ticket above)? Is there something that hasn't been addressed? This is just my assumption but layers have been on Kedro-viz since 2020 and prior to this there was no other extra information that was passed in catalog from kedro project to kedro-viz, that's why probably there was no need to have it nested? - not a great idea for sure.

Are there any oversights on moving the handling of the layer attribute to Viz? Let me revert on this.

Are there any other users of the top-level layer attribute? I doubt, it seems very specific to Kedro-viz.

rashidakanchwala commented 1 year ago

Currently the layer logic resides in Kedro Framework in DataCatalog (https://github.com/kedro-org/kedro/blob/main/kedro/io/data_catalog.py#L270-L275). Kedro basically sends to kedro-viz - all the layers as a dict in the format below

dict_items([('raw', {'companies', 'shuttles', 'reviews'}), ('intermediate', {'ingestion.int_typed_companies', 'ingestion.int_typed_shuttles@pandas2', 'ingestion.int_typed_reviews', 'ingestion.int_typed_shuttles@pandas1'}), ('primary', {'prm_spine_table', 'prm_shuttle_company_reviews'}) .... )])

On kedro-viz we simply read the above, and match the dataset name to the key it belongs to. (https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/data_access/repositories/catalog.py#L40-L47)

I suppose now this will change so Kedro will only send a dict called metadata to Kedro-viz. And Kedro-viz will extract all the layer information and map it correctly.

This is fine but I am not sure how to make it backward compatible in a clean way @merelcht @AntonyMilneQB ?

kedro-org / kedro