kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.91k stars 900 forks source link

Enable adding new attributes to datasets #2440

Closed merelcht closed 1 year ago

merelcht commented 1 year ago

Description

Implement the feature to allow users to add new attributes to datasets. Use the syntax decided in #2439

Context

Sub-task of https://github.com/kedro-org/kedro/issues/1076

To be done after: Decide on syntax to allow adding new attributes#2439

Open question

Where should the metadata (or other name) attribute go?

merelcht commented 1 year ago

I think the handling of the metadata attribute should happen on the individual datasets. Any processing of the info in the metadata will be done by external plugins and not the Kedro framework itself.

The only complication here is that layers are processed inside the DataCatalog. It used to be on the individual datasets as well, but was moved to the catalog later on: https://github.com/quantumblacklabs/private-kedro/pull/548/files

If layers needs to go under metadata like this:

my_dataset:
       ....
       metadata:
            kedro-viz:
                 layer: raw

We'll need to ensure they're still processed correctly and also make sure the implementation is backwards compatible with having layers outside the metadata key.

AhdraMeraliQB commented 1 year ago

After discussion with @merelcht and @AntonyMilneQB the implementation should be as follows:

These require further subtasks that I will define in separate issues, but I would like to get @idanov's and @rashidakanchwala's opinions here first.

The order of tasks would look like this:

  1. Add metadata attribute to datasets
  2. Add logic to Viz to check for layer from metadata->kedro-viz->layer before catalog.layer & move over this check
  3. Add deprecation warnings to the top-level layer attribute on Kedro framework
    • should the deprecation warnings be triggered just whenever a top-level layer is defined, or upon both definition and access?
  4. Remove the top-level layer attribute (breaking) and the logic introduced to Viz in step 2

Questions to consider:

  1. Why was layer moved to the DataCatlog originally (in the ticket above)? Is there something that hasn't been addressed?
  2. Are there any oversights on moving the handling of the layer attribute to Viz?
  3. Are there any other users of the top-levellayer attribute?

An alternative approach:

Alternatively, we could just keep layer where it is and not process it if defined within the metadata. This doesn't address the DataCatalog processing an attribute that isn't used within Kedro, but allows for the introduction of the metadata attribute to be made without any additional implementation on the Kedro framework.

rashidakanchwala commented 1 year ago

Questions to consider: - Why was layer moved to the DataCatlog originally (in the ticket above)? Is there something that hasn't been addressed? This is just my assumption but layers have been on Kedro-viz since 2020 and prior to this there was no other extra information that was passed in catalog from kedro project to kedro-viz, that's why probably there was no need to have it nested? - not a great idea for sure.

Are there any oversights on moving the handling of the layer attribute to Viz? Let me revert on this.

Are there any other users of the top-level layer attribute? I doubt, it seems very specific to Kedro-viz.

rashidakanchwala commented 1 year ago

Currently the layer logic resides in Kedro Framework in DataCatalog (https://github.com/kedro-org/kedro/blob/main/kedro/io/data_catalog.py#L270-L275). Kedro basically sends to kedro-viz - all the layers as a dict in the format below

dict_items([('raw', {'companies', 'shuttles', 'reviews'}), ('intermediate', {'ingestion.int_typed_companies', 'ingestion.int_typed_shuttles@pandas2', 'ingestion.int_typed_reviews', 'ingestion.int_typed_shuttles@pandas1'}), ('primary', {'prm_spine_table', 'prm_shuttle_company_reviews'}) .... )])

On kedro-viz we simply read the above, and match the dataset name to the key it belongs to. (https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/data_access/repositories/catalog.py#L40-L47)

I suppose now this will change so Kedro will only send a dict called metadata to Kedro-viz. And Kedro-viz will extract all the layer information and map it correctly.

This is fine but I am not sure how to make it backward compatible in a clean way @merelcht @AntonyMilneQB ?