kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.91k stars 900 forks source link

Decide on syntax to allow adding new attributes #2439

Closed merelcht closed 1 year ago

merelcht commented 1 year ago

Description

https://github.com/kedro-org/kedro/issues/1076

Context

or

my_dataset:
     type: ...
     filepath: ...
     kedro-viz: 
          layer: ...
     kedro-mlflow: ....
     kedro-great-expectations: ....

A top-level key would make it clearer that any entries under it aren't core Kedro, but custom attributes that will be handled by a plugin. However, the top-level key does add extra nesting, and especially for the layer attribute this could be annoying for users.

Possible Implementation

@Galileo-Galilei mentioned:

Possible Alternatives

(Optional) Describe any alternative solutions or features you've considered.

merelcht commented 1 year ago

The top-level key is a much simpler solution to handle on the Kedro side, because it would only require the parsing of the metadata key and not the unlimited possibilities of keys that could go under it. Regardless of where we add the key parsing e.g. on the individual dataset level, catalog, or AbstractDataset, on the Kedro side we can just ignore whatever is under it and the plugins can look specifically for the metadata key and then whatever sub-key they have chosen, e.g. metadata.kedro-viz.

I suggest we go for the name metadata for the top-level key, similar to how other tools refer to extra information about datasets:

On the point of the layer key for the kedro-viz layers, I'd suggest we add a solution that can handle the layer being both inside a kedro-viz key as well as on the level where it is now. Also to ensure that adding the metadata level doesn't break anything.

merelcht commented 1 year ago

An alternative option to the metadata key would be, mentioned by @idanov:

It's still a vague idea, but something along the lines of extending the @.... syntax E.g. if you have an output called output and then you can have in your catalog the following:

output:
    type: ....
....
output@plotly:
    type: ....
....
output@preview:
    type: ....
....
output@summary:
    type: ....
....
merelcht commented 1 year ago

I found an old discussion about this on the private-kedro repo where @lorenabalan suggested to treat "metadata" similarly to credentials: "i.e. you can specify it in a separate file & interpolate (main preference), but technically you can also expand it within the catalog.yml entry"

So you could have a catalog.yml file and metadata.yml file. I feel like that could be a good mid-way solution where the catalog file wouldn't get too much bloated.

I think we've established now that there's a legitimate need for a way to add new attributes to datasets, with the dataset preview being one of the use cases.

@AntonyMilneQB and @idanov could you comment with what your preferred solution would be?

yetudada commented 1 year ago

So if I summarise correctly, there are four approaches:

I want to ground this conversation in a look at an example. If we apply this approach to the layer attribute for Kedro-Viz, because this design should affect that too; it would make the examples look like:

Approach A

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  metadata:
    kedro-viz:
       layer: raw

Approach B

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  kedro-viz:
    layer: raw

Approach C

I was confused about how the ...@syntax would work here, would we have to specify type and filepath?

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

companies@layer: 
  type: pandas.CSVDataSet
  layer: raw
  filepath: data/01_raw/companies.csv

Approach D

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
# metadata.yml

companies_metadata:
  layer: raw

General thoughts

antonymilne commented 1 year ago

tl;dr: I like the metadata top-level key approach without a separate metadata.yml file (@yetudada Approach A). The only question in my mind is how things are structured below that point, i.e. which of these do we prefer?

# Option A1. Clear separation between different plugin responsibilities, gives more grouping.
dataset:
  type: ...
  metadata:
    kedro-viz:
      layer: raw
      preview: ...

# Option A2. Less nested. Means that e.g. layer could be used in other plugins rather than being kedro-viz specific.
# Might lead to naming conflicts though
dataset:
  type: ...
  metadata:
    layer: raw
    preview: ...

Probably option 1 is best, but we can worry about this later anyway. Basically for the purpose of kedro framework, we don't use anything that's in metadata so don't really care about the structure there. It can be up to the individual user/plugin author to decide what they call their keys and how they structure them.


metadata.yml file (@yetudada Approach D): no need to do this now or maybe ever

I was also looking for this discussion to remind myself on it because it's got many valuable points I think. So thank you for bringing it up 👍

I was a big fan of this at the time and still think it has its merits, but IMO the main advantage of it was that it would reduce repetition in catalog.yml by allowing you to inject repeated metadata into many entries without having to explicit repeat it across many catalog entries.

However, the situation is a bit different now compared to before since we now have omegaconfig and are actively trying to solve https://github.com/kedro-org/kedro/issues/2423. Assuming we can solve https://github.com/kedro-org/kedro/issues/2423 then the issue of repeating metadata across many catalog entries would be much less significant.

The suggestion was to use the same mechanism to inject metadata that we already use for credentials. This behaviour of magically injecting credentials is already a little bit weird but makes sense in that special case (to stop you from leaking credentials). Since we now have another new mechanism for variable interpolation (omegaconfig), I am now less keen to promote our own custom mechanism in another place.

If we do want to it that then we can always add it later anyway, because putting metadata inline in a catalog.yml should still remain possible. Otherwise I think we aren't solving the simple case here at all (@Galileo-Galilei "I can create an extra configuration file where the user declares all these attributes, but this would be a much better user experience to be able to declare this schema in the catalog.").

@ syntax (@yetudada Approach C): I prefer metdata key

To be honest I don't really understand this well enough to offer much of an informed opinion on it, but from what I see here I am not a big fan. A metadata key seems much more intuitive and obvious to users and something similar is done in many other tools. The @ syntax feels kind of weird and unnatural. To me, transcoding sort of makes sense to have a special @ syntax because it determines the topology of the execution graph, but the other bits of metadata don't (there could be particular confusion here with @layer, which affects the kedro-viz topology but not node execution order).

Purely from an implementation point of view it also sounds more complex to me too (need to parse all catalog entries with @ in their name in a special way, do transcoding in some cases but not in others) vs. just ignore the metadata key.

antonymilne commented 1 year ago

Looks like I posted at the same type as @yetudada but I basically agree with her here. Both my option 1 and 2 are sub-types of Approach A, which is my preferred one too.

I think the code snippets for Approach C and D should look more like this though:

Approach C

I'm pretty sure you wouldn't repeat type and filepath here, though would be good if @idanov could give an example to clarify since I also don't really understand the idea.

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

companies@layer: 
  layer: raw

Approach D

Here you would still need to provide metadata: companies_metadata in the catalog entry to provide the mapping to metadata.yml - just like credentials, it's not done automatically (e.g. if you want to provide same metadata to many catalog entries):

# catalog.yml

companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  metadata: companies_metadata  # Need this line
# metadata.yml

companies_metadata:
  layer: raw
merelcht commented 1 year ago

Thanks @yetudada and @AntonyMilneQB for the comments! Looking at all the examples, my preference still goes to Approach A.

I like the idea of A2, described by Antony, but I worry about the clarity of use of it. As a new user coming into a project, you’d have to go and find out which metadata key does what, because some might be kedro-viz specific, another one kedro-mlflow etc.. And since we’ll let the plugins implement these we wouldn’t be able to document all key options on our side.

noklam commented 1 year ago

I also favor the A1 approach since that's the most intuitive one to me.

I share some concern as @AntonyMilneQB about namespace conflicting - or do we need to take one step more to protect the namespace in case there are multiple plugins. But I am not overly concern about this, since for plugins we are also not handling this right now (potentially plugins can override each other and cause weird behavior) and no one is complaining.

Galileo-Galilei commented 1 year ago

Just an additional argument against the @ syntax: it would couple tightly the framework (catalog and ConfigLoader) and the datasets because there will have some extra magic happening while loading the catalog with the ConfigLoader. This is not necessarily bad, but this creates some inconsitencies between the yaml and the python API. While the python API is not the recommended way to create datasets, I find it useful for a couple of use cases:

Regarding the extra file, this may become something useful if plugins were to introduce very long and complex attributes (e.g. the entire schema of a big table) but this will be very easy to add if needed later. For now as @AntonyMilneQB mentions I feel as a kerdo user that it would be much more readable to have these attributes near the dataset they belong so we I can see at a glance.

Just for clarity : I'd vote for approach A1 too ;)

merelcht commented 1 year ago

Thanks everyone for sharing your thoughts and preferences! It's very clear that everyone that commented here prefers the option with the top-level metadata key. I will now close this issue as the decision on the syntax is made and the implementation will be done in https://github.com/kedro-org/kedro/issues/2440