Closed: merelcht closed this issue 1 year ago
The top-level key is a much simpler solution to handle on the Kedro side, because it would only require parsing the `metadata` key and not the unlimited possibilities of keys that could go under it. Regardless of where we add the key parsing (e.g. on the individual dataset level, the catalog, or `AbstractDataset`), on the Kedro side we can just ignore whatever is under it, and the plugins can look specifically for the `metadata` key and then whatever sub-key they have chosen, e.g. `metadata.kedro-viz`.
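To make the "Kedro ignores it, plugins read their own sub-key" idea concrete, here is a minimal sketch. The function names (`split_metadata`, `plugin_section`) are illustrative only, not actual Kedro API:

```python
def split_metadata(config: dict) -> tuple[dict, dict]:
    """Return (dataset config without metadata, the metadata itself)."""
    config = dict(config)                  # don't mutate the caller's dict
    metadata = config.pop("metadata", {})  # the framework ignores the contents
    return config, metadata


def plugin_section(metadata: dict, plugin_key: str) -> dict:
    """A plugin looks only at its own sub-key, e.g. metadata["kedro-viz"]."""
    return metadata.get(plugin_key, {})
```

The framework only has to strip one well-known key; each plugin is responsible for interpreting its own namespace underneath it.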
I suggest we go for the name `metadata` for the top-level key, similar to how other tools refer to extra information about datasets:

- DVC has a `meta` key: https://dvc.org/doc/user-guide/project-structure/dvc-files#specificationmetadata
- Kubeflow attaches metadata to pipeline artifacts: https://www.kubeflow.org/docs/components/pipelines/v1/sdk-v2/v2-component-io/#use-v2-component-decorator-for-python-function-based-componentsmetadata
- Azure ML has "Edit Metadata" for datasets: https://learn.microsoft.com/en-us/azure/machine-learning/component-reference/edit-metadata

On the point of the `layer` key for the Kedro-Viz layers, I'd suggest we add a solution that can handle `layer` being both inside a `kedro-viz` key as well as on the level where it is now, to ensure that adding the `metadata` level doesn't break anything.
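The backwards-compatible lookup suggested above could be sketched like this (illustrative only, not Kedro-Viz's real code): prefer `metadata.kedro-viz.layer` and fall back to the current top-level `layer` key:

```python
# Hypothetical helper: read `layer` from either location, preferring the
# new metadata.kedro-viz nesting over the legacy top-level key.
def get_layer(dataset_config: dict):
    new_style = (
        dataset_config.get("metadata", {})
        .get("kedro-viz", {})
        .get("layer")
    )
    return new_style if new_style is not None else dataset_config.get("layer")
```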
An alternative option to the `metadata` key, mentioned by @idanov, would be:

It's still a vague idea, but something along the lines of extending the `@...` syntax. E.g. if you have an output called `output`, then you could have the following in your catalog:
```yaml
output:
  type: ....
  ....
output@plotly:
  type: ....
  ....
output@preview:
  type: ....
  ....
output@summary:
  type: ....
  ....
```
I found an old discussion about this on the private-kedro repo, where @lorenabalan suggested treating "metadata" similarly to credentials: "i.e. you can specify it in a separate file & interpolate (main preference), but technically you can also expand it within the catalog.yml entry".

So you could have a `catalog.yml` file and a `metadata.yml` file. I feel like that could be a good mid-way solution where the catalog file wouldn't get too bloated.
I think we've established now that there's a legitimate need for a way to add new attributes to datasets, with the dataset preview being one of the use cases.
@AntonyMilneQB and @idanov could you comment with what your preferred solution would be?
So if I summarise correctly, there are four approaches:

- Keeping everything in the `catalog.yml` entry, either under a top-level `metadata` field (A) or as free-form keys (B)
- `@` syntax (C)
- A separate `metadata.yml` file (D)

I want to ground this conversation in a look at an example. If we apply this approach to the `layer` attribute for Kedro-Viz, because this design should affect that too, it would make the examples look like:
```yaml
# catalog.yml
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  metadata:
    kedro-viz:
      layer: raw
```

```yaml
# catalog.yml
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  kedro-viz:
    layer: raw
```
I was confused about how the `...@` syntax would work here; would we have to specify `type` and `filepath`?
```yaml
# catalog.yml
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
companies@layer:
  type: pandas.CSVDataSet
  layer: raw
  filepath: data/01_raw/companies.csv
```
```yaml
# catalog.yml
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
```

```yaml
# metadata.yml
companies_metadata:
  layer: raw
```
tl;dr: I like the `metadata` top-level key approach without a separate `metadata.yml` file (@yetudada Approach A). The only question in my mind is how things are structured below that point, i.e. which of these do we prefer?
```yaml
# Option A1. Clear separation between different plugin responsibilities, gives more grouping.
dataset:
  type: ...
  metadata:
    kedro-viz:
      layer: raw
      preview: ...
```

```yaml
# Option A2. Less nested. Means that e.g. layer could be used in other plugins
# rather than being kedro-viz specific. Might lead to naming conflicts though.
dataset:
  type: ...
  metadata:
    layer: raw
    preview: ...
```
Probably option A1 is best, but we can worry about this later anyway. Basically, for the purposes of the Kedro framework we don't use anything that's in `metadata`, so we don't really care about the structure there. It can be up to the individual user/plugin author to decide what they call their keys and how they structure them.
`metadata.yml` file (@yetudada Approach D): no need to do this now, or maybe ever.

I was also looking for this discussion to remind myself of it, because it has many valuable points I think. So thank you for bringing it up 👍

I was a big fan of this at the time and still think it has its merits, but IMO its main advantage was that it would reduce repetition in `catalog.yml` by allowing you to inject shared metadata into many entries without having to explicitly repeat it across many catalog entries.

However, the situation is a bit different now compared to before, since we now have OmegaConf and are actively trying to solve https://github.com/kedro-org/kedro/issues/2423. Assuming we can solve that issue, repeating metadata across many catalog entries would be a much less significant problem.
The suggestion was to use the same mechanism to inject metadata that we already use for `credentials`. This behaviour of magically injecting credentials is already a little bit weird, but it makes sense in that special case (to stop you from leaking credentials). Since we now have another new mechanism for variable interpolation (OmegaConf), I am less keen to promote our own custom mechanism in another place.

If we do want to do that, then we can always add it later anyway, because putting metadata inline in `catalog.yml` should still remain possible. Otherwise I think we aren't solving the simple case here at all (@Galileo-Galilei: "I can create an extra configuration file where the user declares all these attributes, but this would be a much better user experience to be able to declare this schema in the catalog.").
`@` syntax (@yetudada Approach C): I prefer the `metadata` key.

To be honest I don't really understand this well enough to offer much of an informed opinion on it, but from what I see here I am not a big fan. A `metadata` key seems much more intuitive and obvious to users, and something similar is done in many other tools. The `@` syntax feels kind of weird and unnatural. To me, transcoding sort of makes sense as a special `@` syntax because it determines the topology of the execution graph, but the other bits of metadata don't (there could be particular confusion here with `@layer`, which affects the Kedro-Viz topology but not node execution order).

Purely from an implementation point of view it also sounds more complex to me (we'd need to parse all catalog entries with `@` in their name in a special way, doing transcoding in some cases but not in others) vs. just ignoring the `metadata` key.
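To illustrate the parsing complexity being described, here is a rough sketch of what the framework would have to do with `@`-suffixed entry names before building the catalog. Everything here is hypothetical (names and the `__main__` sentinel are made up for illustration), and it deliberately ignores how transcoding entries would be told apart from metadata entries, which is exactly the ambiguity at issue:

```python
from collections import defaultdict


def group_at_entries(catalog: dict) -> dict:
    """Group catalog entries by their base name, splitting on '@'."""
    grouped: dict = defaultdict(dict)
    for name, config in catalog.items():
        base, _, suffix = name.partition("@")
        if suffix:
            grouped[base][suffix] = config        # metadata-style entry
        else:
            grouped[base]["__main__"] = config    # the dataset itself
    return dict(grouped)
```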
Looks like I posted at the same time as @yetudada, but I basically agree with her here. Both my options 1 and 2 are sub-types of Approach A, which is my preferred one too.
I think the code snippets for Approach C and D should look more like this though:
I'm pretty sure you wouldn't repeat `type` and `filepath` here, though it would be good if @idanov could give an example to clarify, since I also don't really understand the idea.
```yaml
# catalog.yml
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
companies@layer:
  layer: raw
```
Here you would still need to provide `metadata: companies_metadata` in the catalog entry to give the mapping to `metadata.yml`. Just like `credentials`, it's not done automatically (e.g. if you want to provide the same metadata to many catalog entries):
```yaml
# catalog.yml
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  metadata: companies_metadata  # Need this line
```

```yaml
# metadata.yml
companies_metadata:
  layer: raw
```
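The credentials-style resolution described here could look roughly like this. This is a sketch under the assumptions above, not actual Kedro code; the function name and the shape of the two dicts are illustrative:

```python
# Hypothetical resolver: replace a string-valued `metadata` reference in a
# catalog entry with the matching entry loaded from metadata.yml, mirroring
# how credentials references are resolved by name.
def resolve_metadata(catalog: dict, metadata_entries: dict) -> dict:
    resolved = {}
    for name, config in catalog.items():
        config = dict(config)
        ref = config.get("metadata")
        if isinstance(ref, str):  # a named reference, not inline metadata
            config["metadata"] = metadata_entries[ref]
        resolved[name] = config
    return resolved
```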
Thanks @yetudada and @AntonyMilneQB for the comments! Looking at all the examples, my preference still goes to Approach A.
I like the idea of A2, described by Antony, but I worry about the clarity of its use. As a new user coming into a project, you'd have to go and find out which metadata key does what, because some might be `kedro-viz`-specific, another one `kedro-mlflow`-specific, etc. And since we'll let the plugins implement these, we wouldn't be able to document all key options on our side.
I also favor the A1 approach since that's the most intuitive one to me.
I share @AntonyMilneQB's concern about namespace conflicts, and whether we need to take an extra step to protect the namespace in case there are multiple plugins. But I am not overly concerned about this, since we also aren't handling this for plugins right now (potentially plugins can override each other and cause weird behaviour) and no one is complaining.
Just an additional argument against the `@` syntax: it would tightly couple the framework (catalog and `ConfigLoader`) and the datasets, because there would be some extra magic happening while loading the catalog with the `ConfigLoader`. This is not necessarily bad, but it creates some inconsistencies between the YAML and the Python API. While the Python API is not the recommended way to create datasets, I find it useful for a couple of use cases:
Regarding the extra file, this may become useful if plugins were to introduce very long and complex attributes (e.g. the entire schema of a big table), but it would be very easy to add later if needed. For now, as @AntonyMilneQB mentions, I feel as a Kedro user that it would be much more readable to have these attributes near the dataset they belong to, so I can see them at a glance.
Just for clarity: I'd vote for approach A1 too ;)
Thanks everyone for sharing your thoughts and preferences! It's very clear that everyone who commented here prefers the option with the top-level `metadata` key. I will now close this issue, as the decision on the syntax is made; the implementation will be done in https://github.com/kedro-org/kedro/issues/2440
Description
https://github.com/kedro-org/kedro/issues/1076
Context
A top-level key would make it clearer that any entries under it aren't core Kedro, but custom attributes that will be handled by a plugin. However, the top-level key does add extra nesting, and especially for the `layer` attribute this could be annoying for users.

What should the key be called: `metadata`, `custom_data`, ...? The key will be called `metadata`.
Possible Implementation
@Galileo-Galilei mentioned:
I'd go for a top-level key rather than storing everything low level. This will make it explicit to users that the information stored here is not provided by Kedro (it will save you a lot of time due to questions not related to Kedro, or possible "inconsistencies" between your documentation and their catalog due to third-party plugins).
For this key, I prefer the name metadata which is more explicit than other propositions and kind of a standard but honestly I'll be fine with other propositions.