kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Allow new attributes to be added to DataSets #1076

Closed: yetudada closed this issue 1 year ago

yetudada commented 2 years ago

Problem

The Data Catalog does not allow users to add custom attributes or metadata to datasets. This means that users have to work around the limitation, for example by creating custom datasets, leaving comments in the catalog, or duplicating parameters (see the user evidence below).

This request has surfaced in different forms, most commonly as a request to attach parameters directly to a dataset rather than keeping dataset-related parameters in parameters.yml (link). It has also shown up as users wanting to add optional tags to datasets and have a way to interact with their user-defined tags (link).

User evidence

Create their own custom datasets: "I have certain attributes to track within my datasets and have created custom DataSets to get around this issue. Now that hooks are out most of my reasons for custom DataSets are gone, and I can achieve the same thing with an after_node_run hook, but I still cannot attach custom attributes to datasets." - Waylon Walker (link) (A sketch of this hook-based workaround follows these quotes.)

Use comments in their catalog to help them keep track of associated metadata: "Am I the only one that leaves random notes in the catalog next to at least the raw datasets? Was thinking if there would be any usage of having a meta or info field." - Andrej Marsic

Connect Parameters and the Data Catalog: "There could easily be other cases where a node might need to know what parameter(s) a dataset was loaded with. You can always just duplicate the list in both yaml files, but it's more ideal to have the parameter specified in only one place, especially if it's some parameter you can play around with." - D. Sturgeon (link)
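
Waylon's hook-based workaround could be sketched roughly like this (minimal and hypothetical: the DATASET_ATTRIBUTES dict and its contents are stand-ins for attributes the catalog cannot yet carry):

from kedro.framework.hooks import hook_impl

# Hand-maintained attributes, kept outside the catalog because the catalog
# cannot carry them yet.
DATASET_ATTRIBUTES = {
    "shuttles": {"owner": "data-eng", "stage": "raw"},
}

class TrackDatasetAttributesHook:
    @hook_impl
    def after_node_run(self, node, outputs):
        # Report the attributes of every dataset the node just produced.
        for name in outputs:
            attrs = DATASET_ATTRIBUTES.get(name)
            if attrs:
                print(f"{node.name} wrote {name} with attributes {attrs}")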

Possible implementation

My suggestion is to let users attach arbitrary extra attributes to a dataset entry in the catalog. It could look something like this:

dataset_1:
  type: ...
  filepath: ...
  attribute_1: ...
  attribute_2: ...

or even:

dataset_1:
  type: ...
  filepath: ...
  parameters: # or whatever this name is
    attribute_1: ...
    attribute_2: ...

Questions

datajoely commented 2 years ago

I'll also highlight the great work dbt has done with adding documentation and tests to sources and models: https://docs.getdbt.com/docs/building-a-dbt-project/using-sources#testing-and-documenting-sources

WaylonWalker commented 2 years ago

I completely missed this issue. I'm excited to see this one happen. We had another use case this week: we are using some versioned datasets and would like to be able to create a plugin that cleans up some of the old versions, inspired by sanoid. It would be nice if we could pop some extra config into the catalog to specify a policy for how many versions to keep; then we can create a separate plugin that cleans these up in a scheduled run, and keep all of our configuration for each dataset together.

my_dataset:
  versioned: true
  extras: # whatever place you want to give us to configure plugins/extras
    cleanup_old_versions_plugin:
      frequent: 4
      daily: 7
      year: 1
      month: 1
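
A rough sketch of what such a cleanup could look like, assuming the hypothetical extras block above, Kedro's on-disk layout for versioned datasets (filepath/<timestamp>/...), and a retention policy collapsed to a single knob for brevity:

import shutil
from pathlib import Path

import yaml

def cleanup_old_versions(catalog_file: str) -> None:
    catalog = yaml.safe_load(Path(catalog_file).read_text())
    for entry in catalog.values():
        policy = (entry.get("extras") or {}).get("cleanup_old_versions_plugin")
        if not entry.get("versioned") or not policy:
            continue
        keep = policy.get("daily", 7)  # single retention knob, for illustration
        # Each save of a versioned dataset lives under filepath/<timestamp>/,
        # and timestamped directory names sort lexicographically by age.
        versions = sorted(p for p in Path(entry["filepath"]).iterdir() if p.is_dir())
        for old in versions[:-keep]:
            shutil.rmtree(old)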
deepyaman commented 2 years ago

Several other libraries allow users to add a metadata key to datasets, including Kubeflow Pipelines. I thought Azure ML Pipelines also allowed this; will update if I find it.

antonymilne commented 2 years ago

DVC allows this through a meta key: https://dvc.org/doc/user-guide/project-structure/dvc-files#specification

datajoely commented 2 years ago

So my next question is: then what? The user has provided some arbitrary metadata; what should Kedro do with it?

merelcht commented 2 years ago

This issue was discussed in a Technical Design session.

Among the open questions that need to be answered: should custom attributes be grouped under a single top-level key, e.g.

my_dataset:
  type: ...
  filepath: ...
  metadata:
    kedro-viz:
      layer: ...

or sit directly under per-plugin keys:

my_dataset:
  type: ...
  filepath: ...
  kedro-viz:
    layer: ...
  kedro-mlflow: ...
  kedro-great-expectations: ...

A top-level key would make it clearer that any entries under it aren't core Kedro, but custom attributes that will be handled by a plugin. However, the top-level key does add extra nesting, and especially for the layer attribute this could be annoying for users.

Next step:

Design an implementation proposal for allowing new attributes to be added to DataSets. Make sure to address the questions above. Discuss this proposal with plugin developers, e.g. @WaylonWalker and @Galileo-Galilei.

foxale commented 2 years ago

This is exactly what we were looking for today. We were thinking about parametrizing the pandas-profiling package in a way that lets you decide which dataset to profile and how.

The top-level key looks like a great solution. I would argue that the layer parameter makes more sense there too, especially since kedro-viz is in fact a separate plugin.

GabrieleCacchioni commented 2 years ago

This is something my team and I would love. Use case: display in kedro-viz, possibly placing datasets that share a tag close to each other (hard to do if you choose to allow nested tags, I believe).

Galileo-Galilei commented 1 year ago

Hi, I'd love to see this feature implemented. My main use case would be to add a schema (backed by pandera, pydantic, Great Expectations...) in my catalog. This would enable, among other things, validating data against that schema as it is loaded and saved.

I can create an extra configuration file where the user declares all these attributes, but it would be a much better user experience to be able to declare this schema in the catalog directly.
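
For concreteness, a minimal sketch of the validate-on-load half of this use case, with the schemas held in the hook for now (the SCHEMAS dict is a stand-in for whatever the catalog would eventually carry):

import pandera as pa
from kedro.framework.hooks import hook_impl

# Stand-in for schemas that would ideally be declared in the catalog.
SCHEMAS = {
    "shuttles": pa.DataFrameSchema({"id": pa.Column(int), "company": pa.Column(str)}),
}

class ValidateOnLoadHook:
    @hook_impl
    def after_dataset_loaded(self, dataset_name, data):
        schema = SCHEMAS.get(dataset_name)
        if schema is not None:
            schema.validate(data)  # raises a SchemaError on mismatch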

Regarding the open questions:

@merelcht Do you want me to draft a PR for this? I don't want to waste my time and yours if you do not have time to review, or if you think the technical design is not mature enough; but if you think we have a chance to include it in the core library in, say, a couple of months, I'd be glad to provide support for this.

merelcht commented 1 year ago

Hi @Galileo-Galilei, thanks for responding to this issue. I'd be very happy for you to start drafting an implementation. We agreed with the team that this is functionality we want to support. We simply haven't been able to prioritise this yet. So if we can speed this up by getting your help, I'm all for it! 👍 😄

merelcht commented 1 year ago

Hi @Galileo-Galilei, we'll be prioritising this task in our next sprint, so I just wanted to check if you'd made any progress or have any findings we should know about. In https://github.com/kedro-org/kedro/issues/2409 you mentioned you might need to change the AbstractDataSet. Why is that?

Galileo-Galilei commented 1 year ago

Hi @merelcht, as usual I overestimated the time I can allocate for this PR, sorry 😅

I played a bit with the syntax, and I ultimately wanted to add an extra metadata argument to the AbstractDataSet.__init__ method, but I was afraid it would force me to modify all the datasets.

merelcht commented 1 year ago

> Hi @merelcht, as usual I overestimated the time I can allocate for this PR, sorry 😅
>
> I played a bit with the syntax, and I ultimately wanted to add an extra metadata argument to the AbstractDataSet.__init__ method, but I was afraid it would force me to modify all the datasets.

No worries @Galileo-Galilei and thanks for getting back to my question! I personally think it would make sense to have the attribute on the individual datasets, but not on the AbstractDataSet. What do you think about that? It's still likely that we'll update all datasets, but at least it's not forced by a change to the AbstractDataSet.
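
To illustrate the option under discussion (all names here are hypothetical): each concrete dataset would accept and store an optional metadata argument, while AbstractDataSet stays untouched.

from typing import Any

from kedro.io import AbstractDataSet

class MyCSVDataSet(AbstractDataSet):
    def __init__(self, filepath: str, metadata: Any = None):
        self._filepath = filepath
        self.metadata = metadata  # opaque to Kedro itself; read by plugins/users

    def _load(self):
        ...

    def _save(self, data) -> None:
        ...

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "metadata": self.metadata}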

inigohidalgo commented 1 year ago

Hiya, just jumping in to add some feedback that arose from a conversation with @datajoely:

At the moment I'm looking into adding validation to some datasets, potentially using pandera. Pandera supports yml syntax (like so) for defining these validations, so if I've understood the purpose of this feature correctly, it could be a good match for this. One potential source of friction I see is that things like validations could change a lot more often than the dataset definition itself, and are also very verbose, so I was wondering whether it would make sense to somehow provide these additional attributes from a different file (e.g. validations.yml in this case) which could then be merged with catalog.yml at catalog creation.

My perspective was something like this, relying on OmegaConf's subkey merging:

catalog.yml:

sales_intermediate:
  filename: ...
  layer: ...
  save_kwargs: ...

validations.yml:

sales_intermediate:
  expected_schema:
    col1: ...
    col2: ...

And Joel's suggestion was something like this:

sales_intermediate:
  filename: ...
  layer: ...
  save_kwargs: ...
  validations: ${../validations.yml}
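
The sub-key merge being proposed here can be demonstrated in isolation with plain OmegaConf (toy values throughout):

from omegaconf import OmegaConf

# Same top-level dataset key in two configs, as in the two files above.
catalog = OmegaConf.create(
    {"sales_intermediate": {"filename": "sales.parquet", "layer": "intermediate"}}
)
validations = OmegaConf.create(
    {"sales_intermediate": {"expected_schema": {"col1": "int64", "col2": "str"}}}
)

merged = OmegaConf.merge(catalog, validations)
print(OmegaConf.to_yaml(merged))
# sales_intermediate:
#   filename: sales.parquet
#   layer: intermediate
#   expected_schema:
#     col1: int64
#     col2: str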
antonymilne commented 1 year ago

Hi @inigohidalgo, I agree it would be great to do this and indeed OmegaConf should enable us to inject yaml like this. Since we merge together all files matching the catalog pattern (by default "catalog*", "catalog*/**", "**/catalog*") it shouldn't make any difference which of those files you define your validation rules in, so long as it matches that pattern. So you'd have something like:

# catalog.yml
sales_intermediate:
  filename: ...
  metadata:
    validations: ${sales_intermediate_validation}  # no need to say which file this is defined in

# catalog_something.yml
sales_intermediate_validation:
  schema_type: dataframe
  version: {PANDERA_VERSION}
  columns:
    column1:
      title: null
      description: null
      dtype: int64
      nullable: false

The only catch I can think of here is that we might need something to prevent kedro from trying to process sales_intermediate_validation as a dataset definition (which would fail), e.g. you might need to prefix its name with _ so it gets ignored (we already do this for yaml anchors - see paragraph just above this).
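
As a quick sanity check, plain OmegaConf does resolve this kind of cross-entry interpolation (toy content, same key names as above):

from omegaconf import OmegaConf

cfg = OmegaConf.create(
    """
sales_intermediate:
  filename: sales.parquet
  metadata:
    validations: ${sales_intermediate_validation}
sales_intermediate_validation:
  schema_type: dataframe
"""
)
# The interpolation pulls the whole validation node into the dataset entry.
print(OmegaConf.to_container(cfg, resolve=True)["sales_intermediate"])
# {'filename': 'sales.parquet', 'metadata': {'validations': {'schema_type': 'dataframe'}}}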

inigohidalgo commented 1 year ago

In a way this is the same concept as yaml anchors, which we already use in our use cases for dataset definitions; we could do this without OmegaConf, even. I was wondering if you'd be able to do the merge the way I outlined it in my example, leveraging OmegaConf's nested sub-config merge. Laying it out like this should get around the issue of Kedro processing it as a dataset (I'm making an assumption about the order of operations in Kedro: that it does all this config merging and interpolating first, and only then instantiates the Catalog from the final catalog dictionary):

sales_intermediate:
  # from catalog.yml
  filename: ...
  layer: ...
  save_kwargs: ...
  # from validations.yml
  expected_schema: 
    col1: ...
    col2: ...

Note: I haven't read the entire issue in depth, but I notice you used a sub-key metadata: to indicate this additional metadata, which would mean that, to match the structure you wrote, validations.yml would need to be:

sales_intermediate:
  metadata:
    pandera_schema:
      version: ...
      columns: ...

But the point is the same :)

antonymilne commented 1 year ago

Oh I see - sorry I misunderstood your previous example. You're right that what I wrote above is just a cross-file extension of what yaml anchors can do.

I didn't think of doing it the way you suggested here but it makes perfect sense. Aside from the fact that metadata doesn't exist yet, this almost works perfectly already and should be easy for us to enable. Here's a toy example where you split the definition of a catalog entry into two files:

# catalog_types.yml
example_iris_data:
  type: pandas.CSVDataSet

# catalog_filepaths.yml
example_iris_data:
  filepath: data/01_raw/iris.csv

Right now the config loader will throw an error about duplicate keys, but if we remove the OmegaConfigLoader's call to self._check_duplicates(seen_file_to_keys) (which was probably just inherited from the original ConfigLoader and shouldn't be there, for exactly this reason), then it will merge these two entries together to give the full one, as you'd expect:

example_iris_data:
  type: pandas.CSVDataSet
  filepath: data/01_raw/iris.csv

So yes, in future, I think this should work exactly like you say 🙂 The only thing you'd need to be careful of is to make sure that your validations.yml file gets picked up by the config loader by naming it something that matches the pattern (by default "catalog*", "catalog*/**", "**/catalog*", but this should be customisable).

yetudada commented 1 year ago

@WaylonWalker @inigohidalgo @foxale @GabrieleCacchioni We're excited to announce that this feature has been added and will be released soon. #2537 introduced the changes: the new metadata attribute is ignored by Kedro itself, but may be consumed by users or external plugins.

It works in the following way:

shuttles:
  type: pandas.ExcelDataSet
  filepath: data/01_raw/shuttles.xlsx
  metadata:
    plugin_1: 
      attribute: ...
    plugin_2:
      attribute: ...
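
One way a plugin or user might consume this block, sketched as a hook (plugin_1 is just the placeholder name from the YAML above, and attribute access via catalog.datasets is one way to reach the dataset instance):

from kedro.framework.hooks import hook_impl

class ReadMetadataHook:
    @hook_impl
    def after_catalog_created(self, catalog):
        # The metadata block from the YAML ends up on the dataset instance,
        # untouched by Kedro.
        shuttles = catalog.datasets.shuttles
        if shuttles.metadata:
            print(shuttles.metadata.get("plugin_1"))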

I'll close this issue but please shout when you use it ❤️ Thank you so much for giving us feedback on this.