kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Best practices for sharing configuration across multiple catalog files #3625

Open sbrugman opened 9 months ago

sbrugman commented 9 months ago

There are currently at least three dynamic techniques to define the Data Catalog in Kedro: YAML anchors, variable interpolation (via the OmegaConfigLoader), and dataset factories.

There is a common usage pattern for which even a combination of these three is not expressive enough to specify the configuration satisfactorily.

Scenario

Each pipeline consists of a set of nodes whose datasets fall into a limited number of distinct types, e.g. source (read only), intermediate (written locally) and final (written to some database).

Multiple people are working on different pipelines, so the catalog is split in a file per pipeline:

conf/base/catalog/
    pipeline_a.yml
    pipeline_b.yml
    pipeline_c.yml

The split files provide an overview and prevent conflicts when working in parallel. So far so good.

Before long, each of these files starts to look like this:

# Templates
_source_table: &source_table
  type: datasets.SourceTable
  read_only: true
  metadata:
    kedro-viz:
      layer: source

# More templates…

_output_table: &output_table
  type: datasets.ProductionDB
  database: constant
  mode: overwrite
  metadata:
    kedro-viz:
      layer: primary

# Datasets
a_ds1:
  <<: *source_table
  database: hello
  table: world

# More datasets…

a_ds_15:
  <<: *output_table
  table: foobar

Each file contains more or less the same YAML anchors, which according to the spec cannot be shared across files. Is there another way to store this common information in a single file, instead of repeating it in every single one of them, while keeping one catalog file per pipeline?

Perhaps variable interpolation? Even though the OmegaConfigLoader can interpolate a dict, this (afaik) does not allow partially overriding a dict, as we can do in Python (or with the anchors above):

source = {"read_only": True}

my_table = {
    **source,
    "database": "db",
}
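
For reference, here is a minimal sketch (assuming the shared block lives in the same catalog file) of what the OmegaConfigLoader's templating does allow: individual keys can be interpolated one by one, but there is no partial-dict merge, so every shared key still has to be repeated in each entry.

# Same catalog file; keys starting with "_" are not added as datasets,
# so they can serve as templates with the OmegaConfigLoader
_source_table:
  type: datasets.SourceTable
  read_only: true

a_ds1:
  type: ${_source_table.type}            # each key must be referenced explicitly
  read_only: ${_source_table.read_only}
  database: hello
  table: world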

The dataset factories also do not support this pattern. The following would be close, but is too restrictive:

"{name}_source”:
  type: datasets.SourceTable
  read_only: true
  metadata:
    kedro-viz:
      layer: source
  Table: {name}

Restrictions:

Desiderata

The user needs to be able to template catalog entries across multiple files. It must be possible to override individual entries, and it should be possible for the dataset name to act as an alias.

MigQ2 commented 2 months ago

Hi @sbrugman, I am having the same issue you mention here: I want more flexibility and to avoid copying common YAML configs over and over, but I can't seem to find an easy way in the current Kedro ecosystem.

Did you find any elegant solution?

I think playing with OmegaConf resolvers might make it work, but it makes things quite unreadable and complicated.
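
For completeness, the resolver route might look roughly like the sketch below. It assumes a hypothetical custom resolver named merge registered through CONFIG_LOADER_ARGS (a supported OmegaConfigLoader option), and an OmegaConf version that accepts nested interpolations and dict literals as resolver arguments; it is an unverified sketch of the idea, and it does indeed become hard to read.

# src/<package_name>/settings.py -- rough sketch only
from kedro.config import OmegaConfigLoader
from omegaconf import OmegaConf

CONFIG_LOADER_CLASS = OmegaConfigLoader
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        # "merge" is a made-up name: combine a base mapping (e.g. a shared
        # block defined in globals.yml) with per-dataset overrides
        "merge": lambda base, overrides: OmegaConf.merge(base, overrides),
    },
}

# A catalog entry could then (in principle) be written as:
#
#   a_ds1: ${merge:${globals:source_table_defaults},{database: hello, table: world}}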

MigQ2 commented 1 month ago

I am adding an interesting Slack discussion here, where the interaction between factories and OmegaConf resolvers is discussed, so that we can have it as a reference on this topic:

https://linen-slack.kedro.org/t/22708331/hi-is-it-possible-to-use-a-dataset-factory-in-config-resolve

Initial message:

Hi, is it possible to use a dataset factory in config resolver? An example:

"{name}_feature":
 type: pandas.ParquetDataset
 filepath: data/04_feature/{name}_feature.parquet
 metadata:
   pandera:
     schema: ${pa.python:my_kedro_project.pipelines.feature_preprocessing.schemas.{name}_feature_schema}

The above gives me this:

omegaconf.errors.GrammarParseError: mismatched input '{' expecting BRACE_CLOSE
   full_key: {name}_feature.metadata.pandera.schema
   object_type=dict

astrojuanlu commented 1 week ago

@sbrugman @MigQ2 sorry for the slightly slow reply here. We're going over old, unaddressed issues.

Would globals.yml work for your use case?

sbrugman commented 1 week ago

If globals.yml injects the YAML anchors in the same file under the hood, then that could work. Is that what you had in mind?

astrojuanlu commented 1 week ago

Unclear to me if YAML anchors can be shared. But variables and blocks of YAML definitely can. The way of using them deviates from normal YAML syntax though, and I don't think the "inheritance" provided by YAML anchors is supported in OmegaConf.
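
To make that concrete, here is a minimal sketch of the globals.yml route, assuming the default OmegaConfigLoader and an illustrative source_table_defaults block. Shared values live in a single conf/base/globals.yml and can be referenced from any catalog file with the globals resolver, although still key by key rather than as a mergeable block like a YAML anchor.

# conf/base/globals.yml
source_table_defaults:
  type: datasets.SourceTable
  read_only: true

# conf/base/catalog/pipeline_a.yml
a_ds1:
  type: ${globals:source_table_defaults.type}
  read_only: ${globals:source_table_defaults.read_only}
  database: hello
  table: world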