**Open** · sbrugman opened this issue 9 months ago
Hi @sbrugman, I am having the same issue you mention here: I want more flexibility and to avoid copying common YAML configs over and over, but I can't find an easy way to do this in the current Kedro ecosystem.
Did you find an elegant solution?
I think playing with OmegaConf resolvers might make it work, but it makes things quite unreadable and complicated.
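For reference, here is roughly what a resolver-based workaround might look like. This is only a sketch: it assumes a custom `merge` resolver (the name is invented) that wraps `OmegaConf.merge`, registered through `CONFIG_LOADER_ARGS["custom_resolvers"]` in `settings.py`:

```yaml
# Shared block; the leading underscore keeps it out of the catalog proper.
_parquet_defaults:
  type: pandas.ParquetDataset
  load_args:
    engine: pyarrow

# Merge the shared block with per-dataset overrides via the custom resolver.
# Quoted so YAML does not try to parse the braces and colons itself.
companies: "${merge:${_parquet_defaults},{filepath: data/01_raw/companies.parquet}}"
```

Readable it is not, which is exactly the complaint above.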
I am adding an interesting Slack discussion here, where the interaction between factories and OmegaConf resolvers is discussed, so that we can have it as a reference on this topic:
Initial message:
Hi, is it possible to use a dataset factory in a config resolver? An example:

```yaml
"{name}_feature":
  type: pandas.ParquetDataset
  filepath: data/04_feature/{name}_feature.parquet
  metadata:
    pandera:
      schema: ${pa.python:my_kedro_project.pipelines.feature_preprocessing.schemas.{name}_feature_schema}
```
The above gives me this:

```
omegaconf.errors.GrammarParseError: mismatched input '{' expecting BRACE_CLOSE
    full_key: {name}_feature.metadata.pandera.schema
    object_type=dict
```
@sbrugman @MigQ2 sorry for the slightly slow reply here. We're going over old, unaddressed issues.
Would `globals.yml` work for your use case?
If `globals.yml` injects the YAML anchors into the same file under the hood, then that could work. Is that what you had in mind?
It's unclear to me whether YAML anchors can be shared across files, but variables and blocks of YAML definitely can. The way of using them deviates from normal YAML syntax, though, and I don't think the "inheritance" provided by YAML anchors (merge keys, `<<:`) is supported in OmegaConf.
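For context, a shared block in `globals.yml` would be referenced through the `${globals:...}` resolver. A minimal sketch (the `_parquet_defaults` key is invented), assuming the resolver can return a whole block rather than just a scalar:

```yaml
# conf/base/globals.yml
_parquet_defaults:
  type: pandas.ParquetDataset
  load_args:
    engine: pyarrow

# conf/base/catalog_features.yml
features: ${globals:_parquet_defaults}
```

Note this is whole-dict substitution: there is no way to override or add individual keys on top of it, which is exactly what anchors plus merge keys provide.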
There are currently at least three dynamic techniques to define the Data Catalog in Kedro:

- YAML anchors (and merge keys)
- variable interpolation via the OmegaConfigLoader
- dataset factories
There is a common usage pattern for which even a combination of these three is not expressive enough to satisfactorily specify the configuration.
Scenario
Each pipeline consists of a set of nodes of a few distinct types, e.g. source (read-only), intermediate (written locally) and final (written to some database).
Multiple people are working on different pipelines, so the catalog is split into one file per pipeline:
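For illustration (the pipeline names are invented; files matching `catalog*` are picked up by the default config patterns):

```
conf/base/catalog_ingestion.yml
conf/base/catalog_features.yml
conf/base/catalog_training.yml
```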
The split files provide an overview and prevent merge conflicts when working in parallel. So far so good.
Quickly, though, each of these files begins to look like this:
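(A representative sketch; the dataset names, types and paths are invented.)

```yaml
_source: &source
  type: pandas.ParquetDataset
  load_args:
    engine: pyarrow

_final: &final
  type: pandas.SQLTableDataset
  credentials: warehouse

companies_raw:
  <<: *source
  filepath: data/01_raw/companies.parquet

companies_scored:
  <<: *final
  table_name: companies_scored
```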
Each file contains more or less the same YAML anchors, which according to the spec cannot be shared across files. Is there another way to store this common information in a single place, instead of in every single file, while keeping one file per pipeline?
Variable interpolation, perhaps? Even though the OmegaConfigLoader can interpolate a dict, this (afaik) does not allow partially overriding an interpolated dict, the way we can in Python (or with the anchors above):
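(A sketch of the limitation; the `_defaults` key and dataset names are invented.)

```yaml
_defaults:
  type: pandas.ParquetDataset
  load_args:
    engine: pyarrow

# This works: the whole dict is substituted.
companies: ${_defaults}

# This does NOT work: there is no equivalent of "<<: *anchor" that splices
# the shared dict in and then lets you override or add individual keys.
reviews:
  <<: ${_defaults}
  filepath: data/01_raw/reviews.parquet
```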
The dataset factories also do not support this pattern. The following comes close, but is too restrictive:
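(Sketch; the pattern and paths are invented.)

```yaml
"{name}_source":
  type: pandas.ParquetDataset
  filepath: data/01_raw/{name}.parquet
  load_args:
    engine: pyarrow
```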
Restrictions:

- every dataset name must literally match the pattern, so names cannot be free-form aliases;
- all matching datasets get exactly the templated entry; individual entries cannot override or extend it.
Desiderata
The user needs to be able to template catalog entries across multiple files. It must be possible to overwrite individual entries, and it should be possible for the dataset name to be an alias (i.e. not forced to match a naming pattern).
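To make this concrete, a hypothetical syntax (not something Kedro supports today; the `template` key is invented purely for illustration) could look like:

```yaml
# conf/base/catalog_templates.yml -- shared across all pipelines
_templates:
  source:
    type: pandas.ParquetDataset
    load_args:
      engine: pyarrow

# conf/base/catalog_ingestion.yml
companies:                                  # free-form alias, no naming pattern required
  template: source                          # pull in the shared entry...
  filepath: data/01_raw/companies.parquet   # ...then override or extend individual fields
```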