kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
10.02k stars 906 forks

Grouping artifacts in the data catalog #4260

Open namedgraph opened 3 weeks ago

namedgraph commented 3 weeks ago

Description

I tried grouping the artifacts by introducing "namespaces" as the first level of config in YAML while moving the actual artifacts to the second level:

a_group_of_artifacts:
  outputs:
    type: ...

  errors:
    type: ...

and was planning to address the artifacts as a_group_of_artifacts:outputs, a_group_of_artifacts:errors etc.

But it turns out that Kedro does not support this?

DatasetError: An exception occurred when parsing config for dataset 'a_group_of_artifacts':
'type' is missing from dataset catalog configuration

Context

Our pipelines mostly augment the initial inputs, so we end up with a lot of similarly named artifacts (e.g. final_outputs, processed_outputs and other kinds of _outputs), which gets confusing. It feels like there should be a better way to group/namespace the artifacts.

Possible Implementation

Instead of treating the 1st-level YAML blocks as artifacts, why not traverse the levels recursively until a block with type is encountered, then treat that block as an artifact and treat the enclosing blocks purely as grouping?
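The traversal proposed here could look roughly like the sketch below. It is not part of Kedro: flatten_catalog is a hypothetical helper, the ":" separator follows the naming from the description, and the pandas.CSVDataset type and file paths are illustrative assumptions. It could be run as a preprocessing step on the loaded YAML before the flat result is handed to the catalog.

```python
from typing import Any


def flatten_catalog(
    config: dict[str, Any], sep: str = ":", prefix: str = ""
) -> dict[str, dict[str, Any]]:
    """Hypothetical helper: collapse nested groups into flat dataset entries."""
    flat: dict[str, dict[str, Any]] = {}
    for name, entry in config.items():
        full_name = f"{prefix}{sep}{name}" if prefix else name
        if not isinstance(entry, dict):
            raise ValueError(f"'{full_name}' is neither a group nor a dataset entry")
        if "type" in entry:
            # A block with 'type' is a dataset; stop descending here so that
            # nested options (load_args, save_args, ...) stay untouched.
            flat[full_name] = entry
        else:
            # A block without 'type' is treated as a group; recurse into it.
            flat.update(flatten_catalog(entry, sep, full_name))
    return flat


# Illustrative config mirroring the YAML in the description (types and paths are made up).
nested = {
    "a_group_of_artifacts": {
        "outputs": {"type": "pandas.CSVDataset", "filepath": "data/outputs.csv"},
        "errors": {"type": "pandas.CSVDataset", "filepath": "data/errors.csv"},
    }
}
flat = flatten_catalog(nested)
print(sorted(flat))
```

One caveat with this rule: a group can never contain a key literally named type, since that key is what marks a block as a dataset.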

Possible Alternatives

Maybe some other solution I don't know about? Not a Kedro expert...

lrcouto commented 3 weeks ago

Hey @namedgraph, thank you for your feature proposal. Your idea makes sense, but as of now, Kedro does not support grouping artifacts in the manner you describe, and interprets each top-level entry in the catalog as a separate data source with its own type definition.

For now, you can try to use Kedro dataset factories to reduce the number of similar catalog entries in your project.
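To illustrate the idea behind dataset factories: a single catalog entry whose name contains {placeholders} stands in for every dataset name that matches it, and the captured values are substituted into the entry. The sketch below imitates that behaviour in plain Python and is not Kedro's implementation; resolve_entry, the "{stage}_outputs" pattern, the pandas.CSVDataset type and the file path are all assumptions made for the example.

```python
import re
from typing import Any, Optional


def resolve_entry(
    patterns: dict[str, dict[str, Any]], dataset_name: str
) -> Optional[dict[str, Any]]:
    """Hypothetical resolver: return the first pattern entry matching the name."""
    for pattern, template in patterns.items():
        # Split "{stage}_outputs" into literals (even indices) and
        # placeholder names (odd indices), then build a regex from them.
        parts = re.split(r"\{(\w+)\}", pattern)
        regex = "".join(
            f"(?P<{part}>.+?)" if i % 2 else re.escape(part)
            for i, part in enumerate(parts)
        )
        match = re.fullmatch(regex, dataset_name)
        if match is None:
            continue
        fields = match.groupdict()
        # Substitute captured values into the top-level string settings only.
        return {
            key: value.format(**fields) if isinstance(value, str) else value
            for key, value in template.items()
        }
    return None


# One pattern entry replaces final_outputs, processed_outputs, and so on.
patterns = {
    "{stage}_outputs": {
        "type": "pandas.CSVDataset",
        "filepath": "data/{stage}_outputs.csv",
    }
}
print(resolve_entry(patterns, "final_outputs"))
```

In a real project the pattern entry would live in the catalog YAML and Kedro would do the matching itself, so no helper like this is needed.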

namedgraph commented 3 weeks ago

@lrcouto it feels inconsistent that one can nest YAML in parameters and use the parent:child syntax, but not in the catalog 🤷‍♂️