kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.48k stars, 874 forks

Use user defined default dataset factory pattern over the one from the runner #3859

Status: Closed (ankatiyar closed this 1 month ago)

ankatiyar commented 1 month ago

Description

Fix #3720

Development notes

TODO:

I wanted to get opinions on the proposed solution before proceeding with the pending tasks listed above.

Proposed Solution

When DataCatalog is created (in from_config)

When a dataset is being fetched

When runner.run() is called

When kedro run is executed, the runner creates a shallow copy of the catalog object by calling DataCatalog.shallow_copy().

For the catalog CLI commands

Logic similar to the above has been copied over to match datasets against patterns.
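To make the intended precedence concrete, here is a rough, standalone sketch of "user-defined patterns win over the runner's default". It does not use Kedro's actual matching code: `pattern_to_regex`, `specificity`, and `resolve` are hypothetical helpers, the regex-based matching is a stand-in for whatever the catalog really does, and the dataset configs are only illustrative.

```python
import re

# Assumption: a placeholder looks like "{name}"; everything else is literal text.
PLACEHOLDER = re.compile(r"(\{\w+\})")


def pattern_to_regex(pattern):
    """Turn a factory pattern such as "{name}_raw" into a compiled regex."""
    pieces = []
    for part in PLACEHOLDER.split(pattern):
        if PLACEHOLDER.fullmatch(part):
            # "{name}" becomes a named group that matches any non-empty text.
            pieces.append(f"(?P<{part[1:-1]}>.+)")
        else:
            pieces.append(re.escape(part))
    return re.compile("".join(pieces))


def specificity(pattern):
    """Number of literal characters; a catch-all such as "{default}" scores 0."""
    return len(PLACEHOLDER.sub("", pattern))


def resolve(name, user_patterns, runner_patterns):
    """Return (pattern, config) for `name`, trying user patterns before runner ones."""
    for patterns in (user_patterns, runner_patterns):
        # Within one source, try the most specific pattern first.
        for pattern in sorted(patterns, key=specificity, reverse=True):
            if pattern_to_regex(pattern).fullmatch(name):
                return pattern, patterns[pattern]
    return None


user_patterns = {
    "{default}": {"type": "pandas.CSVDataset"},
    "{name}_raw": {"type": "pandas.ParquetDataset"},
}
runner_patterns = {"{default}": {"type": "MemoryDataset"}}

# The user's catch-all is found before the runner's default is ever consulted.
print(resolve("reviews", user_patterns, runner_patterns))
# Without any user patterns, the runner default is the fallback.
print(resolve("reviews", {}, runner_patterns))
```

The point of the sketch is only the lookup order: the runner's pattern is a last resort, not an override.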

Checklist

noklam commented 1 month ago

[Question for reviewers]: We should probably error out if the user has more than one catch-all pattern in this case. Does that make sense?

Yes
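For illustration, the check could look roughly like the sketch below. It assumes a "catch-all" is a pattern made up only of placeholders (for example `{default}`), with no literal characters; `is_catch_all` and `check_single_catch_all` are hypothetical names, not existing Kedro functions.

```python
import re

PLACEHOLDER = re.compile(r"\{\w+\}")


def is_catch_all(pattern):
    """Assumption: a catch-all has at least one placeholder and no literal text."""
    has_placeholder = PLACEHOLDER.search(pattern) is not None
    return has_placeholder and PLACEHOLDER.sub("", pattern) == ""


def check_single_catch_all(patterns):
    """Return the only catch-all pattern (or None); raise if there are several."""
    catch_alls = [p for p in patterns if is_catch_all(p)]
    if len(catch_alls) > 1:
        raise ValueError(
            f"Multiple catch-all dataset factory patterns found: {catch_alls}. "
            "Only one catch-all pattern is allowed."
        )
    return catch_alls[0] if catch_alls else None


print(check_single_catch_all(["{default}", "{name}_raw"]))  # the single catch-all

try:
    check_single_catch_all(["{default}", "{anything}"])
except ValueError as err:
    print(err)
```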

noklam commented 1 month ago

Just dumping my thoughts here in case I am blocking the PR. I have some questions about the shallow_copy approach, but that was done long before this PR, so I don't want to expand the discussion too much.

I think:

The most important question here is "should the runner's dataset patterns override the DataCatalog's?" In the framework setting, the way to override patterns is catalog.yml. However, the order in which the framework works is different:

  1. DataCatalog reads catalog.yml first
  2. Runner overrides the default patterns (memory dataset)

This suggests that one should use either catalog.yml or the runner, not both. Should it fail, or just ignore the dataset pattern from the runner? I think I would go for raising an error instead of ignoring the pattern silently.
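To show the two options side by side, here is a small sketch, not the actual DataCatalog or runner code. `combine_patterns` and its `strict` flag are hypothetical, and "default pattern" is assumed to mean a pattern consisting only of placeholders.

```python
import re


def _is_catch_all(pattern):
    # Assumption: a default (catch-all) pattern contains only placeholders.
    return bool(pattern) and re.sub(r"\{\w+\}", "", pattern) == ""


def combine_patterns(catalog_patterns, runner_patterns, strict=True):
    """Merge runner patterns into the catalog patterns.

    strict=True raises when both sides define a default pattern;
    strict=False silently drops the runner's default instead.
    """
    catalog_has_default = any(_is_catch_all(p) for p in catalog_patterns)
    combined = dict(catalog_patterns)
    for pattern, config in runner_patterns.items():
        if _is_catch_all(pattern) and catalog_has_default:
            if strict:
                raise ValueError(
                    f"Runner default pattern '{pattern}' conflicts with a "
                    "default pattern already defined in catalog.yml."
                )
            continue  # lenient mode: the runner's default is ignored
        # Never overwrite something the catalog already defines.
        combined.setdefault(pattern, config)
    return combined


catalog_patterns = {"{default}": {"type": "pandas.CSVDataset"}}
runner_patterns = {"{default}": {"type": "MemoryDataset"}}

# Lenient: the catalog.yml pattern survives, the runner's is dropped.
print(combine_patterns(catalog_patterns, runner_patterns, strict=False))

# Strict: the conflict is reported instead of being hidden.
try:
    combine_patterns(catalog_patterns, runner_patterns)
except ValueError as err:
    print(err)
```

The strict variant is what I would lean towards, since the lenient one hides the fact that the runner's pattern never takes effect.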