Use user defined default dataset factory pattern over the one from the runner

ankatiyar commented 1 month ago

Description

Fix #3720

Development notes

TODO:

[x] Update DataCatalog code
[x] Update CLI commands code
[x] Docs
[x] Test
[ ] Release notes

Wanted to get opinions on the proposed solution before proceeding with the pending tasks mentioned above^

Proposed Solution

When `DataCatalog` is created (in `from_config`)

If the last pattern in the sorted patterns list is a catch-all pattern i.e. has specificity == 0, it is popped out and stored separately as the DataCatalog._default_pattern (This would be the user-defined default pattern, not from the runner)

[!NOTE]
This assumes that there is only one catch-all pattern in the user's catalog. If there are more, the last one will be popped which wouldn't ever be used anyway.❓ [Question for reviewers]: We should probably error out if the user has more than one catch-all patterns in this case, does that make sense?

When a dataset is being fetched

Try to match is against the patterns in DataCatalog._dataset_patterns OR DataCatalog._default_pattern
Resolve config again first from _dataset_patterns OR _default_pattern

When runner.run() is called

When kedro run is executed, the runner creates a shallow copy of the catalog object by calling the DataCatalog.shallow_copy()

Here the runner inserts a default dataset pattern e.g. extra_dataset_patterns = {'{default}': MemoryDataset}
IF there is already a _default_pattern - it's not needed to do anything
IF there isn't a user defined catch-all pattern - add it to the DataCatalog._dataset_patterns (this will then not emit a warning about the catch-all pattern being used over the runner default dataset)

For the catalog CLI commands

Similar logic as above has been copied for the matching of datasets to patterns

Checklist

[ ] Read the contributing guidelines
[ ] Signed off each commit with a Developer Certificate of Origin (DCO)
[ ] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
[ ] Updated the documentation to reflect the code changes
[ ] Added a description of this change in the RELEASE.md file
[ ] Added tests to cover my changes
[ ] Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

noklam commented 1 month ago

[Question for reviewers]: We should probably error out if the user has more than one catch-all patterns in this case, does that make sense?

Yes

noklam commented 1 month ago

Just dumping my thought here in case I am blocking the PRs. I have some questions about the shallow_copy approach, but that was done long before this PR so I don't want to expand the discussion too much.

I think:

There should be only 1 default dataset, validation should happen when DataCatalog is initialised. (I think this is consistent as the shallow_copy will create a new DataCatalog, this is where I find the shallow_copy weird as it does not seem to be a reference but completely new object
multiple dataset pattern by Runner is fine. As long as only one of them is default datasets. For example, you could have different default for different namespaces which control by runner.

The most important question here is "should runner dataset patterns override DataCatalog"? In framework setting, the way to override patterns is catalog.yml. However, the order of how framework work is different.

DataCatalog read catalog.yml first
Runner override default patterns (memory dataset)

This suggests that one should only use either catalog.yml or runner. Should it fails or just ignore the dataset pattern from runner? For me, I think I will go for raising an error instead of ignoring the pattern sliently.

kedro-org / kedro