`DataCatalog` can be mutated but changes are not reflected in the session

astrojuanlu commented 1 year ago

Today I wanted to apply an "advanced" hook use case: storing the catalog and then injecting datasets on the fly. However, it doesn't work:

class MissingDatasetHooks:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog):
        self._catalog = catalog

    @hook_impl
    def before_dataset_loaded(self, dataset_name):
        dataset = self._catalog._get_dataset(dataset_name)
        try:
            dataset.load()
        except DataSetError:
            # Create EmptyDataset on the fly
            logger.warning("Attempted to load dataset %s which doesn't exist yet, injecting it", dataset_name)
            missing_dataset = MissingDataSet(dataset=dataset)
            self._catalog.add(dataset_name, missing_dataset, replace=True)

the self._catalog that gets saved receives the .add(..., replace=True) correctly, but the catalog.load that comes immediately after the before_dataset_loaded hook still has the old dataset:

https://github.com/kedro-org/kedro/blob/fd8162d5bf384ef666c01ef2c529d01fd9fa8354/kedro/runner/runner.py#L403-L404

Using after_context_created gets the same result.

Context: I was trying to give a workaround for https://stackoverflow.com/q/76557758/554319.

Is this behavior expected?

Originally posted by @astrojuanlu in https://github.com/kedro-org/kedro/issues/2690#issuecomment-1607746840

gitgud5000 commented 11 months ago

I've been trying to achieve this aswell. It would be a great feature to add.

astrojuanlu commented 11 months ago

@gitgud5000 Can you detail a bit more what are you trying to achieve and why? We are trying to come up with use cases for this change.

gitgud5000 commented 11 months ago

@astrojuanlu I'm trying to pass additional save_args based on the output of the node that creates a pandas.SQLTableDataSet as output.

My plan was to modify the catalog in a before_dataset_saved hook.

gitgud5000 commented 11 months ago

What I ended up doing was implementing a subclass of pandas.SQLTableDataSet

noklam commented 11 months ago

I don't think mutating DataCatalog is a good thing to do. I believe the hook system was designed in a way to avoid this exactly (correct me if I am wrong).

I am unsure if this example correct though because dataset.load() just load the dataset but not returning anything, how would it work? Maybe need an example to play with to investigate this.

There are however use cases that is very useful. i.e. If you developing in a notebook environment, it is very convenient if you can inject/change data to test your pipeline without re-running the whole pipeline

astrojuanlu commented 11 months ago

I don't think mutating DataCatalog is a good thing to do.

That's fair enough, but not the impression that I got when I saw a DataCatalog.add method existed 🤔 How could we reduce confusion here?

noklam commented 11 months ago

That's fair enough, but not the impression that I got when I saw a DataCatalog.add method existed 🤔 How could we reduce confusion here?

I think we need to clarify what's not working here. DataCatalog.add should work in a standalone mode? What doesn't work here is because catalog is not stored in KedroContext but rather hot-reload everytime it gets called.

https://github.com/kedro-org/kedro/issues/1460

I think there are two conflicting features here and we need to think of it carefully.

Hot-reload feature, which is useful for notebook because it saves you from keep creating new catalog when you are editing catalog.yml on the side. This because less important with %reload_kedro, thus #1460. However, if we are now steering to using components separately. We may need to re-think what should be done here.
- 2819
Immutable KedroContext - https://github.com/kedro-org/kedro/issues/1459, I re-open this issue recently.
- This was merged already, but as we approaching 0.19, I want to revise if it's a good idea to freeze KedroContext. We should also consider how easy for plugin developer to use KedroContext. In the past we were more negative about storing context and states in Hooks for following hooks. However in https://github.com/kedro-org/kedro/issues/2690, we may want to document this. If this is not a bad idea, do we need to re-visit the decision we made?

In any case, we should list out why we want to make this a singleton. I don't have the full context why it was designed that way, may be good to dig out the old issues and PR, but I don't have it now.

astrojuanlu commented 7 months ago

Another user was confused by this but found a workaround that worked for their use case: using runner.run(..., catalog=new_catalog) https://linen-slack.kedro.org/t/16062852/hi-all-how-can-i-update-replace-catalog-entries-from-an-exis#e163f246-de25-405d-9462-7fbc757bf927

noklam commented 3 months ago

Maybe the action item for this ticket is:

Create documentation to explain DataCatalog immutability

Is there a good reason Kedro should start supporting this?

kedro-org / kedro

`DataCatalog` can be mutated but changes are not reflected in the session #2728

2819