kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.48k stars 874 forks source link

`DataCatalog` can be mutated but changes are not reflected in the session #2728

Open astrojuanlu opened 1 year ago

astrojuanlu commented 1 year ago

Today I wanted to apply an "advanced" hook use case: storing the catalog and then injecting datasets on the fly. However, it doesn't work:

class MissingDatasetHooks:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog):
        self._catalog = catalog

    @hook_impl
    def before_dataset_loaded(self, dataset_name):
        dataset = self._catalog._get_dataset(dataset_name)
        try:
            dataset.load()
        except DataSetError:
            # Create EmptyDataset on the fly
            logger.warning("Attempted to load dataset %s which doesn't exist yet, injecting it", dataset_name)
            missing_dataset = MissingDataSet(dataset=dataset)
            self._catalog.add(dataset_name, missing_dataset, replace=True)

the self._catalog that gets saved receives the .add(..., replace=True) correctly, but the catalog.load that comes immediately after the before_dataset_loaded hook still has the old dataset:

https://github.com/kedro-org/kedro/blob/fd8162d5bf384ef666c01ef2c529d01fd9fa8354/kedro/runner/runner.py#L403-L404

Using after_context_created gets the same result.

Context: I was trying to give a workaround for https://stackoverflow.com/q/76557758/554319.

Is this behavior expected?

Originally posted by @astrojuanlu in https://github.com/kedro-org/kedro/issues/2690#issuecomment-1607746840

gitgud5000 commented 11 months ago

I've been trying to achieve this aswell. It would be a great feature to add.

astrojuanlu commented 11 months ago

@gitgud5000 Can you detail a bit more what are you trying to achieve and why? We are trying to come up with use cases for this change.

gitgud5000 commented 11 months ago

@astrojuanlu I'm trying to pass additional save_args based on the output of the node that creates a pandas.SQLTableDataSet as output.

My plan was to modify the catalog in a before_dataset_saved hook.

Similar to this: https://github.com/kedro-org/kedro/discussions/910 This is also related: https://github.com/kedro-org/kedro/discussions/898

gitgud5000 commented 11 months ago

What I ended up doing was implementing a subclass of pandas.SQLTableDataSet

noklam commented 11 months ago

I don't think mutating DataCatalog is a good thing to do. I believe the hook system was designed in a way to avoid this exactly (correct me if I am wrong).

I am unsure if this example correct though because dataset.load() just load the dataset but not returning anything, how would it work? Maybe need an example to play with to investigate this.

There are however use cases that is very useful. i.e. If you developing in a notebook environment, it is very convenient if you can inject/change data to test your pipeline without re-running the whole pipeline

astrojuanlu commented 11 months ago

I don't think mutating DataCatalog is a good thing to do.

That's fair enough, but not the impression that I got when I saw a DataCatalog.add method existed 🤔 How could we reduce confusion here?

noklam commented 11 months ago

That's fair enough, but not the impression that I got when I saw a DataCatalog.add method existed 🤔 How could we reduce confusion here?

I think we need to clarify what's not working here. DataCatalog.add should work in a standalone mode? What doesn't work here is because catalog is not stored in KedroContext but rather hot-reload everytime it gets called.

I think there are two conflicting features here and we need to think of it carefully.

In any case, we should list out why we want to make this a singleton. I don't have the full context why it was designed that way, may be good to dig out the old issues and PR, but I don't have it now.

astrojuanlu commented 7 months ago

Another user was confused by this but found a workaround that worked for their use case: using runner.run(..., catalog=new_catalog) https://linen-slack.kedro.org/t/16062852/hi-all-how-can-i-update-replace-catalog-entries-from-an-exis#e163f246-de25-405d-9462-7fbc757bf927

noklam commented 3 months ago

Maybe the action item for this ticket is:

Is there a good reason Kedro should start supporting this?