astrojuanlu opened 1 year ago
I've been trying to achieve this as well. It would be a great feature to add.
@gitgud5000 Can you detail a bit more what are you trying to achieve and why? We are trying to come up with use cases for this change.
@astrojuanlu I'm trying to pass additional save_args based on the output of the node that produces a pandas.SQLTableDataSet.
My plan was to modify the catalog in a before_dataset_saved hook.
Similar to this: https://github.com/kedro-org/kedro/discussions/910 This is also related: https://github.com/kedro-org/kedro/discussions/898
What I ended up doing was implementing a subclass of pandas.SQLTableDataSet.
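In case it helps others, the pattern looks roughly like the sketch below. `_StandInDataSet` is a hypothetical stand-in for `pandas.SQLTableDataSet` so the snippet stays self-contained; the real subclass would inherit from `pandas.SQLTableDataSet` and override `_save` the same way, merging extra `save_args` decided at run time:

```python
import copy

class _StandInDataSet:
    """Hypothetical stand-in for pandas.SQLTableDataSet (not the real API)."""

    def __init__(self, table_name, save_args=None):
        self._table_name = table_name
        self._save_args = copy.deepcopy(save_args) if save_args else {}
        self.last_save_args = None  # records the args the last _save used

    def _save(self, data):
        self.last_save_args = dict(self._save_args)

class DynamicSQLTableDataSet(_StandInDataSet):
    """Lets a node or hook attach extra save_args just before saving."""

    def __init__(self, table_name, save_args=None):
        super().__init__(table_name, save_args)
        self.runtime_save_args = {}  # set at run time, e.g. from a node output

    def _save(self, data):
        # runtime args win over the static ones from catalog.yml
        self.last_save_args = {**self._save_args, **self.runtime_save_args}

dataset = DynamicSQLTableDataSet("my_table", save_args={"if_exists": "append"})
dataset.runtime_save_args = {"chunksize": 500}
dataset._save(data=None)
# dataset.last_save_args == {"if_exists": "append", "chunksize": 500}
```

The key point is that the subclass owns the merge, so nothing in the catalog itself has to be mutated.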
I don't think mutating DataCatalog is a good thing to do. I believe the hook system was designed precisely to avoid this (correct me if I am wrong).
I am unsure whether this example is correct, though, because dataset.load() just loads the dataset without returning anything; how would it work? We may need an example to play with to investigate this.
There are, however, use cases where this is very useful. For example, if you are developing in a notebook environment, it is very convenient to be able to inject or change data to test your pipeline without re-running the whole pipeline.
I don't think mutating DataCatalog is a good thing to do.
That's fair enough, but that's not the impression I got when I saw that a DataCatalog.add method existed 🤔 How could we reduce confusion here?
That's fair enough, but that's not the impression I got when I saw that a DataCatalog.add method existed 🤔 How could we reduce confusion here?
I think we need to clarify what's not working here. DataCatalog.add should work in standalone mode. What doesn't work here is that catalog is not stored in KedroContext but rather hot-reloaded every time it gets accessed.
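To illustrate the point with a minimal, hypothetical stand-in (these are not Kedro's actual classes): if `catalog` is a property that rebuilds the catalog from config on every access, any in-place mutation lands on a throwaway instance and is silently lost:

```python
class StandInDataCatalog:
    """Hypothetical stand-in for kedro.io.DataCatalog."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def add(self, name, dataset, replace=False):
        self._datasets[name] = dataset

    def list(self):
        return sorted(self._datasets)

class StandInContext:
    """Stand-in context whose catalog is rebuilt on every access."""

    def __init__(self, config):
        self._config = config

    @property
    def catalog(self):
        # "hot-reload": a fresh catalog is materialised from config each time
        return StandInDataCatalog(self._config)

context = StandInContext({"companies": object()})
context.catalog.add("extra", object())  # mutates a throwaway instance
assert context.catalog.list() == ["companies"]  # "extra" is already gone
```

So `DataCatalog.add` itself works fine; it is the rebuilt-per-access property that makes the mutation appear to vanish.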
I think there are two conflicting features here and we need to think about it carefully. Hot-reloading the catalog is useful when you are editing catalog.yml on the side, but this became less important with %reload_kedro, hence #1460. However, if we are now steering towards using components separately, we may need to re-think what should be done here.
This was merged already, but as we approach 0.19, I want to revisit whether it's a good idea to freeze KedroContext. We should also consider how easy it is for plugin developers to use KedroContext. In the past we were more negative about storing context and state in hooks for later hooks to use. However, in https://github.com/kedro-org/kedro/issues/2690 we may want to document this. If this is not a bad idea, do we need to revisit the decision we made?
In any case, we should list out why we want to make this a singleton. I don't have the full context on why it was designed that way; it may be good to dig out the old issues and PRs, but I don't have them at hand now.
Another user was confused by this but found a workaround that worked for their use case: using runner.run(..., catalog=new_catalog)
https://linen-slack.kedro.org/t/16062852/hi-all-how-can-i-update-replace-catalog-entries-from-an-exis#e163f246-de25-405d-9462-7fbc757bf927
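That workaround sidesteps the reload problem entirely: instead of mutating the session's catalog, you build a copy with replaced entries and hand it to the runner explicitly. A sketch of the idea with hypothetical stand-in classes (the real code would use kedro.io.DataCatalog, which also exposes a shallow_copy method, and a runner such as SequentialRunner):

```python
class StandInCatalog:
    """Hypothetical stand-in for kedro.io.DataCatalog."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def shallow_copy(self):
        # copy the dataset mapping so replacements don't touch the original
        return StandInCatalog(self._datasets)

    def add(self, name, dataset, replace=False):
        if name in self._datasets and not replace:
            raise ValueError(f"dataset {name!r} is already registered")
        self._datasets[name] = dataset

def run(pipeline, catalog):
    """Stand-in for runner.run(pipeline, catalog=...): just report the datasets."""
    return dict(catalog._datasets)

original = StandInCatalog({"companies": "csv-dataset"})
patched = original.shallow_copy()
patched.add("companies", "in-memory-dataset", replace=True)

assert run(None, patched)["companies"] == "in-memory-dataset"
assert original._datasets["companies"] == "csv-dataset"  # original untouched
```

Passing the catalog explicitly keeps the mutation local to one run, which avoids the hook-ordering and hot-reload questions discussed above.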
Maybe the action item for this ticket is:

- DataCatalog immutability

Is there a good reason Kedro should start supporting this?
Today I wanted to apply an "advanced" hook use case: storing the catalog and then injecting datasets on the fly. However, it doesn't work:
the self._catalog that gets saved receives the .add(..., replace=True) correctly, but the catalog.load that comes immediately after the before_dataset_loaded hook still has the old dataset: https://github.com/kedro-org/kedro/blob/fd8162d5bf384ef666c01ef2c529d01fd9fa8354/kedro/runner/runner.py#L403-L404

Using after_context_created gets the same result.

Context: I was trying to give a workaround for https://stackoverflow.com/q/76557758/554319.
Is this behavior expected?
Originally posted by @astrojuanlu in https://github.com/kedro-org/kedro/issues/2690#issuecomment-1607746840
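One plausible explanation, consistent with the earlier point that the context's catalog is rebuilt on each access: the catalog instance the hook stored and the catalog instance the runner uses are different objects, so the injection lands on a copy the runner never sees. A hypothetical stand-in sketch (not Kedro's actual classes):

```python
class Catalog:
    """Hypothetical minimal catalog."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def add(self, name, dataset, replace=False):
        self._datasets[name] = dataset

    def load(self, name):
        return self._datasets[name]

class InjectingHooks:
    """Stand-in for a hook class that stores the catalog and injects datasets."""

    def after_context_created(self, catalog):
        self._catalog = catalog  # a reference to *one particular* instance

    def before_dataset_loaded(self, dataset_name):
        self._catalog.add(dataset_name, "injected", replace=True)

hooks = InjectingHooks()
# if the catalog is rebuilt per access, these end up as two distinct instances
hooks.after_context_created(Catalog({"companies": "original"}))
runner_catalog = Catalog({"companies": "original"})

hooks.before_dataset_loaded("companies")
assert hooks._catalog.load("companies") == "injected"  # the stored copy changed...
assert runner_catalog.load("companies") == "original"  # ...but the runner never sees it
```

That would explain why `.add(..., replace=True)` appears to succeed while the subsequent `catalog.load` still resolves the old dataset.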