kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Document more advanced hook use cases #2690

Open astrojuanlu opened 1 year ago

astrojuanlu commented 1 year ago

Description

Hooks are stateful objects, which enables users to, for example, store the context in the after_context_created hook and use it later in a hook that doesn't receive it:

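from kedro.framework.hooks import hook_impl


class SomeHook:
    @hook_impl  # implementations are marked with hook_impl, not hook_spec
    def after_context_created(self, context):
        ...
        # xxx is a placeholder for whatever needs to be kept from the context
        self.my_azure_config = xxx

    @hook_impl
    def before_pipeline_run(self, pipeline):
        # Reuse the state stored by the earlier hook
        do_something_about_my_pipeline(pipeline, self.my_azure_config)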

(code sample by @antonymilne )

We should better document this.
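
To make the pattern concrete, here is a minimal sketch of a stateful hook class. The class name `ParamsAwareHooks`, the `target_env` parameter, and the `seen` list are hypothetical and exist only for illustration. The `try`/`except` fallback is an assumption that lets the sketch run without Kedro installed, and the hooks are called by hand at the end to simulate the order in which Kedro would call them.

```python
from types import SimpleNamespace

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch runs without Kedro installed (illustration only).
    def hook_impl(func):
        return func


class ParamsAwareHooks:
    """Hypothetical hook class that carries state between hook calls."""

    def __init__(self):
        self._params = {}
        self.seen = []

    @hook_impl
    def after_context_created(self, context):
        # Keep what later hooks will need; they do not receive the context.
        self._params = dict(context.params)

    @hook_impl
    def before_pipeline_run(self, run_params):
        # Use the state stored by the earlier hook.
        env = self._params.get("target_env", "unknown")
        self.seen.append((run_params.get("pipeline_name"), env))


# Simulate the order in which the hooks would be called during a run.
hooks = ParamsAwareHooks()
hooks.after_context_created(context=SimpleNamespace(params={"target_env": "staging"}))
hooks.before_pipeline_run(run_params={"pipeline_name": "__default__"})
print(hooks.seen)
```

This works because a single hook instance is registered (in a real project, typically through `HOOKS` in `settings.py`), so attributes set in one hook method are still there when a later one runs.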

Context

Follow up from gh-506 and other discussions, see for example https://www.linen.dev/s/kedro/t/12112601/does-anyone-know-if-there-is-a-reason-why-we-could-not-pass-#4e36a67f-36d3-4354-a3d5-4347a59ef28f

This is very useful when migrating projects from older versions of Kedro, customizing pipeline execution, and more.

noklam commented 1 year ago

I thought I had created a ticket for this, but I didn't. Thanks for creating this!

noklam commented 1 year ago

We need to explain WHY users need to do this, and give examples of how to do it.

noklam commented 1 year ago

https://github.com/Galileo-Galilei/kedro-mlflow/blob/845ad919c9dbd020e948e8adc2e0f9064de1ef68/kedro_mlflow/framework/hooks/mlflow_hook.py#L50-L63 is a good example. This gets asked in the intermediate training, so maybe we can create an example that showcases this.

astrojuanlu commented 1 year ago

Just today I wanted to apply such an "advanced" hook use case: storing the catalog and then injecting datasets on the fly. However, it doesn't work:

import logging

from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog, DataSetError

logger = logging.getLogger(__name__)


class MissingDatasetHooks:
    @hook_impl
    def after_catalog_created(self, catalog: DataCatalog):
        # Keep a reference to the catalog for later hooks
        self._catalog = catalog

    @hook_impl
    def before_dataset_loaded(self, dataset_name):
        dataset = self._catalog._get_dataset(dataset_name)
        try:
            dataset.load()
        except DataSetError:
            # Create MissingDataSet on the fly (user-defined class, not shown)
            logger.warning("Attempted to load dataset %s which doesn't exist yet, injecting it", dataset_name)
            missing_dataset = MissingDataSet(dataset=dataset)
            self._catalog.add(dataset_name, missing_dataset, replace=True)

The self._catalog that gets saved receives the .add(..., replace=True) call correctly, but the catalog.load that comes immediately after the before_dataset_loaded hook still sees the old dataset:

https://github.com/kedro-org/kedro/blob/fd8162d5bf384ef666c01ef2c529d01fd9fa8354/kedro/runner/runner.py#L403-L404

Context: I was trying to give a workaround for https://stackoverflow.com/q/76557758/554319.

Is this behavior expected?

(Using after_context_created gets the same result)
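
One possible explanation, which nothing in this thread confirms, is that the runner works on a copy of the catalog rather than on the object handed to the hook. The sketch below uses a hypothetical `ToyCatalog` class (not Kedro's `DataCatalog`) to show how, under that assumption, a replacement made through the stored reference would be invisible to the copy.

```python
class ToyCatalog:
    """Hypothetical stand-in for a data catalog; not Kedro's DataCatalog."""

    def __init__(self, datasets):
        # Each catalog owns its own name -> dataset mapping.
        self._datasets = dict(datasets)

    def add(self, name, dataset, replace=False):
        if name in self._datasets and not replace:
            raise ValueError(f"{name} is already registered")
        self._datasets[name] = dataset

    def load(self, name):
        return self._datasets[name]

    def shallow_copy(self):
        # New catalog object, new mapping, same dataset objects.
        return ToyCatalog(self._datasets)


original = ToyCatalog({"cars": "old dataset"})
stored_by_hook = original                 # what the hook keeps in self._catalog
used_by_runner = original.shallow_copy()  # assumption: the runner uses a copy

# The hook replaces the dataset through its stored reference.
stored_by_hook.add("cars", "replacement", replace=True)

print(stored_by_hook.load("cars"))
print(used_by_runner.load("cars"))
```

If this is indeed what happens, state stored in a hook remains usable, but mutating the stored catalog is not a reliable way to change what the runner loads.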

inigohidalgo commented 10 months ago

I've seen a couple of references to before_pipeline_created, both here and on Discord. Is this hook actually available? I can't find a reference to it anywhere in the docs.

astrojuanlu commented 10 months ago

Hmmm actually, I'm not sure it ever existed, maybe it's a typo? Does before_pipeline_run or after_catalog_created suit your needs?

inigohidalgo commented 10 months ago

It does, but it was introduced in 0.18.1 and I am on an earlier version. Granted, I am building the hooks to ease our transition to 0.18+, but if there was a hook already implemented which offered a similar API to test the functionality without needing to actually upgrade our project it would've been quicker to test.

Thanks

astrojuanlu commented 10 months ago

Hooks were introduced in 0.16.0 (cc75a1c7fdea6660b987aecd4b99bdd6234187ce), and a few more were added later on. Here's the list of hooks in 0.16.6, for example:

https://docs.kedro.org/en/0.16.6/07_extend_kedro/04_hooks.html#execution-timeline-hooks