kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.91k stars 900 forks source link

`after_context_created` not the first hook to run, but is fixed when `kedro-telemetry` is uninstalled #2492

Closed melvinkokxw closed 1 year ago

melvinkokxw commented 1 year ago

Description

after_context_created should be the earliest hook triggered (as per Kedro's documentation) but after_catalog_created is triggered before it.

When I uninstalled kedro-telemetry (to fix a separate issue), the hooks are triggered in the expected order.

Raising the bug here instead of the kedro-plugins repo as I cannot confirm if kedro-telemetry causes the bug

Context

I was doing a kedro run and trying to read parameters in a hook, expecting the after_context_created hook to be triggered first. Instead, after_catalog_created was triggered before it.

Steps to Reproduce

  1. Create hooks and add them in settings.py
  2. Do a kedro run

Expected Result

after_context_created hook should be the earliest hook to be triggered

Actual Result

after_catalog_created is triggered before the after_context_created hook

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

from kedro.config import TemplatedConfigLoader

from model_ops.hooks import MLFlowHooks

HOOKS = (MLFlowHooks(),)

CONFIG_LOADER_CLASS = TemplatedConfigLoader

CONFIG_LOADER_ARGS = {
    "config_patterns": {
        "mlflow_catalog": ["mlflow/*_catalog.yml"],
        "mlflow_parameters": ["mlflow/*_parameters.yml"],
    },
    "globals_pattern": "*globals.yml",
}
noklam commented 1 year ago

@melvinkokxw Thanks for reporting this. In a kedro run, the catalog is created by the context so context hook is the earliest triggered hook.

Can you share more how's your hook look like and do you still see the same issue when removed telemetry?

melvinkokxw commented 1 year ago

@noklam Here is the hook I'm using:

class MLFlowHooks:
    """
    Initialise MLFlow and log useful tracking info automatically
    """

    @hook_impl
    def after_context_created(self, context):
        """Loads MLFlow catalog and parameters if MLFlow is enabled

        Args:
            context: Kedro context
        """
        if not self.mlflow_enabled:
            return

        context.config_loader["parameters"] = {
            **context.config_loader["parameters"],
            **context.config_loader["mlflow_parameters"],
        }
        context.config_loader["catalog"] = {
            **context.config_loader["catalog"],
            **context.config_loader["mlflow_catalog"],
        }

    @hook_impl
    def after_catalog_created(self, feed_dict):
        """
        Check if MLFLow is enabled

        Args:
            feed_dict: Kedro feed dict

        Returns:

        """
        parameters = feed_dict["parameters"]

        # MLFlow enabled?
        self.mlflow_enabled = parameters.get("mlflow_enabled", False)

When I removed kedro-telemetry, after_context_created was triggered first. When I reinstall kedro-telemetry, after_catalog_created was triggered first.

astrojuanlu commented 1 year ago

Some debugging:

after_context_created, the one supposed to be the earliest, is called in KedroSession.load_context:

https://github.com/kedro-org/kedro/blob/3439e72ef9ef79095da35db88d0584e684299e09/kedro/framework/session/session.py#L275-L277

after_catalog_created is called by either accessing KedroContext.catalog or calling KedroSession.run:

https://github.com/kedro-org/kedro/blob/3439e72ef9ef79095da35db88d0584e684299e09/kedro/framework/context/context.py#L295-L302

Note that KedroSession.run calls .load_context() (hence triggering after_context_created) before triggering after_catalog_created.

However, kedro-telemetry hooks on before_command_run, which is a CLI hook that runs earlier:

https://github.com/kedro-org/kedro/blob/3439e72ef9ef79095da35db88d0584e684299e09/kedro/framework/cli/cli.py#L134-L145

Working on gh-1942 to better understand what's happening.

noklam commented 1 year ago

I think this was answer in slack but I can't find the thread now.

astrojuanlu commented 1 year ago

https://www.linen.dev/s/kedro/t/10245031/hi-kedro-team-users-i-found-two-unusual-behaviours-with-kedr#6be97170-46ab-4562-8395-425b62f779b8

What happen is this - catalog is a read-only object, everytime context.catalog get called it get created and trigger the after_catalog_created hook. In the telemetry hook after_context_created it created catalog, so it trigger the after_catalog_created before your MLFlowHook’s after_context_created

noklam commented 1 year ago

@melvinkokxw I am closing this for now. The short answer to this is the execution orders are not clear when you have multiple hooks exist. In this case, kedro-telemetry try to read catalog so it triggers after_catalog_created.

What happens roughly

  1. kedro-telemtry after_context_created read catalog
  2. after_catalog_created is triggered
  3. after_context_created is triggered twice in your custom hook.

So your hook isn't really the "first hook". We need to document this better but this is not a bug.

Related issues:

astrojuanlu commented 1 year ago

This goes a bit deeper into the design of KedroContext but maybe, given that it's immutable in develop (#1465), .catalog could be pre-computed, hence after_catalog_created would always be triggered before after_context_created, reducing ambiguity.