kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.91k stars 900 forks source link

Hooks execution order is unclear when multiple hooks exists #2493

Closed noklam closed 10 months ago

noklam commented 1 year ago

Description

Short description of the problem here. A user report this in https://kedro-org.slack.com/archives/C03RKP2LW64/p1680617722865809?thread_ts=1680605927.073699&cid=C03RKP2LW64

In this case, the user's hook is interfering with kedro-telemetry, we use after_context_hook , catalog = context.catalog which collect some summary statistics about the project stats.

Context

How has this bug affected you? What were you trying to accomplish? It is confusing as the execution order is not linear, it's difficult to reason & debug once you have more than one hook.

We don't encourage hooks to communicate with each other, but they can via storing state in self.xxx, if it's not managed properly they can interfere with each other and cause obscure bug.

If users need to have finer control the order of hooks, currently user can define it explicitly in settings.py, but this doesn't solve the problem completely

Expected Result

Tell us what should happen. Normally you expected after_context_created hook is the earliest hook to be triggered.

Actual Result

Tell us what happens instead. When you have more than one hooks. In this case, this is what happened

  1. telemetry's after_context_created triggered -> it create a catalog within the hook
  2. after_catalog_created is trigger for the 1st time
  3. User's custom hook after_context_created triggered
  4. after_catalog_created is trigger for the 2nd time.
-- If you received an error, place it here.
-- Separate them if you have more than one.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

merelcht commented 1 year ago

Do some testing to find out what's really going on here and decide if it can be fixed. If not, write it up in docs so at least users know it's expected behaviour.

merelcht commented 1 year ago

See also https://github.com/kedro-org/kedro/issues/2492 (the same user who raised the issue on Slack)

astrojuanlu commented 1 year ago

Potentially useful:

The call time order is LIFO relative to registration as mentioned in the docs.

https://github.com/pytest-dev/pluggy/issues/250#issuecomment-578542612

noklam commented 1 year ago

@astrojuanlu I believe that's true - HOOKS is also LIFO, however, it becomes less clear when the hook is a pip-installable one. i.e. kedro-telemetry.

astrojuanlu commented 1 year ago

If I understand correctly, there are no guarantees about run order for plugin (entry points) hooks. That order can be influenced (but not forced) passing tryfirst=True or trylast=True to the HookimplMarker:

setuptools plugins (in arbitrary order, except those marked with try_first and try_last);

https://github.com/pytest-dev/pluggy/issues/64#issue-252704455

(docs: https://pluggy.readthedocs.io/en/stable/index.html#call-time-order)

I don't think it's properly documented upstream, but we could document it ourselves.

noklam commented 1 year ago

https://kedro-org.slack.com/archives/C03RKP2LW64/p1685454486286959

Another question came up recently, maybe it's a strong enough signal that we should fix the docs. The thread also mention 2 issues that included some advance examples of creating stateful hooks. We should consider adding that too.

noklam commented 10 months ago

Maybe we can close this because https://github.com/kedro-org/kedro-plugins/pull/422 is merged.

astrojuanlu commented 10 months ago

Indeed: https://github.com/getindata/kedro-azureml/issues/79

Closing! Hope we can tackle #1718 soon