kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.88k stars 897 forks source link

Allow extra hooks to be passed via the Kedro CLI #435

Closed deepyaman closed 3 years ago

deepyaman commented 4 years ago

Description

Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."

I should be able to run with or without a set of optional hooks. For example:

kedro run --hooks src.hookshot.hooks.TeePlugin

or

kedro run --hooks src.hookshot.hooks.TeePlugin,src.hookshot.hooks.CachePlugin

Context

Why is this change important to you? How would you use it? How can it benefit other users?

Not all hooks are core to functionality. Using something like a debugging hook shouldn't require a code change.

This is also helpful for testing different hooks as part of a CI/CD process.

Possible Implementation

See https://github.com/deepyaman/hookshot/commit/0f792f9112031fc2642d30e161c21a0b0c9f539d#diff-ebf803a458716ccda133fadc42e45057 (would add it to KedroContext though).

yetudada commented 4 years ago

Hi Deepyaman, thanks for creating this! If I play back what you would like to do, you would like to be able to use different Hooks in Kedro without modifying your code? So Kedro automatically discovers them?

deepyaman commented 4 years ago

Hey @yetudada! Kedro doesn't need to automatically discover them. For example:

kedro run --hooks src.hookshot.hooks.TeePlugin

or

kedro run --hooks src.hookshot.hooks.TeePlugin,src.hookshot.hooks.CachePlugin

If the project context already had hooks = (ExistingPlugin(),) defined, the latter would be equivalent to modifying the code to be hooks = (ExistingPlugin(), TeePlugin(), CachePlugin()).

WaylonWalker commented 4 years ago

Using something like a debugging hook shouldn't require a code change.

šŸ‘ Well said. It would be great to be able to run ad-hoc/one-off hooks on the fly without a code change.

For now/alternatively you could use an env variable to toggle what the hook does. This requires you to implement the reading of env variables in your hook, and implementing the hook on the pipeline.

limdauto commented 4 years ago

@deepyaman your timing is uncanny. We are building this as we speak šŸ˜‚

However, we are going with an opt-out model, hence @yetudada's question. Basically, after user installs a plugin, we will automatically register the plugin's hooks through a dedicated kedro.hooks entrypoint. This is similar to how global_commands and project_commands are working at the moment. This is the pytest's model. We will also provide a mechanism for users to disable hooks that they don't want to run on a per-run basis.

I'm curious what you and @WaylonWalker think about these two models. Do you think an opt-out (auto-discovery) model or an opt-in model (your proposal) would work better?

deepyaman commented 4 years ago

Off the top of my head, a potential issue I foresee with the auto-discovery model is users who install a bunch of plugin hooks for testing, orā€”shudderā€”use a single base conda env with Kedro installed for all their projects. This isn't a big issue for commands, because it's just your kedro -h output littered with commands you don't need; however, when it comes to plugins, do they need to uninstall or explicitly disable all the unused hooks? I'm curious what the mechanism to opt-out would be, maybe this isn't such a big deal.

I think opt-in is also easier for my use case, where I want to benchmark raw pipeline vs pipeline + plugin1 vs pipeline + plugin 2, and I like the explicitness, but these are probably smaller concerns from a dev perspective.

lorenabalan commented 4 years ago

Hi @deepyaman thank you for sharing, these are some really interesting questions and comments. I'm adding some of my thoughts, let me know what you make of them.

It feels like the opt-in / opt-out choice really just depends on how easy it is to opt out. Indeed uninstalling stuff would be a hassle. I saw pytest uses an env variable to blanket-disable auto-discovery, which we considered for Kedro. But they also have another environment variable to specify which plugins to use (or just a variable), so maybe something similar can be mirrored in Kedro (itā€™d be different conf environments for different plugin combinations, so for your benchmarking example you'd have 3 environments). Hooks behaviour is sth that feels too heavy to just be configured at runtime, because it strongly interacts with or change the run flow, to me itā€™d be safer as static config.

Thereā€™s a model in my head that makes me quite reticent about feeding hooks through the CLI, where itā€™s not a developer that messes with the job spec, so the less technical it is, or the less code knowledge you need to understand it, the better. The person on support doesnā€™t need to understand what hooks to activate, they can just switch to a different environment and job done, if the env is there. If not, a developer should create it, which feels intentional.

WaylonWalker commented 4 years ago

I really do like the simplicity of installing pytest plugins and they just work. For the most part though the auto-discovered plugins do not change my test. Typically those only change outside the scope of a single test (im thinking about pytest-watch, parallel runner, coverage, sugar). Things that actually change the execution of my test generally require me to add it in the code by function run, or fixture. I feel that kedro plugins are able to so easily fall into the second category and change the output of your pipeline that it should not be autodiscovered.

As much as we --shutter-- about users with a single base environment it's a common workflow. Much of the learning content focuses very heavily on the details of manipulating DataFrames that most new data scientists are missing out on how to manage their local dev environment, and where to place their solution in the codebase.

I think @lorenabalan brings up a good point about support. If I am running support on an issue I want to quickly get some information about it before making some big code changes. It would be nice to be able to quickly add a retry on fail, or full failure report hook without a change to the prod environment.

Whether this just sits in your teams template (hooks list) and turned off by default, or the framework allows it to be enabled with a flag isn't a big differnence to me. Both require the pre-thought that we may want some extra information from PROD at some point. Both solutions require me to reach into PROD to install the hook, making the code change to add it to the list is not a big deal. note my PROD env is using kedro-docker, other folks with a different prod solution may have a different opinion.

deepyaman commented 4 years ago

It feels like the opt-in / opt-out choice really just depends on how easy it is to opt out.

Agree that this is of paramount importance.

But they also have another environment variable to specify which plugins to use (or just a variable), so maybe something similar can be mirrored in Kedro

Would be happy with this. It's the current way of specifying hooks plus an alternative solution to this issue, which is to specify the same (almost same, as it seems to overwrite rather than append) list as an environment variable.

(itā€™d be different conf environments for different plugin combinations, so for your benchmarking example you'd have 3 environments).

Eeeh. I don't think I like only depending on this because Kedro only supports one level of inheritance in conf. If I already have a dev and prod env, and I want to run my benchmarks for dev, I need to create dev-hooklist1, dev-hooklist2, etc.? Maybe this is just a separate issue, since struggles with conf inheritance aren't unique to this use case.

Thereā€™s a model in my head that makes me quite reticent about feeding hooks through the CLI, where itā€™s not a developer that messes with the job spec, so the less technical it is, or the less code knowledge you need to understand it, the better. The person on support doesnā€™t need to understand what hooks to activate, they can just switch to a different environment and job done, if the env is there. If not, a developer should create it, which feels intentional.

From a support perspective, this makes sense. The current implementation only allows one set of hooks to be defined across envs, and your aforementioned suggestion to have hooks defined on a per-env basis makes this more flexible.

However, from a dev perspective, I want more control. Creating a new env mirroring a non-base env isn't trivial, so the process is heavy and annoying. On our last project, we were basically replacing files in config and reloading to get around this (despite already having a lot of copy-paste config at higher levels).

Can we support both? It's not like the extra_hooks implementation pattern is foreign to Kedro, as it's similar to how extra_params are handled.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

foxale commented 2 years ago

Auto-registration for plugin hooks is one thing, but imagine there's a pipeline with some great-validation checks in-between different steps, and maybe some additional profiling of the step's output data ā€” it would be extremely useful to switch those on and off via parameters of the kedro run command, just like in the original post.

deepyaman commented 2 years ago

@foxale I just happened to see the email notification for this, but do you either want to reopen this or create a more targeted issue for your use case (and link back here, if necessary)?