Closed deepyaman closed 3 years ago
Hi Deepyaman, thanks for creating this! If I play back what you would like to do, you would like to be able to use different Hooks in Kedro without modifying your code? So Kedro automatically discovers them?
Hey @yetudada! Kedro doesn't need to automatically discover them. For example:
kedro run --hooks src.hookshot.hooks.TeePlugin
or
kedro run --hooks src.hookshot.hooks.TeePlugin,src.hookshot.hooks.CachePlugin
If the project context already had hooks = (ExistingPlugin(),)
defined, the latter would be equivalent to modifying the code to be hooks = (ExistingPlugin(), TeePlugin(), CachePlugin())
.
Using something like a debugging hook shouldn't require a code change.
š Well said. It would be great to be able to run ad-hoc/one-off hooks on the fly without a code change.
For now/alternatively you could use an env variable to toggle what the hook does. This requires you to implement the reading of env variables in your hook, and implementing the hook on the pipeline.
@deepyaman your timing is uncanny. We are building this as we speak š
However, we are going with an opt-out model, hence @yetudada's question. Basically, after user installs a plugin, we will automatically register the plugin's hooks through a dedicated kedro.hooks
entrypoint. This is similar to how global_commands
and project_commands
are working at the moment. This is the pytest's model. We will also provide a mechanism for users to disable hooks that they don't want to run on a per-run basis.
I'm curious what you and @WaylonWalker think about these two models. Do you think an opt-out (auto-discovery) model or an opt-in model (your proposal) would work better?
Off the top of my head, a potential issue I foresee with the auto-discovery model is users who install a bunch of plugin hooks for testing, orāshudderāuse a single base conda env with Kedro installed for all their projects. This isn't a big issue for commands, because it's just your kedro -h
output littered with commands you don't need; however, when it comes to plugins, do they need to uninstall or explicitly disable all the unused hooks? I'm curious what the mechanism to opt-out would be, maybe this isn't such a big deal.
I think opt-in is also easier for my use case, where I want to benchmark raw pipeline vs pipeline + plugin1 vs pipeline + plugin 2, and I like the explicitness, but these are probably smaller concerns from a dev perspective.
Hi @deepyaman thank you for sharing, these are some really interesting questions and comments. I'm adding some of my thoughts, let me know what you make of them.
It feels like the opt-in / opt-out choice really just depends on how easy it is to opt out. Indeed uninstalling stuff would be a hassle. I saw pytest
uses an env variable to blanket-disable auto-discovery, which we considered for Kedro. But they also have another environment variable to specify which plugins to use (or just a variable), so maybe something similar can be mirrored in Kedro (itād be different conf environments for different plugin combinations, so for your benchmarking example you'd have 3 environments). Hooks behaviour is sth that feels too heavy to just be configured at runtime, because it strongly interacts with or change the run flow, to me itād be safer as static config.
Thereās a model in my head that makes me quite reticent about feeding hooks through the CLI, where itās not a developer that messes with the job spec, so the less technical it is, or the less code knowledge you need to understand it, the better. The person on support doesnāt need to understand what hooks to activate, they can just switch to a different environment and job done, if the env is there. If not, a developer should create it, which feels intentional.
I really do like the simplicity of installing pytest plugins and they just work. For the most part though the auto-discovered plugins do not change my test. Typically those only change outside the scope of a single test (im thinking about pytest-watch, parallel runner, coverage, sugar). Things that actually change the execution of my test generally require me to add it in the code by function run, or fixture. I feel that kedro plugins are able to so easily fall into the second category and change the output of your pipeline that it should not be autodiscovered.
As much as we --shutter-- about users with a single base environment it's a common workflow. Much of the learning content focuses very heavily on the details of manipulating DataFrames that most new data scientists are missing out on how to manage their local dev environment, and where to place their solution in the codebase.
I think @lorenabalan brings up a good point about support. If I am running support on an issue I want to quickly get some information about it before making some big code changes. It would be nice to be able to quickly add a retry on fail, or full failure report hook without a change to the prod environment.
Whether this just sits in your teams template (hooks list) and turned off by default, or the framework allows it to be enabled with a flag isn't a big differnence to me. Both require the pre-thought that we may want some extra information from PROD at some point. Both solutions require me to reach into PROD to install the hook, making the code change to add it to the list is not a big deal. note my PROD env is using kedro-docker, other folks with a different prod solution may have a different opinion.
It feels like the opt-in / opt-out choice really just depends on how easy it is to opt out.
Agree that this is of paramount importance.
But they also have another environment variable to specify which plugins to use (or just a variable), so maybe something similar can be mirrored in Kedro
Would be happy with this. It's the current way of specifying hooks plus an alternative solution to this issue, which is to specify the same (almost same, as it seems to overwrite rather than append) list as an environment variable.
(itād be different conf environments for different plugin combinations, so for your benchmarking example you'd have 3 environments).
Eeeh. I don't think I like only depending on this because Kedro only supports one level of inheritance in conf. If I already have a dev
and prod
env, and I want to run my benchmarks for dev
, I need to create dev-hooklist1
, dev-hooklist2
, etc.? Maybe this is just a separate issue, since struggles with conf inheritance aren't unique to this use case.
Thereās a model in my head that makes me quite reticent about feeding hooks through the CLI, where itās not a developer that messes with the job spec, so the less technical it is, or the less code knowledge you need to understand it, the better. The person on support doesnāt need to understand what hooks to activate, they can just switch to a different environment and job done, if the env is there. If not, a developer should create it, which feels intentional.
From a support perspective, this makes sense. The current implementation only allows one set of hooks to be defined across envs, and your aforementioned suggestion to have hooks defined on a per-env basis makes this more flexible.
However, from a dev perspective, I want more control. Creating a new env mirroring a non-base
env isn't trivial, so the process is heavy and annoying. On our last project, we were basically replacing files in config and reloading to get around this (despite already having a lot of copy-paste config at higher levels).
Can we support both? It's not like the extra_hooks
implementation pattern is foreign to Kedro, as it's similar to how extra_params
are handled.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Auto-registration for plugin hooks is one thing, but imagine there's a pipeline with some great-validation
checks in-between different steps, and maybe some additional profiling of the step's output data ā it would be extremely useful to switch those on and off via parameters of the kedro run
command, just like in the original post.
@foxale I just happened to see the email notification for this, but do you either want to reopen this or create a more targeted issue for your use case (and link back here, if necessary)?
Description
Is your feature request related to a problem? A clear and concise description of what the problem is: "I'm always frustrated when ..."
I should be able to run with or without a set of optional hooks. For example:
or
Context
Why is this change important to you? How would you use it? How can it benefit other users?
Not all hooks are core to functionality. Using something like a debugging hook shouldn't require a code change.
This is also helpful for testing different hooks as part of a CI/CD process.
Possible Implementation
See https://github.com/deepyaman/hookshot/commit/0f792f9112031fc2642d30e161c21a0b0c9f539d#diff-ebf803a458716ccda133fadc42e45057 (would add it to
KedroContext
though).