Open Galileo-Galilei opened 2 years ago
Hey @Galileo-Galilei I haven't had time to read this in detail - but high level the runner is something we want to completely rewrite, but because it works enough today it's not an immediate priority. We recently had a hack sprint on where @jiriklein on our team worked on this topic, from a slightly different perspective.
Hello @Galileo-Galilei! Firstly let me say thank you very much for all your thoughts on this and for writing it all up so carefully. Your three posts on Universal Kedro deployment really are a tour de force 🤯 I often go back to re-read and ponder them. There's clearly been a huge amount of time and effort put into it, and it is much appreciated!
Now the good news is that your short term solution of injecting a custom runner
class at runtime is actually already possible. I'm not sure whether this is by design or a happy accident - when I first found that this was possible the team were a bit surprised, and so I wrote it up in the docs. It's not a widely known feature anyway.
Why does this already work? It boils down to this line in load_obj, which means that default_obj_path
(here "kedro.runner"
) is ignored whenever obj_path
already contains at least one .
. So your example --runner=kedro_mlflow.runner.MlflowRunner
will already load up the right module. Your example --runner=ServiceRunner
would look in kedro.runner
.
Some more food for thought: some of our deployment targets (at least AWS batch and Dask) rely on writing a new def run
command which includes custom logic to do with the runner. It would be nice if whatever solution we end up with here simplifies these also.
I've given this a bit more thought while reviewing https://github.com/kedro-org/kedro/pull/1248/, and I think what you say here makes a lot of sense.
I have a couple of suggested modifications to your proposal. A runner.yml
file makes sense, but to me it seems most natural that this would live in conf/env/
and probably just specify a single runner per environment. If you want to configure multiple runners then you can do so using different environments. e.g. your example code would become
# conf/env1/runner.yml
type: ParallelRunner
is_async: True
# conf/env2/runner.yml
type: nice_plugin.runner.ServiceRunner
host: 127.0.0.1
port: 5000
# conf/env3/runner.yml
type: nice_plugin.runner.ServiceRunner
host: 127.0.0.1
port: 5001
My thinking here is:
--runner
(and associated --async
) flags from kedro run
, and instead runner is specified in the usual way of specifying a run environment, i.e. kedro run -e env
@Galileo-Galilei what do you think of this? It seems to me there are maybe 3 options here along the lines we're thinking:
runner.yml
file that specifies multiple runners, and you select your runner through --runner
. You can specify env
and runner
independently when you do a kedro run
runner.yml
file in each environment that specifies a single runner. You can specify only env
in kedro run
, and this determines the runner. This means that if all you want to change is the runner then you need to make a whole new env
just to specify that, which is maybe not nicerunner.yml
file in each environment that specifies multiple runners. You can specify env
and runner
independently when you do a kedro run
. This is overall the most powerful since it combines the advantages of config loader hierarchy and different env
s but also allows multiple different runners per environment.Any of these options would make the AWS and Dask runner configuration that I mentioned above much more elegant; there would no longer be a need to create custom run
commands for them but instead just a new entry in the runner config.
Hi @AntonyMilneQB,
sorry for the long response delay and thanks for taking time to answer. I'll try to anwser to all your points:
Your three posts on Universal Kedro deployment really are a tour de force 🤯 I often go back to re-read and ponder them. There's clearly been a huge amount of time and effort put into it, and it is much appreciated!
Thank you very much. It is always hard to identify clearly what I want to cover in the issues (and somehow figuring what I do not cover is sometimes even harder), to digest my trials and errors while deploying kedro projects and to create something understandable and hopefully useful for the community. Glad to see you find it give you food for thoughts, even if you do not end up implementing it in the core library.
Now the good news is that your short term solution of injecting a custom runner class at runtime is actually already possible. I'm not sure whether this is by design or a happy accident [...] So your example --runner=kedro_mlflow.runner.MlflowRunner will already load up the right module.
I actually figured this out after I wrote this issue and I was afraid I wrote this up for no reason😅 But it turns that it is not currently possible to pass arguments to the constructor of such a custom runner, which makes this solution hardly usable in practice.
Some more food for thought: some of our deployment targets (at least AWS batch and https://github.com/kedro-org/kedro/pull/1248/) rely on writing a new def run command which includes custom logic to do with the runner.
I am aware of this, and I really advocate against overriding the run
command in my team. The main issues described above (lack of composability and flexibility, difficulties for maintenance and distribution) are really a major concern if you have several plugins overriding the run
command with a high conflict risk. But I acknowledge it is the best current solution.
I have a couple of suggested modifications to your proposal. A runner.yml file makes sense, but to me it seems most natural that this would live in conf/env/
While reading again my own, issue, i found out that I did not mention this, but in my mind this is completly natural that the runner.yml
lives inside the conf/<env>
. This is what I implied when writing that it behaves in a "catalog-like way" for the reasons you mention:
It seems to me there are maybe 3 options here along the lines we're thinking:
- [...] one global runner.yml file that specifies multiple runners, and you select your runner through --runner
- [...] one runner.yml file in each environment that specifies a single runner. You can specify only env in kedro run.
- [...] A combination of 1+2: one runner.yml file in each environment that specifies multiple runners. You can specify env and runner independently when you do a kedro run.
Given clarification above, we both agree that 1. is not the right solution. Regarding the 2 remaining solution, I would be inclined to pick up solution 3. The solution 2 seems very elegant (we can argue that getting rid of an argument in the run command may makes user experience easier here), but it feels restrictive.
This is quite common for me to have different environment (e.g. a "debug" one which persists all intermediary datasets, and an "integration" one where I log remotely some datasets instead of logging them locally). I want to be able to run the project in different fashion for the same environment (at least the usual "run" while developing; a "prod like" runner which recreates a virtual environment to test locally how the project would behave in the CI; a "service" runner which serves the pipeline instead of running it to test the project as an API locally). It would increase the maintenance burden if I have to duplicate my environments each time I want to run the project with a different runner.
Preamble
This is the third part of my serie of design documentation on refactoring Kedro to make deployment easier:
DataCatalog
entries which have a compute/storage backend different than "python / in memory operations" (including SQl, Spark...).KedroSession
.Defining the feature: Modifying the running logic and distribute the modifier
Current state of Kedro's extensibility
There are currently several ways to extend Kedro natively, described hereafter:
- log data remotely (mlflow, neptune, dolt, store kedro-viz static files...)
- OR manual declaration in settings.py
- profile a catalog entry
- convert a kedro pipeline to an orchestrator
- visualize the pipeline in a webrowser…
Use cases not covered by previous mechanisms
However, I've encountered a bunch of use case where people want to extend the running logic (=how to run the pipeline) rather than of the execution logic (=how the pipeline behaves during runtime, which is achieved by hooks). Some examples includes:
These are real life use-cases which cannot be achieved by hooks because we want to perform operations outside of a
KedroSession
.Current workaround pros and cons analysis
Actually, I can think of two ways to achieve previous use cases in Kedro:
cli.py:run
commmand at the project level (or in a plugin) with custom logicAbstractRunner
which contains the execution logic and manually inject it in yourcli.py
at the project level.These solutions have strong issues:
run
from another project or plugin, you have to recode everything at the project level. At least therunner
solution enable to compose logics through inheritance, but it is not easy to maintain.pip install
it and benefits from the new logic; howewever you have to give up the possibility to extend your own cli at the project level; even worse, plugin order import can lead to inconsistent behaviour if several plugins implements a run command.run
command is running in case of concurrrent overriding of the command, it can obfuscate a lot running errors.run
command and the custom one (e.g. you want to run your pipeline normally most of the time while developping, and have another logic sometimes (e.g. one of the ones described above).The best workflow I could came up with to implement such "running logic" changes is the following:
AbstractRunner
cli.py
on a per project basis to use my custom runnerkedro run
.So I can at least reuse my custom
runner
in other projects by importing them and modifying the other projectcli.py
, which is not very convenient.Potential solutions:
A short term solution: Injecting the
runner
class at runtimeActually, kedro seems to have all the important
elementary bricks
to create custom running logic and choose it at runtime: therun
command and theAbstractRunner
class.The main default is that we can't easility distribute this logic to other users. I suggest to modify the default
run
command to be able to flexibly specify the runner at runtime with a similar logic as customDataSet
in theDataCatalog
by specifying its path.https://github.com/quantumblacklabs/kedro/blob/c2c984a260132cdb9c434099485eae05707ad116/kedro/framework/cli/project.py#L351-L392
Advantages for kedro users:
Towards more flexibility: configure runners in a configuration file
The previous solution does not enable to inject additional parameters to the runner. It "feels" currently poorly managed (there are "if condition" inside the run command to check wether a parameter can be used with the given runner or not...). A solution could be to have a
runner.yml
file behaving in a catalog-like way to enable parametrization. it would also enable to use the same runner with different parameters. Such a file could look like this:And the
run
command could resolve a name in thisRunnerCatalog
and use it in the following fashion: