kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Universal Kedro deployment (Part 3) - Add the ability to extend and distribute the project running logic #1041

Galileo-Galilei opened this issue 2 years ago

Galileo-Galilei commented 2 years ago

Preamble

This is the third part of my series of design documents on refactoring Kedro to make deployment easier:

Defining the feature: Modifying the running logic and distributing the modifier

Current state of Kedro's extensibility

There are currently several ways to extend Kedro natively, described hereafter:

| What is extended | Use case examples | Kedro object | Registration | Popularity |
| --- | --- | --- | --- | --- |
| Pipeline execution at runtime | - change catalog entries on the fly (cache data, change git branch...)<br>- log data remotely (mlflow, neptune, dolt, store kedro-viz static files...) | Hooks (Pipeline, node) | - via an entrypoint<br>- OR manual declaration in settings.py | High: a quick github search shows that many users use hooks to add custom logic at runtime |
| CLI command | - create a configuration file<br>- profile a catalog entry<br>- convert a kedro pipeline to an orchestrator<br>- visualize the pipeline in a web browser… | Plugin click commands | - via an entrypoint | Medium: seems to be a more advanced use, mainly for plugin developers |
| Data source connection | - create a dataset which can connect to a new data source unsupported by kedro (GBQ, HDF, sklearn pipelines, Databricks, Stata, redis… are the most recent ones) | AbstractDataSet | As a module, which can be imported by its path in the DataCatalog | High: a quick search in Kedro's past issues shows that it is a very common request from users who need to connect to specific data sources |
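For reference, the "manual declaration in settings.py" route mentioned in the table is a one-liner; a minimal sketch, where ProjectHooks stands for any user-defined hook implementation:

```python
# settings.py -- manual hook registration, as listed in the table above.
# ``ProjectHooks`` is an illustrative user-defined hook class, not a kedro builtin.
from my_project.hooks import ProjectHooks

HOOKS = (ProjectHooks(),)
```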

Use cases not covered by previous mechanisms

However, I've encountered a bunch of use cases where people want to extend the running logic (=how to run the pipeline) rather than the execution logic (=how the pipeline behaves during runtime, which is achieved by hooks). Some examples include:

  1. Running the entire pipeline several times (e.g. with different sets of parameters for hyperparameter tuning: https://github.com/quantumblacklabs/kedro/issues/282#issuecomment-768111744, https://github.com/quantumblacklabs/kedro/discussions/948, https://github.com/Galileo-Galilei/kedro-mlflow/issues/246)
  2. Preparing a conda environment in a separate process before running the pipeline, to ensure environment consistency (this is very similar to what "mlflow projects" do)
  3. Performing "CI-like checks" (lint...) before running the pipeline, especially when launching a very long pipeline (this is very similar to what "mlflow projects" do)
  4. Forcing unstaged changes to be committed, to ensure reproducibility (this is very similar to what "mlflow projects" do)
  5. Exposing the pipeline as an API once it has finished running (this could be a convenient way to serve a Kedro pipeline)
  6. If we offer the community the ability to distribute such changes, I'm pretty sure other use cases will arise 😃

These are real-life use cases which cannot be achieved with hooks, because we want to perform operations outside of a KedroSession.

Current workaround pros and cons analysis

Actually, I can think of two ways to achieve the previous use cases in Kedro:

These solutions have serious drawbacks:

The best workflow I could come up with to implement such "running logic" changes is the following:

So I can at least reuse my custom runner in other projects by importing it and modifying the other project's cli.py, which is not very convenient.
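For illustration, such a custom runner can be quite small; here is a minimal sketch against the AbstractRunner/SequentialRunner API (LoggingRunner and its messages are hypothetical, not an existing kedro class):

```python
# A minimal sketch of a custom runner wrapping extra "running logic" around a run.
# ``LoggingRunner`` is a hypothetical name, not part of kedro.
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline
from kedro.runner import SequentialRunner


class LoggingRunner(SequentialRunner):
    """Runs the pipeline sequentially, with custom logic before and after."""

    def _run(self, pipeline: Pipeline, catalog: DataCatalog, run_id: str = None) -> None:
        # pre-run logic (git checks, environment setup, ...) would go here
        print(f"About to run {len(pipeline.nodes)} nodes")
        super()._run(pipeline, catalog, run_id=run_id)
        # post-run logic (serving the pipeline, cleanup, ...) would go here
        print("Run finished")
```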

Potential solutions:

A short-term solution: Injecting the runner class at runtime

Actually, kedro already seems to have all the building blocks needed to create custom running logic and choose it at runtime: the run command and the AbstractRunner class.

The main drawback is that we can't easily distribute this logic to other users. I suggest modifying the default run command so that the runner can be specified flexibly at runtime by its path, with a logic similar to custom DataSets in the DataCatalog.

https://github.com/quantumblacklabs/kedro/blob/c2c984a260132cdb9c434099485eae05707ad116/kedro/framework/cli/project.py#L351-L392

```diff
def run(
    tag,
    env,
    parallel,
    runner,
    is_async,
    node_names,
    to_nodes,
    from_nodes,
    from_inputs,
    to_outputs,
    load_version,
    pipeline,
    config,
    params,
):
    """Run the pipeline."""
    if parallel and runner:
        raise KedroCliError(
            "Both --parallel and --runner options cannot be used together. "
            "Please use either --parallel or --runner."
        )
    runner = runner or "SequentialRunner"
    if parallel:
        runner = "ParallelRunner"

+   runner_prefix = "kedro.runner" if runner in {"SequentialRunner", "ParallelRunner", "ThreadRunner"} else ""
+   runner_class = load_obj(runner, runner_prefix)  # possibly "import settings" and load the runner configuration from a config file to enable parameterization?
-   runner_class = load_obj(runner, "kedro.runner")

    tag = _get_values_as_tuple(tag) if tag else tag
    node_names = _get_values_as_tuple(node_names) if node_names else node_names

    with KedroSession.create(env=env, extra_params=params) as session:
        session.run(
            tags=tag,
            runner=runner_class(is_async=is_async),
            node_names=node_names,
            from_nodes=from_nodes,
            to_nodes=to_nodes,
            from_inputs=from_inputs,
            to_outputs=to_outputs,
            load_versions=load_version,
            pipeline_name=pipeline,
    )
```
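With such a change, a runner distributed by a plugin could be selected directly from the CLI by its import path, e.g. (nice_plugin is an illustrative package name, reused in the examples below):

```bash
kedro run --runner=nice_plugin.runner.ServiceRunner
```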

Advantages for kedro users:

Towards more flexibility: configure runners in a configuration file

The previous solution does not make it possible to inject additional parameters into the runner. It currently "feels" poorly managed (there are "if" conditions inside the run command to check whether a parameter can be used with the given runner or not...). A solution could be a runner.yml file behaving in a catalog-like way to enable parametrization. It would also make it possible to use the same runner with different parameters. Such a file could look like this:

```yaml
# runner.yml

my_parallel_runner_async:
    type: ParallelRunner
    is_async: True

my_service_runner:
    type: nice_plugin.runner.ServiceRunner
    host: 127.0.0.1
    port: 5000

my_service_runner2:
    type: nice_plugin.runner.ServiceRunner
    host: 127.0.0.1
    port: 5001
```

And the run command could resolve a name in this RunnerCatalog and use it in the following fashion:

```bash
kedro run --pipeline=my_pipeline --runner=my_service_runner2
```
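For illustration, the resolution could look like the sketch below; the resolve_runner helper and the config path are assumptions for this sketch, not existing kedro API:

```python
# A sketch of resolving ``--runner`` against runner.yml.
# ``resolve_runner`` and the config location are illustrative, not kedro API.
import yaml

from kedro.utils import load_obj


def resolve_runner(name: str, runner_config_path: str = "conf/base/runner.yml"):
    with open(runner_config_path) as f:
        runner_catalog = yaml.safe_load(f) or {}

    if name in runner_catalog:
        # a named entry: instantiate its ``type`` with the remaining keys as kwargs
        entry = dict(runner_catalog[name])
        runner_class = load_obj(entry.pop("type"), "kedro.runner")
        return runner_class(**entry)

    # fall back to today's behaviour: treat the name as a (possibly dotted) class path
    return load_obj(name, "kedro.runner")()
```

With the runner.yml above, `resolve_runner("my_service_runner2")` would return a `nice_plugin.runner.ServiceRunner` instance configured with `host="127.0.0.1"` and `port=5001`.
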
datajoely commented 2 years ago

Hey @Galileo-Galilei I haven't had time to read this in detail - but at a high level, the runner is something we want to completely rewrite; because it works well enough today, it's not an immediate priority. We recently had a hack sprint where @jiriklein on our team worked on this topic, from a slightly different perspective.

antonymilne commented 2 years ago

Hello @Galileo-Galilei! Firstly let me say thank you very much for all your thoughts on this and for writing it all up so carefully. Your three posts on Universal Kedro deployment really are a tour de force 🤯 I often go back to re-read and ponder them. There's clearly been a huge amount of time and effort put into it, and it is much appreciated!

Now the good news is that your short term solution of injecting a custom runner class at runtime is actually already possible. I'm not sure whether this is by design or a happy accident - when I first found that this was possible the team were a bit surprised, and so I wrote it up in the docs. It's not a widely known feature anyway.

Why does this already work? It boils down to this line in load_obj, which means that default_obj_path (here "kedro.runner") is ignored whenever obj_path already contains at least one `.`. So your example --runner=kedro_mlflow.runner.MlflowRunner will already load up the right module, while your example --runner=ServiceRunner would look in kedro.runner.
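For reference, a simplified sketch of that behaviour (not kedro's exact source):

```python
# Simplified sketch of kedro.utils.load_obj's dotted-path resolution.
import importlib
from typing import Any


def load_obj(obj_path: str, default_obj_path: str = "") -> Any:
    obj_path_list = obj_path.rsplit(".", 1)
    # when obj_path contains a ".", everything before the last dot becomes the
    # module path and default_obj_path is ignored entirely
    module_path = obj_path_list.pop(0) if len(obj_path_list) > 1 else default_obj_path
    obj_name = obj_path_list[0]
    return getattr(importlib.import_module(module_path), obj_name)
```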


Some more food for thought: some of our deployment targets (at least AWS batch and Dask) rely on writing a new def run command which includes custom logic to do with the runner. It would be nice if whatever solution we end up with here simplifies these also.

antonymilne commented 2 years ago

I've given this a bit more thought while reviewing https://github.com/kedro-org/kedro/pull/1248/, and I think what you say here makes a lot of sense.

I have a couple of suggested modifications to your proposal. A runner.yml file makes sense, but to me it seems most natural that this would live in conf/env/ and probably just specify a single runner per environment. If you want to configure multiple runners then you can do so using different environments. e.g. your example code would become

```yaml
# conf/env1/runner.yml
type: ParallelRunner
is_async: True

# conf/env2/runner.yml
type: nice_plugin.runner.ServiceRunner
host: 127.0.0.1
port: 5000

# conf/env3/runner.yml
type: nice_plugin.runner.ServiceRunner
host: 127.0.0.1
port: 5001
```
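With this layout, choosing a runner reduces to choosing an environment, e.g.:

```bash
kedro run --env=env2  # picks up conf/env2/runner.yml, i.e. the ServiceRunner on port 5000
```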

My thinking here is:

@Galileo-Galilei what do you think of this? It seems to me there are maybe 3 options here along the lines we're thinking:

  1. Your initial proposal, from what I understand: one global runner.yml file that specifies multiple runners, and you select your runner through --runner. You can specify env and runner independently when you do a kedro run
  2. My above proposal: one runner.yml file in each environment that specifies a single runner. You can specify only env in kedro run, and this determines the runner. This means that if all you want to change is the runner then you need to make a whole new env just to specify that, which is maybe not nice
  3. A combination of 1+2: one runner.yml file in each environment that specifies multiple runners. You can specify env and runner independently when you do a kedro run. This is overall the most powerful since it combines the advantages of config loader hierarchy and different envs but also allows multiple different runners per environment.

Any of these options would make the AWS and Dask runner configuration that I mentioned above much more elegant; there would no longer be a need to create custom run commands for them but instead just a new entry in the runner config.

Galileo-Galilei commented 2 years ago

Hi @AntonyMilneQB,

Sorry for the long response delay, and thanks for taking the time to answer. I'll try to answer all your points:

Your three posts on Universal Kedro deployment really are a tour de force 🤯 I often go back to re-read and ponder them. There's clearly been a huge amount of time and effort put into it, and it is much appreciated!

Thank you very much. It is always hard to clearly identify what I want to cover in the issues (and somehow figuring out what I do not cover is sometimes even harder), to digest my trials and errors while deploying kedro projects, and to create something understandable and hopefully useful for the community. Glad to see it gives you food for thought, even if you do not end up implementing it in the core library.

Now the good news is that your short term solution of injecting a custom runner class at runtime is actually already possible. I'm not sure whether this is by design or a happy accident [...] So your example --runner=kedro_mlflow.runner.MlflowRunner will already load up the right module.

I actually figured this out after I wrote this issue, and I was afraid I had written it up for no reason 😅 But it turns out that it is not currently possible to pass arguments to the constructor of such a custom runner, which makes this solution hardly usable in practice.

Some more food for thought: some of our deployment targets (at least AWS batch and https://github.com/kedro-org/kedro/pull/1248/) rely on writing a new def run command which includes custom logic to do with the runner.

I am aware of this, and I really advocate against overriding the run command in my team. The main issues described above (lack of composability and flexibility, difficulties for maintenance and distribution) are really a major concern if you have several plugins overriding the run command with a high conflict risk. But I acknowledge it is the best current solution.

I have a couple of suggested modifications to your proposal. A runner.yml file makes sense, but to me it seems most natural that this would live in conf/env/

While reading my own issue again, I found out that I did not mention this, but in my mind it is completely natural that runner.yml lives inside conf/<env>. This is what I implied when writing that it behaves in a "catalog-like way", for the reasons you mention:

It seems to me there are maybe 3 options here along the lines we're thinking:

  1. [...] one global runner.yml file that specifies multiple runners, and you select your runner through --runner
  2. [...] one runner.yml file in each environment that specifies a single runner. You can specify only env in kedro run.
  3. [...] A combination of 1+2: one runner.yml file in each environment that specifies multiple runners. You can specify env and runner independently when you do a kedro run.

Given the clarification above, we both agree that option 1 is not the right solution. Of the two remaining options, I would be inclined to pick option 3. Option 2 seems very elegant (one could argue that getting rid of an argument in the run command makes the user experience easier here), but it feels restrictive.

It is quite common for me to have different environments (e.g. a "debug" one which persists all intermediary datasets, and an "integration" one where I log some datasets remotely instead of locally). I want to be able to run the project in different fashions within the same environment: at least the usual "run" while developing; a "prod-like" runner which recreates a virtual environment, to test locally how the project would behave in the CI; and a "service" runner which serves the pipeline instead of running it, to test the project as an API locally. It would increase the maintenance burden if I had to duplicate my environments each time I want to run the project with a different runner.