kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Universal Kedro deployment (Part 3) - Add the ability to extend and distribute the project running logic #1041

Galileo-Galilei opened this issue 2 years ago

Galileo-Galilei commented 2 years ago

Preamble

This is the third part of my series of design documents on refactoring Kedro to make deployment easier:

Defining the feature: Modifying the running logic and distributing the modifier

Current state of Kedro's extensibility

There are currently several ways to extend Kedro natively, described hereafter:

| What is extended | Use case examples | Kedro object | Registration | Popularity |
| --- | --- | --- | --- | --- |
| Pipeline execution at runtime | - change catalog entries on the fly (cache data, change git branch...)<br>- log data remotely (mlflow, neptune, dolt, store kedro-viz static files...) | Hooks (Pipeline, node) | - via an entrypoint<br>- OR manual declaration in settings.py | High: a quick github search shows that many users use hooks to add custom logic at runtime |
| CLI command | - create a configuration file<br>- profile a catalog entry<br>- convert a kedro pipeline to an orchestrator<br>- visualize the pipeline in a web browser… | Plugin click commands | - via an entrypoint | Medium: seems to be a more advanced use, mainly for plugin developers |
| Data source connection | - create a dataset which can connect to a new data source unsupported by kedro (GBQ, HDF, sklearn pipelines, Databricks, Stata, redis… are the most recent ones) | AbstractDataSet | As a module, which can be imported by its path in the DataCatalog | High: a quick search in Kedro's past issues shows that it is a very common request from users who need to connect to specific data sources |
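For reference, the "manual declaration in settings.py" route mentioned in the table is a one-liner; a minimal sketch, where ProjectHooks stands for any user-defined hook implementation:

```python
# settings.py -- manual hook registration, as listed in the table above.
# ``ProjectHooks`` is an illustrative user-defined hook class, not a kedro builtin.
from my_project.hooks import ProjectHooks

HOOKS = (ProjectHooks(),)
```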

Use cases not covered by previous mechanisms

However, I've encountered a bunch of use cases where people want to extend the running logic (=how to run the pipeline) rather than the execution logic (=how the pipeline behaves during runtime, which is achieved by hooks). Some examples include:

  1. Running the entire pipeline several times (e.g. with different sets of parameters for hyperparameter tuning: https://github.com/quantumblacklabs/kedro/issues/282#issuecomment-768111744, https://github.com/quantumblacklabs/kedro/discussions/948, https://github.com/Galileo-Galilei/kedro-mlflow/issues/246)
  2. Preparing a conda environment in a separate process before running the pipeline, to ensure environment consistency (this is very similar to what "mlflow projects" do)
  3. Performing "CI-like checks" (lint...) before running the pipeline, especially when launching a very long pipeline (this is very similar to what "mlflow projects" do)
  4. Forcing unstaged changes to be committed, to ensure reproducibility (this is very similar to what "mlflow projects" do)
  5. Exposing the pipeline as an API once it has finished running (this could be a convenient way to serve a Kedro pipeline)
  6. If we offer the community the ability to distribute such changes, I'm pretty sure other use cases will arise 😃

These are real-life use cases which cannot be achieved with hooks, because we want to perform operations outside of a KedroSession.

Current workaround pros and cons analysis

Actually, I can think of two ways to achieve the previous use cases in Kedro:

These solutions have serious drawbacks:

The best workflow I could come up with to implement such "running logic" changes is the following:

So I can at least reuse my custom runner in other projects by importing it and modifying the other project's cli.py, which is not very convenient.
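For illustration, such a custom runner can be quite small; here is a minimal sketch against the AbstractRunner/SequentialRunner API (LoggingRunner and its messages are hypothetical, not an existing kedro class):

```python
# A minimal sketch of a custom runner wrapping extra "running logic" around a run.
# ``LoggingRunner`` is a hypothetical name, not part of kedro.
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline
from kedro.runner import SequentialRunner


class LoggingRunner(SequentialRunner):
    """Runs the pipeline sequentially, with custom logic before and after."""

    def _run(self, pipeline: Pipeline, catalog: DataCatalog, run_id: str = None) -> None:
        # pre-run logic (git checks, environment setup, ...) would go here
        print(f"About to run {len(pipeline.nodes)} nodes")
        super()._run(pipeline, catalog, run_id=run_id)
        # post-run logic (serving the pipeline, cleanup, ...) would go here
        print("Run finished")
```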

Potential solutions:

A short-term solution: Injecting the runner class at runtime

Actually, kedro already seems to have all the building blocks needed to create custom running logic and choose it at runtime: the run command and the AbstractRunner class.

The main drawback is that we can't easily distribute this logic to other users. I suggest modifying the default run command so that the runner can be specified flexibly at runtime by its path, with a logic similar to custom DataSets in the DataCatalog.

https://github.com/quantumblacklabs/kedro/blob/c2c984a260132cdb9c434099485eae05707ad116/kedro/framework/cli/project.py#L351-L392

```diff
def run(
    tag,
    env,
    parallel,
    runner,
    is_async,
    node_names,
    to_nodes,
    from_nodes,
    from_inputs,
    to_outputs,
    load_version,
    pipeline,
    config,
    params,
):
    """Run the pipeline."""
    if parallel and runner:
        raise KedroCliError(
            "Both --parallel and --runner options cannot be used together. "
            "Please use either --parallel or --runner."
        )
    runner = runner or "SequentialRunner"
    if parallel:
        runner = "ParallelRunner"

+   runner_prefix = "kedro.runner" if runner in {"SequentialRunner", "ParallelRunner", "ThreadRunner"} else ""
+   runner_class = load_obj(runner, runner_prefix)  # possibly "import settings" and load the runner configuration from a config file to enable parameterization?
-   runner_class = load_obj(runner, "kedro.runner")

    tag = _get_values_as_tuple(tag) if tag else tag
    node_names = _get_values_as_tuple(node_names) if node_names else node_names

    with KedroSession.create(env=env, extra_params=params) as session:
        session.run(
            tags=tag,
            runner=runner_class(is_async=is_async),
            node_names=node_names,
            from_nodes=from_nodes,
            to_nodes=to_nodes,
            from_inputs=from_inputs,
            to_outputs=to_outputs,
            load_versions=load_version,
            pipeline_name=pipeline,
    )
```
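With such a change, a runner distributed by a plugin could be selected directly from the CLI by its import path, e.g. (nice_plugin is an illustrative package name, reused in the examples below):

```bash
kedro run --runner=nice_plugin.runner.ServiceRunner
```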

Advantages for kedro users:

Towards more flexibility: configure runners in a configuration file

The previous solution does not make it possible to inject additional parameters into the runner. It currently "feels" poorly managed (there are "if" conditions inside the run command to check whether a parameter can be used with the given runner or not...). A solution could be a runner.yml file behaving in a catalog-like way to enable parametrization. It would also make it possible to use the same runner with different parameters. Such a file could look like this:

```yaml
# runner.yml

my_parallel_runner_async:
    type: ParallelRunner
    is_async: True

my_service_runner:
    type: nice_plugin.runner.ServiceRunner
    host: 127.0.0.1
    port: 5000

my_service_runner2:
    type: nice_plugin.runner.ServiceRunner
    host: 127.0.0.1
    port: 5001
```

And the run command could resolve a name in this RunnerCatalog and use it in the following fashion:

```bash
kedro run --pipeline=my_pipeline --runner=my_service_runner2
```
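For illustration, the resolution could look like the sketch below; the resolve_runner helper and the config path are assumptions for this sketch, not existing kedro API:

```python
# A sketch of resolving ``--runner`` against runner.yml.
# ``resolve_runner`` and the config location are illustrative, not kedro API.
import yaml

from kedro.utils import load_obj


def resolve_runner(name: str, runner_config_path: str = "conf/base/runner.yml"):
    with open(runner_config_path) as f:
        runner_catalog = yaml.safe_load(f) or {}

    if name in runner_catalog:
        # a named entry: instantiate its ``type`` with the remaining keys as kwargs
        entry = dict(runner_catalog[name])
        runner_class = load_obj(entry.pop("type"), "kedro.runner")
        return runner_class(**entry)

    # fall back to today's behaviour: treat the name as a (possibly dotted) class path
    return load_obj(name, "kedro.runner")()
```

With the runner.yml above, `resolve_runner("my_service_runner2")` would return a `nice_plugin.runner.ServiceRunner` instance configured with `host="127.0.0.1"` and `port=5001`.
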
datajoely commented 2 years ago

Hey @Galileo-Galilei I haven't had time to read this in detail - but at a high level, the runner is something we want to completely rewrite; because it works well enough today, it's not an immediate priority. We recently had a hack sprint where @jiriklein on our team worked on this topic, from a slightly different perspective.

antonymilne commented 2 years ago

Hello @Galileo-Galilei! Firstly let me say thank you very much for all your thoughts on this and for writing it all up so carefully. Your three posts on Universal Kedro deployment really are a tour de force 🤯 I often go back to re-read and ponder them. There's clearly been a huge amount of time and effort put into it, and it is much appreciated!

Now the good news is that your short term solution of injecting a custom runner class at runtime is actually already possible. I'm not sure whether this is by design or a happy accident - when I first found that this was possible the team were a bit surprised, and so I wrote it up in the docs. It's not a widely known feature anyway.

Why does this already work? It boils down to this line in load_obj, which means that default_obj_path (here "kedro.runner") is ignored whenever obj_path already contains at least one `.`. So your example --runner=kedro_mlflow.runner.MlflowRunner will already load up the right module, while your example --runner=ServiceRunner would look in kedro.runner.
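For reference, a simplified sketch of that behaviour (not kedro's exact source):

```python
# Simplified sketch of kedro.utils.load_obj's dotted-path resolution.
import importlib
from typing import Any


def load_obj(obj_path: str, default_obj_path: str = "") -> Any:
    obj_path_list = obj_path.rsplit(".", 1)
    # when obj_path contains a ".", everything before the last dot becomes the
    # module path and default_obj_path is ignored entirely
    module_path = obj_path_list.pop(0) if len(obj_path_list) > 1 else default_obj_path
    obj_name = obj_path_list[0]
    return getattr(importlib.import_module(module_path), obj_name)
```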


Some more food for thought: some of our deployment targets (at least AWS batch and Dask) rely on writing a new def run command which includes custom logic to do with the runner. It would be nice if whatever solution we end up with here simplifies these also.

antonymilne commented 2 years ago

I've given this a bit more thought while reviewing https://github.com/kedro-org/kedro/pull/1248/, and I think what you say here makes a lot of sense.

I have a couple of suggested modifications to your proposal. A runner.yml file makes sense, but to me it seems most natural that this would live in conf/env/ and probably just specify a single runner per environment. If you want to configure multiple runners then you can do so using different environments. e.g. your example code would become

```yaml
# conf/env1/runner.yml
type: ParallelRunner
is_async: True

# conf/env2/runner.yml
type: nice_plugin.runner.ServiceRunner
host: 127.0.0.1
port: 5000

# conf/env3/runner.yml
type: nice_plugin.runner.ServiceRunner
host: 127.0.0.1
port: 5001
```
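With this layout, choosing a runner reduces to choosing an environment, e.g.:

```bash
kedro run --env=env2  # picks up conf/env2/runner.yml, i.e. the ServiceRunner on port 5000
```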

My thinking here is:

@Galileo-Galilei what do you think of this? It seems to me there are maybe 3 options here along the lines we're thinking:

  1. Your initial proposal, from what I understand: one global runner.yml file that specifies multiple runners, and you select your runner through --runner. You can specify env and runner independently when you do a kedro run
  2. My above proposal: one runner.yml file in each environment that specifies a single runner. You can specify only env in kedro run, and this determines the runner. This means that if all you want to change is the runner then you need to make a whole new env just to specify that, which is maybe not nice
  3. A combination of 1+2: one runner.yml file in each environment that specifies multiple runners. You can specify env and runner independently when you do a kedro run. This is overall the most powerful since it combines the advantages of config loader hierarchy and different envs but also allows multiple different runners per environment.

Any of these options would make the AWS and Dask runner configuration that I mentioned above much more elegant; there would no longer be a need to create custom run commands for them but instead just a new entry in the runner config.

Galileo-Galilei commented 2 years ago

Hi @AntonyMilneQB,

Sorry for the long response delay, and thanks for taking the time to answer. I'll try to answer all your points:

Your three posts on Universal Kedro deployment really are a tour de force 🤯 I often go back to re-read and ponder them. There's clearly been a huge amount of time and effort put into it, and it is much appreciated!

Thank you very much. It is always hard to clearly identify what I want to cover in the issues (and somehow figuring out what I do not cover is sometimes even harder), to digest my trials and errors while deploying kedro projects, and to create something understandable and hopefully useful for the community. Glad to see it gives you food for thought, even if you do not end up implementing it in the core library.

Now the good news is that your short term solution of injecting a custom runner class at runtime is actually already possible. I'm not sure whether this is by design or a happy accident [...] So your example --runner=kedro_mlflow.runner.MlflowRunner will already load up the right module.

I actually figured this out after I wrote this issue, and I was afraid I had written it up for no reason 😅 But it turns out that it is not currently possible to pass arguments to the constructor of such a custom runner, which makes this solution hardly usable in practice.

Some more food for thought: some of our deployment targets (at least AWS batch and https://github.com/kedro-org/kedro/pull/1248/) rely on writing a new def run command which includes custom logic to do with the runner.

I am aware of this, and I really advocate against overriding the run command in my team. The main issues described above (lack of composability and flexibility, difficulties for maintenance and distribution) are really a major concern if you have several plugins overriding the run command with a high conflict risk. But I acknowledge it is the best current solution.

I have a couple of suggested modifications to your proposal. A runner.yml file makes sense, but to me it seems most natural that this would live in conf/env/

While reading my own issue again, I found out that I did not mention this, but in my mind it is completely natural that runner.yml lives inside conf/<env>. This is what I implied when writing that it behaves in a "catalog-like way", for the reasons you mention:

It seems to me there are maybe 3 options here along the lines we're thinking:

  1. [...] one global runner.yml file that specifies multiple runners, and you select your runner through --runner
  2. [...] one runner.yml file in each environment that specifies a single runner. You can specify only env in kedro run.
  3. [...] A combination of 1+2: one runner.yml file in each environment that specifies multiple runners. You can specify env and runner independently when you do a kedro run.

Given the clarification above, we both agree that option 1 is not the right solution. Of the two remaining options, I would be inclined to pick option 3. Option 2 seems very elegant (one could argue that getting rid of an argument in the run command makes the user experience easier here), but it feels restrictive.

It is quite common for me to have different environments (e.g. a "debug" one which persists all intermediary datasets, and an "integration" one where I log some datasets remotely instead of locally). I want to be able to run the project in different fashions within the same environment: at least the usual "run" while developing; a "prod-like" runner which recreates a virtual environment, to test locally how the project would behave in the CI; and a "service" runner which serves the pipeline instead of running it, to test the project as an API locally. It would increase the maintenance burden if I had to duplicate my environments each time I want to run the project with a different runner.