
Deployment strategy for Kedro pipeline #501

Closed 921kiyo closed 3 years ago

921kiyo commented 3 years ago

Problem statement

Users like Kedro because it is easy to get started with when prototyping a data pipeline. A question we often get from users is:

"I have built a Kedro pipeline. How do I deploy my pipeline in XYZ?"

where XYZ is one of various deployment platforms, such as Airflow, Kubeflow, AWS Glue, AWS Step Functions, Prefect, etc.

So far we have answered this question in the following ways:

  1. On a case-by-case basis
  2. We provide the Kedro-Airflow and Kedro-Docker plugins for some deployment targets.
  3. External users have built custom plugins, such as Kedro-Argo.

Limitations to the above approach are:

For 1, it's not scalable. For 2, Kedro-Airflow offers only limited customisation.

For 3, the community plugins are unofficial, so we cannot guarantee the quality of these external tools.

More and more users are building data pipelines with Kedro and want to deploy them to various platforms, so we need a common interface across these different deployment targets.
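To make the idea concrete, here is a minimal sketch of what such a common interface might look like. Everything here is hypothetical: `DeploymentTarget`, its method names, and `AirflowTarget` illustrate the shape of the abstraction, not an existing Kedro API.

```python
# Hypothetical sketch of a common deployment interface (not an existing Kedro API).
from abc import ABC, abstractmethod

from kedro.pipeline import Pipeline


class DeploymentTarget(ABC):
    """One subclass per platform (Airflow, Argo, AWS Batch, ...)."""

    @abstractmethod
    def package(self, project_path: str) -> str:
        """Bundle the project and all its dependencies, e.g. as a Docker image."""

    @abstractmethod
    def translate(self, pipeline: Pipeline) -> str:
        """Convert the Kedro pipeline into the platform's own workflow spec."""

    @abstractmethod
    def submit(self, workflow_spec: str) -> None:
        """Deploy the translated workflow to the target platform."""


class AirflowTarget(DeploymentTarget):
    """Illustrative only: each Kedro node would become one Airflow task."""

    def package(self, project_path: str) -> str:
        ...  # e.g. build an image with kedro-docker

    def translate(self, pipeline: Pipeline) -> str:
        ...  # e.g. render a DAG file with one operator per Kedro node

    def submit(self, workflow_spec: str) -> None:
        ...  # e.g. copy the DAG file into Airflow's dags/ folder
```

A platform plugin would then only need to implement these three steps, and Kedro could drive any target uniformly: package, translate, submit.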

Possible Implementation

When thinking about deployment strategy, there are several key considerations.

General guidance for deployment workflow

As a first iteration, we would also like to provide documentation/blog posts giving general guidance on the deployment workflow for the major platforms.

At a high level, there are three different deployment types: single-machine deployment, Docker-based clusters, and serverless. In a distributed deployment environment, the proposed workflow is as follows:

While each node might well have different dependencies, figuring out which dependencies each node requires would add an extra customisation burden on users, especially for a large pipeline. Thus, as a first iteration, we are packaging all of the dependencies together.
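For illustration, this is roughly the per-node entry point each container in such a distributed deployment might execute. It is a sketch assuming a recent Kedro version with the `KedroSession` API; the project path and node name here are invented and would be supplied by the orchestrator.

```python
# Sketch of the entry point each container could execute in a distributed
# deployment: run exactly one Kedro node. Assumes a recent Kedro version;
# the command-line arguments are supplied by the orchestrator.
import sys

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def run_single_node(project_path: str, node_name: str) -> None:
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path) as session:
        # Only the named node runs here; its inputs/outputs must therefore be
        # persisted datasets in the catalog (e.g. on S3), not in-memory ones.
        session.run(node_names=[node_name])


if __name__ == "__main__":
    run_single_node(project_path=sys.argv[1], node_name=sys.argv[2])
```

An orchestrator such as Argo or AWS Batch would then invoke this entry point once per node, with the pipeline's dependency graph translated into the platform's own task ordering.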

Next steps

As a bare-minimum MVP, we are planning to do the following:

Further extensions

Similar to how we promoted our PySpark docs to a starter, a possible further extension of the documentation would be to provide a template project as a starter, supplying the code/files necessary for users to convert a Kedro pipeline and run it on a deployment platform.
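As a rough illustration of the kind of file such a starter could generate, below is a hand-written sketch of an Airflow DAG in which each Kedro node becomes one Airflow task, which is the general approach Kedro-Airflow takes. The project path, DAG id, and node names are all invented for the example, and it assumes Airflow 2.x plus a recent Kedro version.

```python
# Hand-written sketch of a generated Airflow DAG: one Airflow task per Kedro
# node. Assumes Airflow 2.x and a recent Kedro; all names here are invented.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = "/opt/airflow/my-kedro-project"  # hypothetical project location


def run_kedro_node(node_name: str) -> None:
    """Run a single Kedro node inside an Airflow task."""
    bootstrap_project(PROJECT_PATH)
    with KedroSession.create(project_path=PROJECT_PATH) as session:
        session.run(node_names=[node_name])


with DAG(
    dag_id="my_kedro_pipeline",  # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    split = PythonOperator(
        task_id="split_data",
        python_callable=run_kedro_node,
        op_kwargs={"node_name": "split_data_node"},  # hypothetical node name
    )
    train = PythonOperator(
        task_id="train_model",
        python_callable=run_kedro_node,
        op_kwargs={"node_name": "train_model_node"},  # hypothetical node name
    )
    # The task ordering mirrors the Kedro pipeline's node dependencies.
    split >> train
```

Because each task calls back into the packaged Kedro project, the generated file stays thin; the actual node logic lives in the project itself, which is what makes the package-everything-together approach above workable.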

The starter could include:

Other possible interfaces for deployment are:

Related issue

yetudada commented 3 years ago

I'm going to close this issue because the incredible Kedro team has completed the first iteration of this. We'll be tracking the docs to see whether we continue development for AWS Batch, AWS SageMaker, Prefect, Kubeflow, Argo and Databricks.