
Deployment strategy for Kedro pipeline #501

Closed 921kiyo closed 3 years ago

921kiyo commented 3 years ago

Problem statement

Users like Kedro because it is easy to get started with when prototyping a data pipeline. A question we often get from users is:

"I have built a Kedro pipeline. How do I deploy my pipeline in XYZ?"

where XYZ is one of various deployment platforms, such as Airflow, Kubeflow, AWS Glue, AWS Step Functions, Prefect, etc.

So far we have answered this question in the following ways:

  1. On a case-by-case basis
  2. We provide the Kedro-Airflow and Kedro-Docker plugins for some deployment targets.
  3. External users have built custom plugins, such as Kedro-Argo.

Limitations to the above approach are:

For 1, it's not scalable. For 2, Kedro-Airflow offers only limited customisation.

For 3, the community plugins are unofficial, so we cannot guarantee the quality of these external tools.

More and more users are building data pipelines with Kedro and want to deploy them to various platforms, so we need a common interface across these different deployment targets.
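To make the idea concrete, here is a minimal sketch of what such a common interface might look like. Everything here is hypothetical: `DeploymentTarget`, its method names, and `AirflowTarget` illustrate the shape of the abstraction, not an existing Kedro API.

```python
# Hypothetical sketch of a common deployment interface (not an existing Kedro API).
from abc import ABC, abstractmethod

from kedro.pipeline import Pipeline


class DeploymentTarget(ABC):
    """One subclass per platform (Airflow, Argo, AWS Batch, ...)."""

    @abstractmethod
    def package(self, project_path: str) -> str:
        """Bundle the project and all its dependencies, e.g. as a Docker image."""

    @abstractmethod
    def translate(self, pipeline: Pipeline) -> str:
        """Convert the Kedro pipeline into the platform's own workflow spec."""

    @abstractmethod
    def submit(self, workflow_spec: str) -> None:
        """Deploy the translated workflow to the target platform."""


class AirflowTarget(DeploymentTarget):
    """Illustrative only: each Kedro node would become one Airflow task."""

    def package(self, project_path: str) -> str:
        ...  # e.g. build an image with kedro-docker

    def translate(self, pipeline: Pipeline) -> str:
        ...  # e.g. render a DAG file with one operator per Kedro node

    def submit(self, workflow_spec: str) -> None:
        ...  # e.g. copy the DAG file into Airflow's dags/ folder
```

A platform plugin would then only need to implement these three steps, and Kedro could drive any target uniformly: package, translate, submit.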

Possible Implementation

When thinking about deployment strategy, there are several key considerations.

General guidance for deployment workflow

As a first iteration, we would also like to provide documentation/blog posts giving general guidance on the deployment workflow for the major platforms.

At a high level, there are three different deployment types: single-machine deployment, Docker-based clusters, and serverless. In a distributed deployment environment, the proposed workflow is as follows:

While each node might well have different dependencies, figuring out which dependencies each node requires would add an extra customisation burden on users, especially for a large pipeline. Thus, as a first iteration, we are packaging all of the dependencies together.
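For illustration, this is roughly the per-node entry point each container in such a distributed deployment might execute. It is a sketch assuming a recent Kedro version with the `KedroSession` API; the project path and node name here are invented and would be supplied by the orchestrator.

```python
# Sketch of the entry point each container could execute in a distributed
# deployment: run exactly one Kedro node. Assumes a recent Kedro version;
# the command-line arguments are supplied by the orchestrator.
import sys

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def run_single_node(project_path: str, node_name: str) -> None:
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path) as session:
        # Only the named node runs here; its inputs/outputs must therefore be
        # persisted datasets in the catalog (e.g. on S3), not in-memory ones.
        session.run(node_names=[node_name])


if __name__ == "__main__":
    run_single_node(project_path=sys.argv[1], node_name=sys.argv[2])
```

An orchestrator such as Argo or AWS Batch would then invoke this entry point once per node, with the pipeline's dependency graph translated into the platform's own task ordering.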

Next steps

As a bare-minimum MVP, we are planning to do the following:

Further extensions

Similar to how we promoted our PySpark docs to a starter, a possible further extension of the documentation would be to provide a template project as a starter, supplying the code/files necessary for users to convert a Kedro pipeline and run it on a deployment platform.
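As a rough illustration of the kind of file such a starter could generate, below is a hand-written sketch of an Airflow DAG in which each Kedro node becomes one Airflow task, which is the general approach Kedro-Airflow takes. The project path, DAG id, and node names are all invented for the example, and it assumes Airflow 2.x plus a recent Kedro version.

```python
# Hand-written sketch of a generated Airflow DAG: one Airflow task per Kedro
# node. Assumes Airflow 2.x and a recent Kedro; all names here are invented.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_PATH = "/opt/airflow/my-kedro-project"  # hypothetical project location


def run_kedro_node(node_name: str) -> None:
    """Run a single Kedro node inside an Airflow task."""
    bootstrap_project(PROJECT_PATH)
    with KedroSession.create(project_path=PROJECT_PATH) as session:
        session.run(node_names=[node_name])


with DAG(
    dag_id="my_kedro_pipeline",  # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    split = PythonOperator(
        task_id="split_data",
        python_callable=run_kedro_node,
        op_kwargs={"node_name": "split_data_node"},  # hypothetical node name
    )
    train = PythonOperator(
        task_id="train_model",
        python_callable=run_kedro_node,
        op_kwargs={"node_name": "train_model_node"},  # hypothetical node name
    )
    # The task ordering mirrors the Kedro pipeline's node dependencies.
    split >> train
```

Because each task calls back into the packaged Kedro project, the generated file stays thin; the actual node logic lives in the project itself, which is what makes the package-everything-together approach above workable.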

The starter could include:

Other possible interfaces for deployment are:

Related issue

yetudada commented 3 years ago

I'm going to close this issue because the incredible Kedro team has completed the first iteration of this. We'll be tracking the docs to see whether we continue development for AWS Batch, AWS SageMaker, Prefect, Kubeflow, Argo and Databricks.