Closed: 921kiyo closed this issue 3 years ago.
I'm going to close this issue because the incredible Kedro team completed the first iteration of this. We'll be tracking the docs to see if we continue development for AWS Batch, AWS Sagemaker, Prefect, Kubeflow, Argo and Databricks.
Problem statement
Users like Kedro because of how easy it is to get started prototyping a data pipeline. A question that we often get from users is:
"I have built a Kedro pipeline. How do I deploy my pipeline in XYZ?"
where XYZ is one of various deployment platforms, such as Airflow, Kubeflow, AWS Glue, Step Functions, Prefect, etc.
So far we have answered this question in the following ways:

1. Answering users' deployment questions individually.
2. Pointing users to the Kedro-Airflow plugin.
3. Pointing users to community-maintained deployment plugins.

Limitations of the above approaches are:

- For 1, it's not scalable.
- For 2, Kedro-Airflow offers limited customisation options.
- For 3, the community plugins are unofficial, so we cannot guarantee the quality of these external tools.
More and more users are building data pipelines with Kedro and want to deploy them to various platforms, so we need a common interface across these different deployment platforms.
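To make the idea of a "common interface" concrete, here is a minimal sketch of what such an abstraction could look like: a base class that each platform-specific deployer subclasses, analogous to how `AbstractDataSet` is subclassed per storage backend. All class and method names below are hypothetical illustrations, not part of Kedro's actual API.

```python
from abc import ABC, abstractmethod


class AbstractDeployer(ABC):
    """Hypothetical common interface for deploying a Kedro project.

    Each target platform (Airflow, Kubeflow, AWS Batch, ...) would provide
    its own subclass, mirroring the AbstractDataSet pattern.
    """

    def __init__(self, project_path: str):
        self.project_path = project_path

    @abstractmethod
    def deploy(self) -> str:
        """Package the Kedro project and submit it to the target platform."""


class AirflowDeployer(AbstractDeployer):
    def deploy(self) -> str:
        # A real implementation would render the pipeline as an Airflow DAG;
        # here we only return a description of the action taken.
        return f"deploying {self.project_path} as an Airflow DAG"
```

A caller would then not need to know platform specifics, e.g. `AirflowDeployer("/path/to/project").deploy()`.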
Possible Implementation
When thinking about deployment strategy, there are several key considerations.
General guidance for deployment workflow
As a first iteration, we would also like to provide documentation/blog posts offering general guidance on the deployment workflow for major platforms.
At a high level, there are 3 different deployment targets: single-machine deployment, Docker-based clusters, and serverless. In a distributed deployment environment, the proposed workflow runs each node of the pipeline individually via `kedro pipeline --node=xx`. This approach gives users the flexibility to run more than one node in a cluster (e.g. `kedro pipeline --from-node=xx --to-node=xx`) if they wish. While it might be the case that each node has different dependencies, figuring out which dependencies each node requires would add an extra customisation burden for users, especially for a large pipeline. Thus, as a first iteration, we are packaging all the dependencies together.
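As a rough illustration of the node-per-container idea, a deployment script could generate one run command per node, so each node can be scheduled as a separate task in a cluster. The snippet below uses the `kedro pipeline --node` syntax proposed in this issue (the actual Kedro CLI flags may differ), and the image and node names are hypothetical.

```python
from typing import List


def node_commands(node_names: List[str], image: str) -> List[str]:
    """Build one `docker run` command per pipeline node.

    Each command invokes a single node inside the packaged project image,
    mirroring the issue's proposed `kedro pipeline --node=xx` workflow.
    """
    return [
        f"docker run {image} kedro pipeline --node={name}"
        for name in node_names
    ]


# Example: three nodes scheduled as three separate cluster tasks.
commands = node_commands(["preprocess", "train", "evaluate"], "my-kedro-image")
```

A cluster scheduler (e.g. a DAG orchestrator) would then submit these commands as tasks in topological order, since packaging all dependencies into one image means every task can run from the same container.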
Next steps
As a bare-minimum MVP, we are planning to do the following:
Further extensions
Similar to how we promoted our PySpark docs to a starter, a possible further extension to the documentation would be to provide a template project as a starter, supplying the code/files necessary for users to convert a Kedro pipeline and run it on a deployment platform.
The starter could include:
Another possible interface for deployment would be an abstraction similar to `AbstractDataSet`. For example, Prefect provides an `Agent` class, and each deployment platform (AWS, Azure, GCP) overrides its `deploy_flow` method.

Related issue