kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

kedro-airflow: updating support for Kubernetes #652

Open DimedS opened 2 months ago

DimedS commented 2 months ago

Description

To facilitate running Kedro with Airflow on Kubernetes, the kedro-airflow-k8s plugin was developed. However, it only supports Kedro versions up to 0.18.0, while the current version is 0.19.4. Consequently, we have moved the recommendation to use this plugin to the end of our Airflow deployment documentation. We now need to determine the best approach for running Kedro with Airflow on Kubernetes going forward.

astrojuanlu commented 2 months ago

@lasica @marrrcin Any thoughts? Are you accepting PRs on getindata/kedro-airflow-k8s?

marrrcin commented 2 months ago

You can use the official one and run on k8s. See https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow/

DimedS commented 1 month ago

As I understand:

If I have a Kubernetes cluster, I can deploy Airflow there using Helm, customising the deployment with a values.yaml file and a custom Docker image that runs my Kedro project's DAG.

So technically, I don't need anything special to run Kedro on Airflow deployed on a Kubernetes cluster; it's enough to use a DAG created by the kedro-airflow plugin. However, this setup only allows me to run one Kedro project per Airflow deployment. If I want to run multiple projects in the same Airflow deployment, I can use the KubernetesPodOperator() for each Airflow task (i.e., Kedro node). This will execute each task in an isolated, customised container in a separate Kubernetes Pod, with the KubernetesExecutor dynamically managing all these pods.
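The 1:1 mapping described above can be sketched in plain Python. This is a minimal illustration, not the plugin's actual code: the node names are hypothetical, and in a real Airflow DAG each command list would be passed to a `KubernetesPodOperator` (e.g. via its `cmds`/`arguments` parameters) so that every Kedro node runs in its own pod.

```python
# Sketch: one container command per Kedro node (1:1 node-to-task mapping).
# Node names are hypothetical examples; in a real DAG each command would
# become the command of a separate KubernetesPodOperator task.

def node_command(node_name: str, env: str = "airflow") -> list[str]:
    """Build the container command that runs a single Kedro node."""
    return ["kedro", "run", f"--nodes={node_name}", f"--env={env}"]

nodes = ["preprocess_companies", "preprocess_shuttles", "train_model"]
commands = [node_command(n) for n in nodes]
```

With this mapping, an Airflow DAG of N tasks is generated for a Kedro pipeline of N nodes, and each task spins up its own isolated pod.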

However, this approach might be inefficient if there are many Kedro nodes, as it requires deploying a container per node. It's better to group nodes to reduce the number of tasks, and thus the number of pods. If I understood correctly, additional functionality in the kedro-airflow plugin to help modify the generated DAG by inserting the KubernetesPodOperator() and KubernetesExecutor parts would be beneficial.
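The grouping idea can be sketched the same way. This is an assumed illustration (group names and members are hypothetical): since `kedro run --nodes` accepts a comma-separated list of node names, each group of nodes can be collapsed into a single container command, i.e. one Airflow task and one pod per group (N nodes to M tasks, M < N).

```python
# Sketch: N:M grouping -- several Kedro nodes share one task/pod.
# Groups are hypothetical; each group becomes one container command,
# which in a real DAG would be a single KubernetesPodOperator task.

def group_command(node_names: list[str], env: str = "airflow") -> list[str]:
    """Build the container command that runs a whole group of Kedro nodes."""
    return ["kedro", "run", f"--nodes={','.join(node_names)}", f"--env={env}"]

groups = {
    "preprocessing": ["preprocess_companies", "preprocess_shuttles"],
    "modelling": ["split_data", "train_model"],
}
commands = {name: group_command(members) for name, members in groups.items()}
```

The remaining design question is how groups are chosen (e.g. by tag or by pipeline topology), which is what a plugin-level feature would need to decide.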

Do you have the same opinion, @marrrcin? Is using the KubernetesPodOperator() for each task a good solution?

marrrcin commented 1 month ago

Hi, so the solution I've linked above (https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow/) does exactly that: it either runs N:N (one Airflow task per Kedro node) or, with grouping, N:M (multiple nodes per task). It also allows you to use the same Airflow deployment and run multiple Kedro projects within the same instance with full isolation. Imho that's the best approach here. I would say that the default template should encourage using KubernetesPodOperator.