kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

Introduce KedroOperator for Airflow #482

Open datajoely opened 2 years ago

datajoely commented 2 years ago

Discussed in https://github.com/kedro-org/kedro/discussions/1716

Originally posted by **kangshung** July 20, 2022

Hey, are there any plans to add KedroOperator to [an official Airflow provider list](https://github.com/apache/airflow/tree/main/airflow/providers)? This would really make this operator ["official"](https://airflow.apache.org/docs/). KedroSession, if needed, could probably be passed in a different way.

merelcht commented 1 year ago

Let's discuss this as part of Technical Design and talk about what the plan is for kedro-airflow and how the operator would fit in.

jmholzer commented 1 year ago

Technical design discussion 30.11.2022:

  1. Do we want to be an official Airflow provider?

We should first figure out the direction of Kedro's Airflow support and defer the answer to this point until then. @idanov to create a new issue related to this.

Pros:

Cons:

  2. What changes would we need to make to kedro-airflow?

Currently kedro-airflow/plugin.py creates a *_dag.py file containing KedroOperator (one per Pipeline) using a Jinja2 template. The resulting DAG file is imported into Airflow, where it can then be run. This differs from how the code of other Airflow providers is structured.
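To make the generated-file approach concrete, here is a minimal sketch of rendering a DAG file from a template. It is not the actual kedro-airflow template: it uses the standard library's `string.Template` as a stand-in for Jinja2 so that it runs without extra dependencies. The names `render_dag`, `DAG_TEMPLATE`, `TASK_TEMPLATE`, the project name, and the path are hypothetical. The operator body assumes a recent Kedro and Airflow 2.x, and the sketch omits the dependencies between tasks.

```python
from string import Template

# Hypothetical, simplified DAG template (the real plugin uses Jinja2).
DAG_TEMPLATE = Template('''\
from datetime import datetime

from airflow import DAG
from airflow.models import BaseOperator
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


class KedroOperator(BaseOperator):
    """Run one Kedro node in the Airflow worker's own Python environment."""

    def __init__(self, package_name, pipeline_name, node_name, project_path, env, **kwargs):
        super().__init__(**kwargs)
        self.package_name = package_name
        self.pipeline_name = pipeline_name
        self.node_name = node_name
        self.project_path = project_path
        self.env = env

    def execute(self, context):
        configure_project(self.package_name)
        with KedroSession.create(project_path=self.project_path, env=self.env) as session:
            session.run(self.pipeline_name, node_names=[self.node_name])


with DAG(dag_id=$dag_id, start_date=datetime(2023, 1, 1)) as dag:
    tasks = {
$tasks
    }
''')

# One dictionary entry per Kedro node; values are filled in with repr() so
# that they appear as valid Python string literals in the generated file.
TASK_TEMPLATE = Template(
    "        $name: KedroOperator(task_id=$name, package_name=$package, "
    "pipeline_name=$pipeline, node_name=$name, project_path=$path, env=$env),"
)


def render_dag(package, pipeline, node_names, project_path, env="local"):
    """Return the source text of a DAG file with one KedroOperator task per node."""
    tasks = "\n".join(
        TASK_TEMPLATE.substitute(
            name=repr(node),
            package=repr(package),
            pipeline=repr(pipeline),
            path=repr(project_path),
            env=repr(env),
        )
        for node in node_names
    )
    return DAG_TEMPLATE.substitute(dag_id=repr(f"{package}-{pipeline}"), tasks=tasks)


dag_source = render_dag(
    "my_project", "__default__", ["split_data", "train_model"], "/opt/my_project"
)
# Syntax check only: compile() does not import Airflow or Kedro or run the DAG.
compile(dag_source, "my_project_dag.py", "exec")
```

Because the operator class is written into every generated file, there is no importable `KedroOperator` module, which is the structural difference from a provider package noted above.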
noklam commented 1 year ago

To move this to airflow.providers, we would have to move the implementation of KedroOperator into the Airflow repository.

Questions:

astrojuanlu commented 9 months ago

I'd like to know @marrrcin's and @sbrugman's opinions here. From https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow/ for example:

I quickly established my opinion about the quick start setup - the example given there is unpractical, as it is flawed in a few ways that I'd like to avoid in my solution:

  • First, it assumes that Airflow and Kedro know about each other. I would prefer to isolate these two environments so that I don't need to import Kedro in Airflow or Airflow in Kedro. As managing dependencies in Airflow is challenging, it would be better to avoid this problem altogether.
  • From the above it seems that both would have to have similar needs regarding the machine specifications they run on, as they would be executed in the same environment.
  • Thirdly, as the code would be executed by the same processes, it would need to be shared in the form of packages. In this setup Airflow runs in a docker image, so then I'd have to either re-build and re-run this image every time either the Airflow or Kedro project code changes, OR additionally manage lots of virtual python environments somewhere and ship the new versions of the micro-packaged Kedro pipelines there whenever the code changes.
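As an illustration of the isolation argued for in the bullets above, the Airflow side can treat a Kedro project as an opaque command line rather than importing it. The sketch below only builds that command. The helper name `kedro_run_command` is hypothetical, and the flag names are assumed to match recent Kedro releases (older releases used `--node`, so check `kedro run --help` for the installed version). How the command is then executed (a container, a separate virtual environment, and so on) is left open.

```python
import shlex


def kedro_run_command(pipeline, node=None, env=None):
    """Build an argument list for `kedro run` without importing Kedro.

    Flag names are an assumption based on recent Kedro releases.
    """
    cmd = ["kedro", "run", "--pipeline", pipeline]
    if node is not None:
        cmd += ["--nodes", node]
    if env is not None:
        cmd += ["--env", env]
    return cmd


cmd = kedro_run_command("__default__", node="train_model", env="production")
# A shell-safe string form, e.g. for an operator that takes a single command string.
cmd_string = shlex.join(cmd)
```

With this split, the scheduler environment only needs to know the command, and the Kedro project and its dependencies live wherever the command runs.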

On the other hand, as a non-k8s expert I'd be rather bummed if I had to always use Kubernetes to deploy Kedro on Airflow, and as such I understand that kedro-airflow provides a simpler experience. I know that @sbrugman uses it a lot.

kedro-airflow maintenance isn't great at the moment, and its docs aren't very informative (https://github.com/kedro-org/kedro-plugins/issues/394). I think getting more people to use it (including ourselves) is a prerequisite for this.

marrrcin commented 8 months ago

Yeah, what we've experienced is that the current docs for kedro-airflow somewhat neglect the "heavy lifting" part, especially when it comes to the Airflow setup (I know that it's not Kedro's responsibility to explain how to manage Airflow). Maybe it would be a good idea to have a warning at the top saying that deploying Kedro on Airflow requires some Airflow knowledge anyway, and that the quickstart is a quickstart for "Kedro on Airflow", not "Kedro plus Airflow" 🤔