datajoely opened 2 years ago
Let's discuss this as part of Technical Design and talk about what the plan is for `kedro-airflow` and how the operator would fit in.
Technical design discussion 30.11.2022:
We should figure out the direction of Kedro's Airflow support, and defer the answer to this point until then. @idanov to create a new issue related to this.
`kedro-airflow`:
- Pros: ?
- Cons: ?
Currently `kedro-airflow/plugin.py` creates a `*_dag.py` file containing a `KedroOperator` (one per `Pipeline`) using a Jinja2 template. The resulting DAG file is imported into Airflow, where it can then be run. This is different from how the code of other Airflow providers is structured.
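To illustrate the mechanism, here is a much-simplified sketch of that templating step. The template text, variable names, and `render_dag` helper below are all illustrative, not the plugin's real ones (the actual template in `kedro-airflow` is more elaborate and also wires up task dependencies):

```python
from jinja2 import Template

# Hypothetical, stripped-down stand-in for the plugin's real Jinja2 DAG template.
DAG_TEMPLATE = Template(
    """\
from airflow import DAG

# (the real template also emits the KedroOperator class definition here,
#  which is exactly the code this issue proposes moving into airflow.providers)

with DAG(dag_id="{{ dag_name }}") as dag:
{%- for node in nodes %}
    {{ node }} = KedroOperator(
        task_id="{{ node }}",
        pipeline_name="{{ pipeline_name }}",
        node_name="{{ node }}",
    )
{%- endfor %}
"""
)


def render_dag(dag_name: str, pipeline_name: str, nodes: list) -> str:
    """Render the *_dag.py source text that would be dropped into Airflow's dags/ folder."""
    return DAG_TEMPLATE.render(dag_name=dag_name, pipeline_name=pipeline_name, nodes=nodes)


if __name__ == "__main__":
    print(render_dag("spaceflights_dag", "__default__", ["preprocess", "train"]))
```

The key point for this discussion: the generated file is plain source text that Airflow's scheduler imports, rather than a DAG built on top of an installed provider package.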
To move this to `airflow.providers`, we would have to move the implementation of `KedroOperator` to the `airflow` repository.
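For context, the operator being discussed is roughly this shape. This is a minimal sketch only: `airflow.models.BaseOperator` and `kedro.framework.session.KedroSession` are replaced by trivial stand-ins so the snippet is self-contained, and the parameter names are illustrative rather than the plugin's exact signature.

```python
class BaseOperator:
    """Stand-in for airflow.models.BaseOperator (so this sketch runs without Airflow)."""

    def __init__(self, task_id: str):
        self.task_id = task_id


class StubSession:
    """Stand-in for kedro.framework.session.KedroSession; records run() calls."""

    def __init__(self):
        self.runs = []

    def run(self, pipeline_name, node_names=None):
        self.runs.append((pipeline_name, node_names))


class KedroOperator(BaseOperator):
    """One Airflow task per Kedro node: execute() runs exactly that node."""

    def __init__(self, task_id, package_name, pipeline_name, node_name,
                 project_path, env="local"):
        super().__init__(task_id=task_id)
        self.package_name = package_name
        self.pipeline_name = pipeline_name
        self.node_name = node_name
        self.project_path = project_path
        self.env = env

    def execute(self, context, session=None):
        # The real operator would do roughly:
        #   with KedroSession.create(self.package_name, self.project_path,
        #                            env=self.env) as session:
        #       session.run(self.pipeline_name, node_names=[self.node_name])
        session = session or StubSession()
        session.run(self.pipeline_name, node_names=[self.node_name])
        return session
```

Moving this class into the `airflow` repository would mean Airflow itself carries a (transitive) dependency on Kedro's session machinery, which is part of what the questions below are probing.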
Questions:
- Could we move only `KedroOperator`, but not the other `hooks` implementation?
- What's the requirement to become an official provider?

I'd like to know @marrrcin's and @sbrugman's opinions here. From https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow/ for example:
I quickly established my opinion about the quick start setup: the example given there is impractical, as it is flawed in a few ways that I'd like to avoid in my solution:
- First, it assumes that Airflow and Kedro know about each other. I would prefer to isolate these two environments so that I don't need to import Kedro in Airflow or Airflow in Kedro. As managing dependencies in Airflow is challenging, it would be better to avoid this problem altogether.
- Second, it follows from the above that both would need similar machine specifications, as they would be executed in the same environment.
- Third, as the code would be executed by the same processes, it would need to be shared in the form of packages. In this setup Airflow runs in a Docker image, so I'd have to either rebuild and rerun this image every time the Airflow or Kedro project code changes, OR additionally manage lots of virtual Python environments somewhere and ship the new versions of the micro-packaged Kedro pipelines there whenever the code changes.
On the other hand, as a non-k8s expert I'd be rather bummed if I always had to use Kubernetes to deploy Kedro on Airflow, and as such I understand that `kedro-airflow` provides a simpler experience. I know that @sbrugman uses it a lot.
`kedro-airflow` maintenance isn't great at the moment, and its docs aren't very informative (https://github.com/kedro-org/kedro-plugins/issues/394). I think getting more people to use it (including ourselves) is a prerequisite for this.
Yeah, what we've experienced is that the current docs for `kedro-airflow` somewhat neglect the "heavy lifting" part, especially when it comes to the Airflow setup (I know that it's not Kedro's responsibility to explain how to manage Airflow). Maybe it would be a good idea to have a warning at the top saying that deploying Kedro on Airflow requires some Airflow knowledge anyway, and that the quickstart is a quickstart for "Kedro on Airflow", not "Kedro plus Airflow" 🤔
Discussed in https://github.com/kedro-org/kedro/discussions/1716