kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[Q] Kedro + Kubeflow = ? #353

Closed: jaklan closed this issue 3 years ago

jaklan commented 4 years ago

Hi guys, at the very beginning: huge kudos for all the work you have already put into Kedro's development. The tool is really great, but I'd like to understand one more thing:

how do you see the future of Kedro in the context of Kubeflow (Pipelines) and related tools like e.g. Rok or Kale?

Do you see Kedro + Airflow as an alternative solution? Or do you see a place for Kedro in some end-to-end Kubeflow workflow? If the latter, is that something you already have on your roadmap? Or do you just have an initial vision of what such an integration could look like?

yetudada commented 4 years ago

Hi @jaklan! Thanks for raising this question.

Let me work my way through your questions, but I want to start by asking: what do you like most about Kedro, and how long have you been using it?

You've actually hit on something that we're currently building towards; this work is on our H2 roadmap. What we're trying to do is give you a way to work in Kedro but have different deployment targets based on what you're trying to do.

Kedro-Airflow was our first foray into this world because users needed to work in Airflow but preferred to work in Kedro. We'll be looking at a few open-source tools like Argo, Kubeflow and Prefect, as well as cloud-based tooling like AWS Glue and more. We'll also check out Rok and Kale on your recommendation. What has your experience been like with Kubeflow?

jaklan commented 4 years ago

@yetudada thanks for the answer, and sorry for the late response; we've had a busy time.

Together with @kaemo, we are core members of the team responsible for MLOps adoption in our company - in general, we define standards & guidelines for conducting end-to-end Data Science projects.

Until now, we were working on an in-house framework to assure the quality of our PoCs, incl. Project Template (Cookiecutter), CLI (Click), Code Quality (Black, flake8, isort etc. as pre-commit hooks), Data Management & Quality (GoodTables, DVC), Experiment Tracking & Model Management (MLflow), Containerization (Docker), Pipeline Management etc. - so, as you can see, something very close to Kedro.

Because of the many design similarities, and because Kedro is more mature, open source (our codebase was inner source) and has an existing community, we decided to replace our solution with Kedro and start contributing to it (@kaemo is already working on a kedro-mlflow plugin), so we can focus more on the productionization phase.

As a result, we are currently evaluating different scenarios for moving smoothly from the PoC phase to the production phase. One of them is using Kedro during the PoC phase and then translating its pipelines into Kubeflow Pipelines if we decide to go to prod (i.e. when a proper pipeline orchestrator is needed).

Why such a workflow? Developing PoCs in Kedro would give us a proper level of standardisation and quality, and would accustom our Data Scientists to the Nodes & Pipelines concept - and we found that easier to achieve with Kedro than with Kubeflow at that phase (the K8s & resources overhead during local / on-prem development is quite overwhelming, in our opinion).
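For readers unfamiliar with the Nodes & Pipelines concept mentioned above, here is a minimal sketch of the idea in plain Python (this is an illustration of the concept, not the actual Kedro API, which lives in `kedro.pipeline`): a node is a pure function with named inputs and outputs, and a pipeline wires nodes together through those dataset names.

```python
# Illustration of the node/pipeline idea (plain Python, NOT the Kedro API):
# a node is a function plus the dataset names it reads and writes,
# and a pipeline runs nodes in dependency order via a shared catalog.

def make_node(func, inputs, outputs):
    """A 'node': a function plus the dataset names it reads and writes."""
    return {"func": func, "inputs": inputs, "outputs": outputs}

def run_pipeline(nodes, catalog):
    """Repeatedly run any node whose inputs are available in the catalog."""
    pending = list(nodes)
    while pending:
        for node in pending:
            if all(name in catalog for name in node["inputs"]):
                args = [catalog[name] for name in node["inputs"]]
                catalog[node["outputs"]] = node["func"](*args)
                pending.remove(node)
                break
        else:
            raise ValueError("pipeline has unsatisfiable inputs")
    return catalog

# Two toy nodes: clean the raw data, then compute a feature from it.
nodes = [
    make_node(lambda raw: [x for x in raw if x is not None], ["raw"], "clean"),
    make_node(lambda clean: sum(clean) / len(clean), ["clean"], "mean"),
]
catalog = run_pipeline(nodes, {"raw": [1, None, 2, 3]})
print(catalog["mean"])  # 2.0
```

Because nodes only declare named inputs and outputs, data scientists can develop and test each node locally as an ordinary function, which is exactly the low-overhead property described above.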

What about the prod phase, then? We have a feeling the high-level concepts of Kedro Pipelines are very similar to KF Pipelines, which brings us back to the original question: have you evaluated the possibility of such an integration / translation, and do you think it makes sense in general? We believe it should be doable (e.g. we have seen a Kedro extension that makes it possible to define pipelines in YAML), but we can also imagine some potential issues (e.g. the Kedro Context and Data Catalog concepts?). Of course, at the end of the day, the most important thing is your vision for Kedro's future and the key milestones on your roadmap. We are not even sure whether such a transition to prod-level solutions is in your current scope - and if it is, do you consider the Airflow integration the preferred / default one for the near future, or do you have a different long-term vision?

From our perspective, answering the above questions is crucial for further Kedro adoption & support.

Finally, it's also worth answering one question from our side: why Kubeflow and not e.g. Airflow? Although Airflow is a mature project with a huge community and is successfully used in data engineering scenarios, we believe it is not well suited to handling Machine Learning workflows.

You can find more details in the very good post titled Why not Airflow?, which elaborates on the problems we see and points out a few more.

yetudada commented 4 years ago

Hi @jaklan, it's my turn to apologise for the late reply to you and @kaemo.

Your internal framework sounds a lot like Kedro! I think you ran into the same problems that we did. I'm glad you picked up Kedro in the end, because you're helping us make it better. Just a note to @kaemo: this week we begin a sprint to open-source an internal tool that does experiment tracking. It's built on top of MLflow; I'll circle back to this thread when we finally have it out. And a question for you: why did you use DVC and MLflow?

Now, the way you're thinking about your workflow is exactly how we've pictured fitting into the ecosystem. We see ourselves as a tool that helps you create something that can be orchestrated. To that point, we have something on our roadmap called "The Universal Deployer", which will allow us to easily map our pipeline abstraction onto a task-based workflow and therefore make it easier to deploy to Kubeflow, Argo, Prefect, AWS Batch and more. The team has already prototyped the Prefect and AWS Batch iterations of this. The Kedro-Airflow plugin will also need to be rebuilt with this new system in mind. Kubeflow, therefore, is on our roadmap.
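The core of such a pipeline-to-orchestrator mapping can be sketched in a few lines. This is an illustration of the general idea, not Kedro's or Kubeflow's actual API, and the node names are made up: infer task-to-task dependencies from which node produces the datasets another node consumes, then emit the tasks in a valid execution order.

```python
# Sketch of mapping a dataset-driven pipeline onto a task-based workflow
# (illustrative only): derive task dependencies from dataset producers,
# then topologically sort the tasks for an orchestrator to schedule.
from graphlib import TopologicalSorter

# node name -> (input dataset names, output dataset names)
nodes = {
    "preprocess": (["raw"], ["clean"]),
    "features":   (["clean"], ["features"]),
    "train":      (["features", "params"], ["model"]),
}

# Which node produces each dataset ("raw" and "params" are external inputs).
producers = {out: name for name, (_, outs) in nodes.items() for out in outs}

# A node depends on every node that produces one of its inputs.
deps = {
    name: {producers[i] for i in ins if i in producers}
    for name, (ins, _) in nodes.items()
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # ['preprocess', 'features', 'train']
```

Each entry in `order` would then become one task (e.g. one container) in the target orchestrator, with the dependency edges carried over, which is how a Kedro-style data-centric pipeline can be expressed as a Kubeflow-style task graph.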

We are aware of the problems you've indicated with Airflow. Airflow is challenging to use, and the Kedro-Airflow plugin was built because our teams wanted the ease of working in Kedro while still being able to deploy to Airflow for clients if required.

But observing industry trends, I think we're starting to move past Airflow. There are exciting developments in new orchestrators.

yetudada commented 4 years ago

Also, if you and @kaemo want to spend some time with us so that we can learn more about your workflow, just fill out this link. We'll reach out and host a hangout.

jaklan commented 3 years ago

Hi @yetudada, I've just filled out the form above. We would especially like to discuss the "Universal Deployer", because we currently have active discussions about the future use of Kedro for productionisation. At this point we see a need to work on the kedro-argo plugin to make that integration smoother, and from there to push towards an integration with Kubeflow. But we also wonder whether that is the right direction: the plugin is not developed by your team and could turn out to be incompatible with the "Universal Deployer", especially as a base for the Kubeflow integration. We would therefore like to talk about your vision & next steps before we allocate our time improperly. Is there a chance to have that meeting quite soon (even this week)?

yetudada commented 3 years ago

We're so excited to meet you! 🚀 Talk soon!

yetudada commented 3 years ago

So what we've done with this ticket is take our first stab at an MVP for this: we've created documentation on how to deploy Kedro on Kubeflow. We'll be tracking that page to see how much interest there is, and if there's enough demand, we'd build for it. There's also a guide for Argo, too.

szczeles commented 3 years ago

@jaklan @yetudada Hey! You may want to check out the kedro-kubeflow (docs) plugin that we're developing at GetInData. It schedules Kedro nodes as separate Docker containers running in Kubeflow Pipelines, and for now it solves most of the issues that @jaklan listed.

If you'd like to experiment, we have a quickstart scenario described: https://kedro-kubeflow.readthedocs.io/en/0.3.0/source/03_getting_started/01_quickstart.html


mzjp2 commented 3 years ago

That looks awesome :D

It might be worth adding it to the list of community plugins in the Kedro documentation: https://kedro.readthedocs.io/en/latest/07_extend_kedro/04_plugins.html#community-developed-plugins

szczeles commented 3 years ago

@mzjp2 Good idea! Just added: #680