kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Research synthesis on evaluating the Kedro and Databricks workflow #2105

Closed · yetudada closed this 1 year ago

yetudada commented 1 year ago

Introduction

This ticket closes #1653; it summarises our findings and what we're going to do next.

Why did we conduct this research?

Databricks is a machine-learning platform that is primarily used to run Spark-based workflows. We have a growing category of Kedro users who rely on Databricks to run their data and machine-learning pipelines on large datasets.

Databricks appears to be the dominant machine-learning (ML) platform among respondents who reported using an ML platform at all.

πŸ—³οΈ Adoption is:

What did we want to find out?

All objectives are listed in #1653; they include:

How did we conduct this research?

The research used qualitative (interviews 🎤) and quantitative (polls 🗳️ and a survey 📊) methods across our open-source user base.

Participant count across the entire user base:

Note: We also use these emojis as keys to indicate where data or insights come from.

Insights

Key barriers to adoption of Kedro when Databricks is used

Certain factors affect Kedro adoption when Databricks is part of the workflow:

User workflows while using an IDE with Databricks

Our users need to do a few things to be successful while developing their ML products on Databricks, using Kedro and an IDE.

Ranked in terms of priority, they need to be able to:

1. Have the latest version of their codebase to run their pipeline using Databricks
2. Run their Kedro pipelines using a Databricks cluster
3. Test their code while in the experimental phase to see that it's working
4. Write production code when they are happy with the way their code works
5. Schedule and monitor pipeline runs
6. Track their ML experiments and save important artefacts
7. Visualise their pipeline
8. Version data

Priority workflows

Have the latest version of their codebase to run their pipeline using Databricks

Workflow 1: 📊 70% of our users make changes in an IDE and use git to pull those changes into the Databricks workspace by:

Workflow 2: 📊 9% of users use dbx sync to keep their workspace synced to Databricks Repos:

Run their Kedro pipelines using a Databricks cluster

Workflow 1: 📊 81% of users run their Kedro pipelines from a Databricks notebook once they have the latest version of their code:
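As an illustration (not taken from the survey responses), a minimal sketch of what such a notebook-driven run looks like, using Kedro's standard session bootstrap; the repo path is hypothetical:

```python
# Minimal sketch: launching a Kedro run from a Databricks notebook cell.
# Assumes the project has been pulled into Databricks Repos; the path is hypothetical.
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_root = "/Workspace/Repos/<user>/my-kedro-project"  # hypothetical location

bootstrap_project(project_root)  # register the project's settings and pipelines
with KedroSession.create(project_path=project_root) as session:
    session.run()  # executes the default pipeline
```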

Workflow 2: 📊 10% of users run their Kedro pipelines by packaging them, uploading the package to file storage (Databricks File System or Azure Blob Storage), and then running it as a Python package:
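Under that packaging workflow, once the wheel produced by `kedro package` is installed on the cluster, the run itself is typically just the packaged entry point. A minimal sketch, where the package name `my_project` is hypothetical:

```python
# Minimal sketch: running a packaged Kedro project on the cluster.
# `kedro package` builds a wheel in dist/; after installing it, the
# package exposes a main() entry point ("my_project" is hypothetical).
from my_project.__main__ import main

main()  # equivalent to `kedro run` for the packaged project
```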

Workflow 3: 📊 9% of users still use Databricks Connect from their IDE even though it will be sunset.

What are we going to do?

What are our high priorities?

What are our medium priorities?

What are our low priorities?

What have we already fixed?

What will we not be able to address?

yetudada commented 1 year ago

We've shipped our Databricks docs: https://docs.kedro.org/en/stable/deployment/databricks/index.html

We'll track adoption and would appreciate help with the Databricks plugin.

vitoravancini commented 7 months ago

Hello, I know this is a closed issue, but Databricks recently un-deprecated databricks-connect: it is back in the official documentation and new versions are being released.

Do you think this could change anything for the Kedro + Databricks approach? Databricks Connect was by far the best experience I've had developing Kedro pipelines with Databricks.

Thank you!

noklam commented 7 months ago

@vitoravancini https://kedro.org/blog/how-to-integrate-kedro-and-databricks-connect

It works quite well when you have a Spark-only pipeline; the blog post describes this in more detail.
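For reference, the newer Spark Connect-based databricks-connect reduces the client-side setup to something like the minimal sketch below (it assumes databricks-connect v13+ is installed and workspace authentication is already configured; nothing here is Kedro-specific):

```python
# Minimal sketch of the Spark Connect-based databricks-connect (v13+).
# Host, token, and cluster are resolved from the Databricks config/environment.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()  # remote SparkSession

# From here the session behaves like a local SparkSession, so a Spark-only
# Kedro pipeline can run against the cluster from a local IDE.
spark.range(5).show()
```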