kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Research synthesis on evaluating the Kedro and Databricks workflow #2105

Closed · yetudada closed this 1 year ago

yetudada commented 1 year ago

Introduction

This ticket closes #1653; it summarises our findings and what we're going to do next.

Why did we conduct this research?

Databricks is a machine-learning platform that is primarily used to run Spark-based workflows. We have a growing category of Kedro users who rely on Databricks to run their data and machine-learning pipelines on large datasets.

Databricks appears to be the dominant machine-learning (ML) platform among respondents who reported using an ML platform at all.

πŸ—³οΈ Adoption is:

What did we want to find out?

All objectives are listed in #1653; they include:

How did we conduct this research?

The research used qualitative (interviews 🎤) and quantitative (polls 🗳️ and a survey 📊) methods across our open-source user base.

Participant count across the entire user base:

Note: We also use these emojis as keys to indicate where data or insights come from.

Insights

Key barriers to adoption of Kedro when Databricks is used

Certain factors affect Kedro adoption when Databricks is part of the workflow:

User workflows while using an IDE with Databricks

Our users need to do a few things to be successful while developing their ML products on Databricks, using Kedro and an IDE.

Ranked in terms of priority, they need to be able to:

1. Have the latest version of their codebase to run their pipeline using Databricks
2. Run their Kedro pipelines using a Databricks cluster
3. Test their code while in the experimental phase to see that it's working
4. Write production code when they are happy with the way their code works
5. Schedule and monitor pipeline runs
6. Track their ML experiments and save important artefacts
7. Visualise their pipeline
8. Version data

Priority workflows

Have the latest version of their codebase to run their pipeline using Databricks

Workflow 1: 📊 70% of our users make changes in an IDE and use git to pull those changes into the Databricks workspace by:

Workflow 2: 📊 9% of users use dbx sync to keep their workspace synced to Databricks Repos:

Run their Kedro pipelines using a Databricks cluster

Workflow 1: 📊 81% of users run their Kedro pipelines from a Databricks notebook once they have the latest version of their code:
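As an illustration (not taken from the survey responses), a minimal sketch of what such a notebook-driven run looks like, using Kedro's standard session bootstrap; the repo path is hypothetical:

```python
# Minimal sketch: launching a Kedro run from a Databricks notebook cell.
# Assumes the project has been pulled into Databricks Repos; the path is hypothetical.
from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

project_root = "/Workspace/Repos/<user>/my-kedro-project"  # hypothetical location

bootstrap_project(project_root)  # register the project's settings and pipelines
with KedroSession.create(project_path=project_root) as session:
    session.run()  # executes the default pipeline
```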

Workflow 2: 📊 10% of users run their Kedro pipelines by packaging them, uploading the package to file storage (Databricks File System or Azure Blob Storage), and then running it as a Python package:
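Under that packaging workflow, once the wheel produced by `kedro package` is installed on the cluster, the run itself is typically just the packaged entry point. A minimal sketch, where the package name `my_project` is hypothetical:

```python
# Minimal sketch: running a packaged Kedro project on the cluster.
# `kedro package` builds a wheel in dist/; after installing it, the
# package exposes a main() entry point ("my_project" is hypothetical).
from my_project.__main__ import main

main()  # equivalent to `kedro run` for the packaged project
```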

Workflow 3: 📊 9% of users still use Databricks Connect from their IDE even though it will be sunset.

What are we going to do?

What are our high priorities?

What are our medium priorities?

What are our low priorities?

What have we already fixed?

What will we not be able to address?

yetudada commented 1 year ago

We've shipped our Databricks docs: https://docs.kedro.org/en/stable/deployment/databricks/index.html

We'll track adoption and would appreciate help with the Databricks plugin.

vitoravancini commented 7 months ago

Hello, I know this is a closed issue, but Databricks recently un-deprecated databricks-connect: it is back in the official documentation and new versions are being released.

Do you think this could change anything for the Kedro + Databricks approach? Databricks Connect was by far the best experience I've had developing Kedro pipelines with Databricks.

Thank you!

noklam commented 7 months ago

@vitoravancini https://kedro.org/blog/how-to-integrate-kedro-and-databricks-connect

It works quite well when you have a Spark-only pipeline; the blog post describes this in more detail.
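For reference, the newer Spark Connect-based databricks-connect reduces the client-side setup to something like the minimal sketch below (it assumes databricks-connect v13+ is installed and workspace authentication is already configured; nothing here is Kedro-specific):

```python
# Minimal sketch of the Spark Connect-based databricks-connect (v13+).
# Host, token, and cluster are resolved from the Databricks config/environment.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()  # remote SparkSession

# From here the session behaves like a local SparkSession, so a Spark-only
# Kedro pipeline can run against the cluster from a local IDE.
spark.range(5).show()
```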