Closed: yetudada closed this issue 1 year ago
We've shipped our Databricks docs: https://docs.kedro.org/en/stable/deployment/databricks/index.html
We'll track adoption and would appreciate help with the Databricks plugin.
Hello, I know this is a closed issue, but Databricks recently reversed the deprecation of databricks-connect. Databricks Connect is back in the official documentation and new versions are being released.
Do you think this could change anything for the Kedro + Databricks approach? Databricks Connect was by far the best experience I've had developing Kedro pipelines with Databricks.
Thank you!
@vitoravancini https://kedro.org/blog/how-to-integrate-kedro-and-databricks-connect
It works quite well when you have a Spark-only pipeline; the blog post describes it in more detail.
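For a flavour of what that setup can look like, here is a minimal sketch, not the blog post's exact code. It assumes `databricks-connect>=13` is installed and that the workspace, cluster and auth come from the standard Databricks configuration (environment variables such as `DATABRICKS_HOST`, `DATABRICKS_TOKEN`, `DATABRICKS_CLUSTER_ID`, or a `~/.databrickscfg` profile); the hook class name is illustrative.

```python
# Sketch: open a Databricks Connect (v13+) Spark session from a Kedro hook.
# Illustrative only; how the session is exposed to datasets is project-specific.
from databricks.connect import DatabricksSession
from kedro.framework.hooks import hook_impl


class DatabricksConnectHook:
    """Create a remote Spark session when the Kedro context is created."""

    @hook_impl
    def after_context_created(self, context) -> None:
        # getOrCreate() resolves the workspace, cluster and auth from the
        # environment or a Databricks config profile.
        self._spark = DatabricksSession.builder.getOrCreate()
```

The hook would then be registered in the project's `settings.py` via `HOOKS = (DatabricksConnectHook(),)`; the blog post remains the reference for the full setup.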
Introduction
This ticket closes #1653 and presents a summary of our findings and what we're going to do.
Why did we conduct this research?
Databricks is a machine-learning platform that is primarily used to run Spark-based workflows. We have a growing category of Kedro users that rely on Databricks to run their data and machine-learning pipelines on large datasets.
Databricks appears to be the dominant machine-learning (ML) platform among users who reported using an ML platform.
Adoption (poll results):
What did we want to find out?
All objectives are listed in #1653.
How did we conduct this research?
The research used qualitative (interviews) and quantitative (polls and a survey) methods across our open-source user base.
Participant count across the entire user base:
Insights
Key barriers to adoption of Kedro when Databricks is used
Several factors affect Kedro adoption when Databricks is in use:
User workflows while using an IDE with Databricks
Our users need to do a few things to be successful while developing their ML products on Databricks, using Kedro and an IDE.
Ranked in terms of priority, they need to be able to:
1. Have the latest version of their codebase to run their pipeline using Databricks
2. Run their Kedro pipelines using a Databricks cluster
Priority workflows
Have the latest version of their codebase to run their pipeline using Databricks
Workflow 1: 70% of our users will use an IDE to make changes and use `git` to pull their changes into the Databricks workspace by:
- Using `git` directly
- Using `git-python` in a notebook (seen in projects before Databricks Repos) for version-control support (a minimal sketch of this pattern follows this list)

Workflow 2: 9% of users will use `dbx sync` to keep their workspace synced to Databricks Repos:
- `dbx sync` only works from IDE -> Databricks Workspace, so when users create new notebooks in the Databricks Workspace those changes cause conflicts or are lost
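As an illustration of the `git-python` (GitPython) pattern above, this is a sketch of the kind of notebook cell users run to pull the latest project code before a run. The repository path is a placeholder, and it assumes GitPython is installed on the cluster and the repository was already cloned with credentials the cluster can reuse.

```python
# Hypothetical Databricks notebook cell: pull the latest project code with GitPython.
import git

REPO_PATH = "/dbfs/tmp/my-kedro-project"  # placeholder clone location, not a convention

repo = git.Repo(REPO_PATH)
repo.remotes.origin.pull()  # fetch and merge the tracked branch
print(f"Checked out commit {repo.head.commit.hexsha[:8]}")
```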
Run their Kedro pipelines using a Databricks cluster
Workflow 1: 81% of users will run their Kedro pipelines from a Databricks notebook once they have the latest version of their code:
- They run `%sh kedro run` in a Databricks notebook (a programmatic alternative is sketched after this list); they wanted to do this because:

Workflow 2: 10% of users run their Kedro pipelines by packaging them, uploading the file to file storage (Databricks File Storage or Azure Blob Storage) and then running it as a Python package:
- Configuration lives outside `src/`, and this caused users to copy the folder into `src/` or create CI/CD that would do that for them
- Some users use `dbx deploy` to do the same

Workflow 3: 9% of users still use Databricks Connect from their IDE even though it will be sunset
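To make Workflow 1 concrete, here is a minimal sketch of running a project from a Databricks notebook through the Python API rather than the `%sh kedro run` shell magic. The project path and the `databricks` environment name are placeholders; it assumes the latest code is already in the workspace (for example via Databricks Repos) and the project's dependencies are installed on the cluster.

```python
# Hypothetical notebook cell: run a Kedro pipeline via the Python API.
from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project

PROJECT_ROOT = Path("/Workspace/Repos/some-user/my-kedro-project")  # placeholder

bootstrap_project(PROJECT_ROOT)
with KedroSession.create(project_path=PROJECT_ROOT, env="databricks") as session:
    session.run()  # pass pipeline_name="..." to run a single pipeline
```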
What are we going to do?
What are our high priorities?
- Build tooling on top of `dbx` that will cover the two priority workflows
- Revisit the location of the `conf` directory in `src` as part of the Configuration Overhaul (#1908)

What are our medium priorities?
- `DeltaTableDataset` is not used at all and our users are building their own (a minimal sketch of such a dataset follows)
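For illustration, this is a minimal sketch of the kind of hand-rolled Delta dataset users described building. The class name and behaviour are assumptions; it assumes a Kedro version that exposes `AbstractDataset` in `kedro.io` (older releases call it `AbstractDataSet`) and a Spark runtime with Delta Lake available.

```python
# Sketch of a hand-rolled Delta table dataset (illustrative, not Kedro's own).
from kedro.io import AbstractDataset
from pyspark.sql import DataFrame, SparkSession


class HandRolledDeltaTableDataset(AbstractDataset):
    """Load/save a Spark DataFrame as a Delta table at a fixed path."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> DataFrame:
        spark = SparkSession.builder.getOrCreate()
        return spark.read.format("delta").load(self._filepath)

    def _save(self, data: DataFrame) -> None:
        data.write.format("delta").mode("overwrite").save(self._filepath)

    def _describe(self) -> dict:
        return {"filepath": self._filepath}
```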
What are our low priorities?
What have we already fixed?
What will we not be able to address?
- `dbx` seems to be a great start for supporting users, but some users mentioned the following workflow challenges:
  - It ignores `.gitignore` files and syncs everything
  - Uncertainty over whether files should be referenced as `/dbfs/<file_path>`, `dbfs://<file_path>` or `/mnt/dbfs/<file_path>` (illustrated in the sketch below)
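To illustrate the path ambiguity above, the same DBFS-backed file can be addressed in several ways depending on which API is used; all paths below are placeholders and the snippet assumes it runs on a Databricks cluster.

```python
# Illustration of the DBFS path forms users found confusing (placeholder paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark APIs take the dbfs URI scheme:
df = spark.read.csv("dbfs:/tmp/example/data.csv", header=True)

# Local-file APIs (open(), pandas, ...) go through the /dbfs FUSE mount:
with open("/dbfs/tmp/example/data.csv") as f:
    first_line = f.readline()

# Files on mounted external storage appear under /mnt and are reachable through
# either convention, which may be where the /mnt/dbfs confusion comes from:
df_mounted = spark.read.csv("dbfs:/mnt/raw/example/data.csv", header=True)
```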