Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
Description
We concluded a research item on how Kedro is being used on Databricks (#2105). This task recommends improvements to our "Deployment to a Databricks cluster" documentation.
Context
We will work on a Kedro-Databricks plugin at a later stage, but first we'll overhaul the documentation, because the research surfaced how much our users rely on it to get their work done. For now, we'll recommend dbx and Databricks Repos as the way to use Kedro on Databricks.
Possible Implementation
Our "Deployment to a Databricks cluster" documentation needs quite a bit of help in the following ways:
Include an introduction explaining why you would choose to use Kedro on Databricks
Recommend a workflow for syncing the latest version of code written in an IDE to the Databricks workspace; we should recommend Databricks Repos and dbx sync as the way to do this
Recommend a workflow for running pipelines on Databricks; we should recommend either the IPython extension (used through a Databricks notebook) or dbx deploy
Recommend a workflow for visualising a pipeline through a Databricks notebook (this section is already written; it just needs to be made more prominent)
Walk users through configuring dbx and Databricks Repos so that they can use this functionality
Medium-priority
Provide recommendations specific to Azure; our documentation is heavily based on AWS
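For the syncing item above, the docs could show something along these lines. This is a minimal sketch, assuming dbx is installed locally and already configured against a workspace that contains the target Databricks Repo. `build_dbx_sync_command`, `sync_to_databricks` and the repo name `my-kedro-project` are hypothetical names used only for illustration, and the `--dest-repo` and `--source` flags should be checked against the dbx version in use.

```python
import shlex
import subprocess
from pathlib import Path
from typing import List


def build_dbx_sync_command(dest_repo: str, source: Path = Path(".")) -> List[str]:
    """Build the argument list for `dbx sync repo` (flags are assumptions to verify)."""
    if not dest_repo:
        raise ValueError("dest_repo must be a non-empty Databricks Repo name")
    return ["dbx", "sync", "repo", "--dest-repo", dest_repo, "--source", str(source)]


def sync_to_databricks(dest_repo: str, source: Path = Path(".")) -> None:
    """Print and run the sync command from the local Kedro project root."""
    command = build_dbx_sync_command(dest_repo, source)
    print("Running:", shlex.join(command))
    # dbx sync may keep running and watching for local changes,
    # so this call can block until it is interrupted.
    subprocess.run(command, check=True)


# Hypothetical usage from the project root:
# sync_to_databricks("my-kedro-project")
```

Keeping the command construction separate from the subprocess call makes it easy to show users the exact command before anything is run.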
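For the item above on running pipelines from a Databricks notebook, a notebook cell might look roughly like this. It is a sketch under stated assumptions: it uses Kedro's `KedroSession` API directly rather than the IPython extension's magics, it assumes Kedro and the project's dependencies are installed on the cluster, and `run_kedro_pipeline` and the path in the usage comment are hypothetical.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def run_kedro_pipeline(project_root: str, pipeline_name: Optional[str] = None) -> Dict[str, Any]:
    """Run a Kedro pipeline from a project checked out on the cluster."""
    # Imported lazily so the function can be defined before Kedro is installed.
    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project

    root = Path(project_root)
    # Registers the project's settings and package so the session can find them.
    bootstrap_project(root)
    with KedroSession.create(project_path=root) as session:
        # pipeline_name=None runs the project's default pipeline.
        return session.run(pipeline_name=pipeline_name)


# Hypothetical usage in a notebook cell; the path depends on where the
# Databricks Repo is checked out:
# run_kedro_pipeline("/Workspace/Repos/<user>/<project>")
```

The interactive equivalent would load the Kedro IPython extension in the notebook and call `session.run()` there; the function form is shown because it is easier to reuse across cells.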