kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Add docs databricks asset bundles #3744

Closed erwinpaillacan closed 3 days ago

erwinpaillacan commented 4 months ago

Description

This pull request adds documentation to help set up a Kedro project using asset bundles on Databricks, since dbx is deprecated and no longer recommended.

https://www.databricks.com/blog/announcing-general-availability-databricks-asset-bundles

Development notes

make build-docs

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.


erwinpaillacan commented 4 months ago

@cilopezs is also helping!

astrojuanlu commented 4 months ago

Thank you @erwinpaillacan and @cilopezs for this PR!

This addresses #3360 in part. In your opinion, do you think it still makes sense to keep the DBX docs around? I was thinking that we should remove them, and replace them by what you did here.

erwinpaillacan commented 4 months ago

Yes. In our opinion, we could update the page https://docs.kedro.org/en/stable/deployment/databricks/databricks_ide_development_workflow.html to use Databricks Connect, which was deprecated for some time but is now live again and recommended for development: https://docs.databricks.com/en/dev-tools/databricks-connect/python/index.html

For deployment, we can rely on asset bundles, which is the main purpose of this PR.

I think we need to update the decision plot, right?

erwinpaillacan commented 4 months ago

Heads up! @astrojuanlu @cilopezs

  1. Development workflow updated to Databricks Connect: https://docs.databricks.com/en/dev-tools/databricks-connect/python/index.html
  2. Deployment workflow updated to asset bundles: https://docs.databricks.com/en/dev-tools/bundles/index.html

We are removing dbx.

astrojuanlu commented 4 months ago

The docs errors are legitimate; please address them so RTD can render the new docs 👍🏽

astrojuanlu commented 4 months ago

Rendered docs for reviewers:

astrojuanlu commented 1 month ago

Erwin confirmed internally that this PR will need an update to work on Kedro 0.19 👍🏼

ankatiyar commented 1 month ago

Hey @erwinpaillacan, can I help you update this PR to work with Kedro 0.19.x?

erwinpaillacan commented 1 month ago

@astrojuanlu I tested today with Kedro 0.19.6 and everything works, except that I had to change the Kedro dataset dependencies manually.

slack

JenspederM commented 1 month ago

Hey!

I'm a bit late to the party, but I just wanted to let you know that I have previously made a databricks bundle template to illustrate how one could get started with Kedro on databricks.

I'm in the process of converting the logic introduced in the template into a kedro plugin - see more here.

I think the plugin would be very helpful, as it makes it easier to deploy existing projects to Databricks, whereas both my template and the databricks-iris starter are only relevant for new projects.

Please note that the plugin is still in early development, so if you have any suggestions to align with your vision please let me know!

erwinpaillacan commented 1 month ago

Looks great!! Just one question: does this plugin map n nodes to n tasks, or n pipelines to n tasks?

JenspederM commented 1 month ago

@erwinpaillacan it maps pipelines to workflows with nodes as tasks.

That is to say, I did my best to mimic the view of Kedro-viz in the workflow tab of the Databricks UI
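To make the mapping concrete, here is a minimal sketch of the idea, not the plugin's actual code. The `Node` class and the `pipeline_to_job` helper are hypothetical stand-ins written for illustration; only the `task_key` and `depends_on` field names are taken from the Databricks Jobs task format. Each node becomes one task, and a task depends on whichever tasks produce its input datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a Kedro node: a name plus dataset names."""

    name: str
    inputs: tuple = ()
    outputs: tuple = ()


def pipeline_to_job(pipeline_name, nodes):
    """Build a job-like dict with one task per node (illustrative helper)."""
    # Record which node produces each dataset.
    producer = {out: node.name for node in nodes for out in node.outputs}
    tasks = []
    for node in nodes:
        # A node depends on every node that produces one of its inputs.
        upstream = sorted({producer[i] for i in node.inputs if i in producer})
        task = {"task_key": node.name}
        if upstream:
            task["depends_on"] = [{"task_key": key} for key in upstream]
        tasks.append(task)
    return {"name": pipeline_name, "tasks": tasks}


# Example pipeline: split the data, train a model, then evaluate it.
nodes = [
    Node("split", inputs=("iris",), outputs=("train", "test")),
    Node("train_model", inputs=("train",), outputs=("model",)),
    Node("evaluate", inputs=("model", "test"), outputs=("metrics",)),
]
job = pipeline_to_job("data_science", nodes)
```

In this sketch, `evaluate` ends up depending on both `split` and `train_model`, which mirrors the edges Kedro-Viz would draw between those nodes.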

JenspederM commented 1 month ago

I just updated the readme to shed some light on the functionality that I'm intending to implement. I say intend as the 'deploy' command isn't ready yet.

I also published it as 'kedro-databricks-dev' as the other name is already taken by an empty project. I will reach out to the author of the other project so that we can hopefully get a sensible name for the package 😊

astrojuanlu commented 1 month ago

cc @em-pe :)

astrojuanlu commented 2 weeks ago

I have to say I did some experiments with @JenspederM using the databricks-iris starter and it mostly worked! Maybe we could instruct users to use that instead of writing the bundle configs by hand.

Beyond this point, I leave it in the hands of @ankatiyar, who will be looking at this soon 😄

em-pe commented 2 weeks ago

@JenspederM I'm happy to pass kedro-databricks to you; just let me know your PyPI username.

JenspederM commented 2 weeks ago

I see you found it without my help. But thank you for transferring the project, @em-pe!

I have now published the first release to kedro-databricks. This release solved the most obvious issues found by @astrojuanlu. I'm just finishing up the databricks_run script after which all immediate issues should have been addressed.

I will make an announcement on Slack as soon as I have a working example with the databricks-iris starter :)