Closed jmholzer closed 1 year ago
To test this:
kedro new --starter git+https://github.com/kedro-org/kedro-starters.git --directory databricks-iris --checkout feat/modify-pyspark-iris-databricks-packaged-deployment
Thanks for figuring this out @astrojuanlu!
Motivation and Context
The guide on deploying packaged projects to Databricks proposed in https://github.com/kedro-org/kedro/pull/2595 uses the `databricks-iris` starter. This PR adds that starter. The `databricks-iris` starter is a duplicate of the `pyspark-iris` starter with a few changes:

- `databricks_run.py`: a module for running the project on Databricks, since Click prevents projects from being run with the default entry point on Databricks.
- Logs are written to DBFS (`conf/base/logging.yml`).
- Datasets in `conf/base/catalog.yml` are saved in `/dbfs/FileStore`.
This PR has a large diff because it is a brand-new starter; only the following files have been changed from `pyspark-iris`:

- `{{ cookiecutter.repo_name }}/src/setup.py`: contains an entry point definition, `databricks_run`.
- `{{ cookiecutter.repo_name }}/src/{{ cookiecutter.python_package }}/databricks_run.py`: contains a script needed to run a packaged Kedro project on Databricks.
- `{{ cookiecutter.repo_name }}/src/conf/base/logging.yml`: config for writing logs to DBFS.
- `{{ cookiecutter.repo_name }}/src/conf/base/catalog.yml`: points to datasets on DBFS.

How has this been tested?
Manually on Databricks in conjunction with the new guide.
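For reference, the `console_scripts` entry point that the starter's `setup.py` defines can be sketched as below. This is an illustrative fragment, not the starter's actual file: `PACKAGE_NAME` stands in for the templated `{{ cookiecutter.python_package }}` value.

```python
# Illustrative sketch of the entry point definition in the starter's setup.py.
# PACKAGE_NAME is a placeholder for the templated package name.
PACKAGE_NAME = "my_project"

# In the starter, a dict like this is passed as `entry_points=...` to
# setuptools.setup(), so the built wheel exposes a `databricks_run` command
# that bypasses Kedro's default Click-based entry point.
ENTRY_POINTS = {
    "console_scripts": [
        f"databricks_run = {PACKAGE_NAME}.databricks_run:main",
    ]
}
```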
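And a minimal sketch of what the `databricks_run.py` module might contain. The flag names here are assumptions for illustration, and the sketch further assumes that `KedroSession.create` accepts `env` and `conf_source` arguments (true in recent Kedro versions); the starter's actual module may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags a Databricks wheel task might pass (hypothetical names)."""
    parser = argparse.ArgumentParser(
        description="Run a packaged Kedro project on Databricks"
    )
    parser.add_argument("--env", default=None, help="Kedro configuration environment")
    parser.add_argument(
        "--conf-source", default=None, help="Path to project configuration, e.g. on DBFS"
    )
    parser.add_argument(
        "--package-name", required=True, help="Name of the packaged Kedro project"
    )
    return parser.parse_args(argv)


def main():
    # The `databricks_run` console script defined in setup.py targets this
    # function, so the project can be run without Kedro's Click-based CLI.
    args = parse_args()
    # Kedro imports are deferred so parse_args stays importable without Kedro.
    from kedro.framework.project import configure_project
    from kedro.framework.session import KedroSession

    configure_project(args.package_name)
    with KedroSession.create(env=args.env, conf_source=args.conf_source) as session:
        session.run()
```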
Checklist