kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Investigate options for Starter restructure #2505

Closed amandakys closed 1 year ago

amandakys commented 1 year ago

Description

Our current approach to starters is not very cohesive and has no clear strategy. Investigate ways to change this to help improve adoption numbers. This builds on Concept 2 from the modular Kedro motivation work described in #2388.

Concept 2: Improve starter journey to increase accessibility of Kedro

This ticket describes an alternative approach to starters that is complementary to our Kedro Utilities proposal. During the Kedro utilities work, it became clear that the proposed utilities and our starters had some similarities. When we broke down the structure of the existing starters, some patterns and inconsistencies started to emerge.

Furthermore, the current list of starters felt disparate and broad in its goals. There was a general leaning towards showcasing how Kedro could integrate with other libraries, such as pyspark and astro-airflow, through the use of an ‘example starter project’.

Possible Implementation

I propose that we combine our current concept of starters with our new utility modules workflow. At project creation, users will be asked to choose from different components that they want to add to their project.

Continuing with the theme of a simplified project starter with Add-Ons, every resultant project would start from the same basic template. Building on this, if our team chooses to enforce a more consistent way to provide ‘example code’ (i.e. default node and pipeline code and a consistent test directory), this would also improve our users’ ability to mix and match examples.

Technical Details: cookiecutter allows you to initialise and add code based on booleans. This feature should enable us to adapt the ‘basic’ template based on a set of flags provided by the user on project creation.
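As a rough illustration of the flag-driven idea, here is a minimal Python sketch of the kind of logic a cookiecutter post-generation hook could run: start from the full ‘basic’ template and delete the files that belong to add-ons the user did not select. The `ADD_ON_PATHS` mapping, the flag names and the function name are all hypothetical and chosen only for this example; they are not part of Kedro or cookiecutter.

```python
import shutil
from pathlib import Path

# Hypothetical mapping from an add-on flag to the template paths it owns.
# These names are assumptions for illustration, not Kedro's actual layout.
ADD_ON_PATHS = {
    "linting": [".flake8", ".isort.cfg"],
    "testing": ["tests"],
    "documentation": ["docs"],
}


def prune_disabled_add_ons(project_dir: Path, flags: dict[str, bool]) -> list[str]:
    """Delete files and directories of add-ons whose flag is False or missing.

    Returns the relative paths that were actually removed.
    """
    removed = []
    for add_on, rel_paths in ADD_ON_PATHS.items():
        if flags.get(add_on, False):
            continue  # add-on was selected, keep its files
        for rel in rel_paths:
            target = project_dir / rel
            if target.is_dir():
                shutil.rmtree(target)
            elif target.exists():
                target.unlink()
            else:
                continue  # nothing to remove for this path
            removed.append(rel)
    return removed
```

In a real template the flags would come from the answers given at project creation. The same effect could also be achieved with conditionals inside the template files themselves, so this is only one possible shape for the mechanism.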

Integration Add-Ons

Goal: allow Kedro to support third-party libraries

Example Projects

Goal: showcase Kedro features, as a team we show others how to use Kedro

Initial Prototype (WIP)

Project Add-Ons 
================
Here you can select which add-ons you'd like to include.
Don't worry if you change your mind, you can always add/remove these later.
To read more about these utilities and what they do visit: kedro.org/

Add-Ons
1) Linting:        Provides linting set up with Flake8, Black and isort
2) Testing:        Provides testing set up with pytest
3) Logging:        Provides more logging options
4) Documentation:  Provides documentation setup with Sphinx
5) Databricks:     Provides set up for working with Databricks
6) PySpark:        Provides set up configuration for working with PySpark
7) Airflow:        Provides minimal setup to deploy a pipeline to Airflow using Astronomer
8) Kedro-Viz:      Provides Kedro's native visualisation tool
     8a) Plotly:               Provides interactive pipeline visualisations
     8b) Experiment-Tracking:  Sets up experiment tracking, to compare runs

Which add-ons would you like to include in your project? [1-4/all/1,3/none]: 

Would you like to include an example pipeline? [y/n]:
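The prompt above accepts answers in the forms `1-4`, `all`, `1,3` and `none`. A minimal sketch of how such an answer could be turned into a list of add-on numbers follows. The function name and the default of 8 top-level add-ons are assumptions taken from the prototype menu, and the sub-options 8a/8b are deliberately not handled, since their flow is still open.

```python
def parse_add_on_selection(selection: str, total: int = 8) -> list[int]:
    """Parse answers such as '1-4', 'all', '1,3' or 'none' into add-on numbers.

    Raises ValueError for anything that is not a number or range within 1..total.
    """
    text = selection.strip().lower()
    if text == "none":
        return []
    if text == "all":
        return list(range(1, total + 1))

    chosen: set[int] = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "1-4" (inclusive on both ends)
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            # A single number such as "3"
            start = end = int(part)
        if not (1 <= start <= end <= total):
            raise ValueError(f"Selection out of range: {part!r}")
        chosen.update(range(start, end + 1))
    return sorted(chosen)
```

Returning a sorted, de-duplicated list means overlapping input such as `1-3,2` yields each add-on once, which keeps any later flag lookup simple.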

Note: The flow for kedro-viz as an add-on needs further work and I am working with @NeroOkwa and the Viz team on this.

Beyond the Add-Ons

Design Next Steps

yetudada commented 1 year ago

Here is a list of officially supported starters. From the list, we'll see:

yetudada commented 1 year ago

I'll close this because #2838 exists 🥇