kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Spike: how to handle overlap in example spaceflights projects? #2874

Closed by merelcht 1 year ago

merelcht commented 1 year ago

Description

Follow up on https://github.com/kedro-org/kedro/issues/2758 and https://github.com/kedro-org/kedro/issues/2838

We should look at how to deal with the overlap between the spaceflights projects. Can we somehow combine them to lessen the maintenance burden?

Context

In https://github.com/kedro-org/kedro/issues/2838 we'll add several new spaceflights-based projects that will also serve as the examples a user can add to a project at creation time with the new utilities flow. These examples will all likely share similar files, so the question is: do we need to keep each one as a complete project, or can we somehow combine them and still serve the purpose of providing users with different examples?

Possible Implementation

The aim of this spike is to come up with possible implementations for merged examples.

merelcht commented 1 year ago

Useful links:

merelcht commented 1 year ago

The different starters we need are:

I've mapped out the differences between these various projects: a `+` prefix (green highlighting in the original) means a change is required in the file, and a ⭐️ indicates a new file that needs to be added.

Spaceflights Pandas -> Spaceflights Pyspark

├── conf
│   ├── base
+   │   ├── catalog.yml
+   │   ├── spark.yml ⭐️
│   │   ├── parameters.yml
│   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
│   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   ├── __init__.py
│   │   ├── main.py
+   │   ├── hooks.py ⭐️
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pandas -> Spaceflights Databricks

├── conf
│   ├── base
+   │   ├── catalog.yml
+   │   ├── spark.yml ⭐️
│   │   ├── parameters.yml
+   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
│   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   ├── __init__.py
│   │   ├── main.py
+   │   ├── databricks_run.py ⭐️
+   │   ├── hooks.py ⭐️
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pyspark -> Spaceflights Databricks

├── conf
│   ├── base
+   │   ├── catalog.yml
│   │   ├── spark.yml 
│   │   ├── parameters.yml
+   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
│   │   │   │   ├── nodes.py
│   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
│   │   │   │   ├── nodes.py
│   │   │   │   ├── pipeline.py
│   │   ├── __init__.py
│   │   ├── main.py
+   │   ├── databricks_run.py ⭐️
│   │   ├── hooks.py 
│   │   ├── pipeline_registry.py
│   │   ├── settings.py
│   ├── tests
│   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pandas -> Spaceflights Pandas Viz

Viz features added: experiment tracking, plotting with Plotly, and plotting with Matplotlib

├── conf
│   ├── base
+   │   ├── catalog.yml
│   │   ├── parameters.yml
│   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
+   │   │   ├── reporting ⭐️
+   │   │   │   ├── nodes.py ⭐️
+   │   │   │   ├── pipeline.py ⭐️
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml

Spaceflights Pyspark -> Spaceflights Pyspark Viz

├── conf
│   ├── base
+   │   ├── catalog.yml
│   │   ├── spark.yml 
│   │   ├── parameters.yml
│   │   ├── logging.yml
│   ├── local
├── data
├── docs
├── notebooks
├── src
│   ├── spaceflights
│   │   ├── pipelines
│   │   │   ├── data_processing
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
│   │   │   ├── data_science
+   │   │   │   ├── nodes.py
+   │   │   │   ├── pipeline.py
+   │   │   ├── reporting ⭐️
+   │   │   │   ├── nodes.py ⭐️
+   │   │   │   ├── pipeline.py ⭐️
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── hooks.py 
│   │   ├── pipeline_registry.py
+   │   ├── settings.py
│   ├── tests
+   ├── requirements.txt
│   ├── setup.py
└── pyproject.toml
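The mapping above was done by hand, but a file-level diff between two starters can also be reproduced programmatically by comparing their file listings. A minimal sketch (the listings below are toy data illustrating the Pandas -> Pyspark comparison, not the full trees; detecting *changed* shared files would additionally need content comparison, e.g. hashes):

```python
def tree_diff(base_files: set[str], variant_files: set[str]) -> dict[str, set[str]]:
    """Compare two starters' file listings (paths relative to the project root)."""
    return {
        "added": variant_files - base_files,    # new files, the ⭐️ entries
        "removed": base_files - variant_files,
        "shared": base_files & variant_files,   # candidates for changed/unchanged
    }

# Toy listings mirroring part of the Pandas -> Pyspark table above
pandas = {"conf/base/catalog.yml", "conf/base/parameters.yml", "src/spaceflights/settings.py"}
pyspark = pandas | {"conf/base/spark.yml", "src/spaceflights/hooks.py"}

diff = tree_diff(pandas, pyspark)
print(sorted(diff["added"]))  # ['conf/base/spark.yml', 'src/spaceflights/hooks.py']
```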

Based on the above, the only obvious candidates for merging are the Pyspark and Databricks examples. The other combinations require a lot of changes, and any reduction in the maintenance burden for the starters would be offset by added complexity in the logic for pulling the correct examples into a user's project in the `kedro new` selection flow.

amandakys commented 1 year ago

Based on these findings, would you recommend merging the Pyspark and Databricks examples then? What projects would we expose to users in the new starters repo?

When we last discussed this, there was a difference between how it all worked "behind the scenes" and what it would look like to the user, i.e. there's no need for the user to know that projects were constructed using cookiecutter, if that's what we chose.

How are we managing files that are the same across starters? Maybe we can have an internal starter template so that in future all starters share the same core, avoiding the problem we had before.

merelcht commented 1 year ago

> Based on these findings, would you recommend merging the Pyspark and Databricks examples then?

Yes, that one is easy to merge and also de-duplicate.

> What projects would we expose to users in the new starters repo?

  1. "Vanilla" spaceflights. This is the existing spaceflights project based on pandas.
  2. Spaceflights with Viz features
  3. Pyspark spaceflights, which also includes Databricks setup.
  4. Pyspark spaceflights with Viz features

> When we last discussed this, there was a difference between how it all worked "behind the scenes" and what it would look like to the user, i.e. no need for the user to know that projects were constructed using cookiecutter, if that's what we chose. How are we managing files that are the same across starters? Maybe we can have an internal starter template so in future we ensure starters all have the same core, avoiding the problem we had before.

I think that's basically the "vanilla" spaceflights. If we ever create a new starter, it should just be based on that.
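The shared-core idea could be realised by generating each starter from a common base plus overlays, so files that are identical across starters live in exactly one place. A minimal sketch using plain directory overlays rather than the actual cookiecutter machinery (all paths and file contents here are hypothetical, for illustration only):

```python
import shutil
import tempfile
from pathlib import Path

def build_starter(core: Path, addons: list[Path], dest: Path) -> None:
    """Copy the shared core, then layer each add-on on top; later files win on conflict."""
    shutil.copytree(core, dest, dirs_exist_ok=True)
    for addon in addons:
        shutil.copytree(addon, dest, dirs_exist_ok=True)

# Demo with throwaway directories (not the real starters repo layout)
root = Path(tempfile.mkdtemp())
core = root / "core"
(core / "conf").mkdir(parents=True)
(core / "conf" / "catalog.yml").write_text("# pandas catalog\n")

spark_addon = root / "pyspark_addon"
(spark_addon / "conf").mkdir(parents=True)
(spark_addon / "conf" / "spark.yml").write_text("# spark config\n")
(spark_addon / "conf" / "catalog.yml").write_text("# spark catalog\n")  # overrides core

out = root / "spaceflights-pyspark"
build_starter(core, [spark_addon], out)
print(sorted(p.name for p in (out / "conf").iterdir()))  # ['catalog.yml', 'spark.yml']
```

The trade-off is the one noted above: the overlay logic itself becomes something to maintain and test, which is why merging only the closely related Pyspark and Databricks starters may be the better bargain.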

merelcht commented 1 year ago

Conclusion

My recommendation is to merge only the Pyspark and Databricks starters and keep the rest separate. This means we need to create:

  1. spaceflights based on pandas (the existing spaceflights starter)
  2. spaceflights based on pyspark + databricks
  3. spaceflights based on pandas with viz features enabled
  4. spaceflights based on pyspark with viz features enabled
  5. spaceflights with Airflow setup (maybe)