I've mapped out the differences between the various starter projects we need below. A + (green highlighting in the original) marks a file that requires changes, and a ⭐️ marks a new file that needs to be added.
PySpark starter, compared to the base spaceflights project:
├── conf
│ ├── base
+ │ │ ├── catalog.yml
+ │ │ ├── spark.yml ⭐️
│ │ ├── parameters.yml
│ │ ├── logging.yml
│ ├── local
├── data
├── docs
├── notebooks
├── src
│ ├── spaceflights
│ │ ├── pipelines
│ │ │ ├── data_processing
+ │ │ │ │ ├── nodes.py
+ │ │ │ │ ├── pipeline.py
│ │ │ ├── data_science
│ │ │ │ ├── nodes.py
+ │ │ │ │ ├── pipeline.py
│ │ ├── __init__.py
│ │ ├── __main__.py
+ │ │ ├── hooks.py ⭐️
│ │ ├── pipeline_registry.py
+ │ │ ├── settings.py
│ ├── tests
+ │ ├── requirements.txt
│ ├── setup.py
└── pyproject.toml
Databricks starter, compared to the base spaceflights project:
├── conf
│ ├── base
+ │ │ ├── catalog.yml
+ │ │ ├── spark.yml ⭐️
│ │ ├── parameters.yml
+ │ │ ├── logging.yml
│ ├── local
├── data
├── docs
├── notebooks
├── src
│ ├── spaceflights
│ │ ├── pipelines
│ │ │ ├── data_processing
+ │ │ │ │ ├── nodes.py
+ │ │ │ │ ├── pipeline.py
│ │ │ ├── data_science
│ │ │ │ ├── nodes.py
+ │ │ │ │ ├── pipeline.py
│ │ ├── __init__.py
│ │ ├── __main__.py
+ │ │ ├── databricks_run.py ⭐️
+ │ │ ├── hooks.py ⭐️
│ │ ├── pipeline_registry.py
+ │ │ ├── settings.py
│ ├── tests
+ │ ├── requirements.txt
│ ├── setup.py
└── pyproject.toml
Databricks starter, compared to the PySpark starter:
├── conf
│ ├── base
+ │ │ ├── catalog.yml
│ │ ├── spark.yml
│ │ ├── parameters.yml
+ │ │ ├── logging.yml
│ ├── local
├── data
├── docs
├── notebooks
├── src
│ ├── spaceflights
│ │ ├── pipelines
│ │ │ ├── data_processing
│ │ │ │ ├── nodes.py
│ │ │ │ ├── pipeline.py
│ │ │ ├── data_science
│ │ │ │ ├── nodes.py
│ │ │ │ ├── pipeline.py
│ │ ├── __init__.py
│ │ ├── __main__.py
+ │ │ ├── databricks_run.py ⭐️
│ │ ├── hooks.py
│ │ ├── pipeline_registry.py
│ │ ├── settings.py
│ ├── tests
│ ├── requirements.txt
│ ├── setup.py
└── pyproject.toml
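As the diff above shows, databricks_run.py is the only genuinely new file separating the Databricks starter from the PySpark one. For illustration, here is a minimal sketch of what such a job entry point might look like, assuming Kedro's standard bootstrap/session API; the actual file in the starter may differ:

```python
# databricks_run.py - hypothetical sketch of a Databricks job entry point.
# Assumes Kedro's bootstrap/session API; the real starter file may differ.
import argparse
import logging

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", default="base", help="Kedro configuration environment")
    parser.add_argument("--package-name", default="spaceflights")
    args = parser.parse_args()

    # Quieten the noisy py4j logging Spark produces on Databricks.
    logging.getLogger("py4j").setLevel(logging.ERROR)

    # Point Kedro at the project root (upload location on DBFS is assumed here),
    # then run the default pipeline inside a managed session.
    project_path = f"/dbfs/{args.package_name}"  # assumed upload location
    bootstrap_project(project_path)
    with KedroSession.create(project_path=project_path, env=args.env) as session:
        session.run()


if __name__ == "__main__":
    main()
```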
Viz features added: experiment tracking, plotting with Plotly, and plotting with Matplotlib
Pandas + viz starter, compared to the base spaceflights project:
├── conf
│ ├── base
+ │ │ ├── catalog.yml
│ │ ├── parameters.yml
│ │ ├── logging.yml
│ ├── local
├── data
├── docs
├── notebooks
├── src
│ ├── spaceflights
│ │ ├── pipelines
│ │ │ ├── data_processing
+ │ │ │ │ ├── nodes.py
+ │ │ │ │ ├── pipeline.py
│ │ │ ├── data_science
+ │ │ │ │ ├── nodes.py
+ │ │ │ │ ├── pipeline.py
+ │ │ │ ├── reporting ⭐️
+ │ │ │ │ ├── nodes.py ⭐️
+ │ │ │ │ ├── pipeline.py ⭐️
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── pipeline_registry.py
+ │ │ ├── settings.py
│ ├── tests
+ │ ├── requirements.txt
│ ├── setup.py
└── pyproject.toml
PySpark + viz starter, compared to the PySpark starter:
├── conf
│ ├── base
+ │ │ ├── catalog.yml
│ │ ├── spark.yml
│ │ ├── parameters.yml
│ │ ├── logging.yml
│ ├── local
├── data
├── docs
├── notebooks
├── src
│ ├── spaceflights
│ │ ├── pipelines
│ │ │ ├── data_processing
+ │ │ │ │ ├── nodes.py
+ │ │ │ │ ├── pipeline.py
│ │ │ ├── data_science
+ │ │ │ │ ├── nodes.py
+ │ │ │ │ ├── pipeline.py
+ │ │ │ ├── reporting ⭐️
+ │ │ │ │ ├── nodes.py ⭐️
+ │ │ │ │ ├── pipeline.py ⭐️
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── hooks.py
│ │ ├── pipeline_registry.py
+ │ │ ├── settings.py
│ ├── tests
+ │ ├── requirements.txt
│ ├── setup.py
└── pyproject.toml
Based on the above, the only obvious merge I see is the PySpark and Databricks examples. The other combinations require a lot of changes, and any reduction in the maintenance burden for the starters would be offset by added complexity in the logic for pulling the correct example into a user's project in the kedro new selection flow.
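To make that complexity concrete: a merged template would need generation-time logic to strip out the files a chosen flavour doesn't use, e.g. a cookiecutter post-gen hook along these lines. This is a sketch only; the example_flavour variable and the file lists are assumptions, not an agreed design:

```python
# hooks/post_gen_project.py - hypothetical cookiecutter post-gen hook for a
# merged starter. The "example_flavour" variable and file lists are assumptions.
from pathlib import Path

FLAVOUR = "{{ cookiecutter.example_flavour }}"  # e.g. "pandas", "pyspark", "databricks"

# Files that only exist in some flavours of the merged template.
FLAVOUR_ONLY_FILES = {
    "conf/base/spark.yml": {"pyspark", "databricks"},
    "src/spaceflights/hooks.py": {"pyspark", "databricks"},
    "src/spaceflights/databricks_run.py": {"databricks"},
}

for relative_path, flavours in FLAVOUR_ONLY_FILES.items():
    if FLAVOUR not in flavours:
        # cookiecutter runs this hook from the generated project root.
        Path(relative_path).unlink(missing_ok=True)
```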
Based on these findings, would you recommend merging the PySpark and Databricks examples then? What projects would we expose to users in the new starters repo?
When we last discussed this, there was a difference between how it all worked "behind the scenes" and what it would look like to the user, i.e. no need for the user to know that projects were constructed using cookiecutter, if that's what we chose.
How are we managing files that are the same across starters? Maybe we can have an internal starter template, so that in future all starters share the same core, avoiding the problem we had before.
> Based on these findings, would you recommend merging the PySpark and Databricks examples then?
Yes, that one is easy to merge and also de-duplicate.
> What projects would we expose to users in the new starters repo?
> When we last discussed this, there was a difference between how it all worked "behind the scenes" and what it would look like to the user, i.e. no need for the user to know that projects were constructed using cookiecutter. How are we managing files that are the same across starters? Maybe we can have an internal starter template, so that in future all starters share the same core, avoiding the problem we had before.
I think that's basically the "vanilla" spaceflights. If we ever create a new starter, it should just be based on that.
My recommendation is to merge only the PySpark and Databricks starters and keep the rest separate. This means we need to create:
Description
Follow up on https://github.com/kedro-org/kedro/issues/2758 and https://github.com/kedro-org/kedro/issues/2838
We should look at how we deal with the overlaps between the spaceflights projects. Can we somehow combine them to lessen the maintenance burden?
Context
In https://github.com/kedro-org/kedro/issues/2838 we'll add several new spaceflights-based projects that will also serve as the examples a user can add to a project at creation with the new utilities flow. These examples will all likely share similar files, so the question is: do we need each one as a complete project, or can we somehow combine them and still provide users with distinct examples?
Possible Implementation
The aim of this spike is to come up with possible implementations for merged examples.
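One possible direction (a sketch, not a decided design): keep a single base spaceflights project, store each starter as an overlay containing only the files that differ from that shared core, and merge the two at generation time. The helper below and its names are hypothetical:

```python
# Hypothetical sketch of an "overlay" approach for merged examples:
# each starter stores only the files that differ from a shared base project,
# and project generation copies the base first, then the overlay on top.
import shutil
from pathlib import Path


def generate_project(base_dir: Path, overlay_dir: Path, destination: Path) -> None:
    # Copy the shared base spaceflights project...
    shutil.copytree(base_dir, destination)
    # ...then copy the starter-specific files over it, overwriting duplicates.
    shutil.copytree(overlay_dir, destination, dirs_exist_ok=True)


# e.g. generate_project(Path("base"), Path("overlays/pyspark"), Path("my-project"))
```

This would keep each example's footprint as small as the diffs above suggest, at the cost of an extra assembly step at project-creation time.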