kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.48k stars 874 forks

Kedro Stack #2053

Closed merelcht closed 6 months ago

merelcht commented 1 year ago

Description

A go-to Kedro Stack

Implementation idea

Questions

yetudada commented 1 year ago

One way of doing this could be to build the stack on top of the MLOps stack.

noklam commented 1 year ago

I think we need to discuss what the scope of this is. Oftentimes, when people talk about end-to-end ML, it's vague.

The MLOps stack covers these components:

  • Experiment Tracking (development)
  • Data Versioning (development & deployment)
  • Code Versioning
  • Pipeline orchestration
  • Runtime engine
  • Artifact tracking
  • Model Serving
  • Model Monitoring
  • Data monitoring/validation (Great Expectations or something else) - this isn't covered by the MLOps stack

When I think about the stack, I am thinking of something with minimal scope. Obviously, you still need Git and some monitoring service, but those aren't included in the MEAN/LAMP stack. The same goes for the ML stack: what's the real minimal stack we need?

I think the more important missing parts might be serving and an artifact store. Something like Great Expectations for data validation would be a plus, but I don't think it is strictly necessary.
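
As a rough sketch of that scoping exercise, the component list above can be written down and checked against whatever a candidate stack already covers. The function name `missing_components`, the `optional` parameter, and the example coverage are all hypothetical and only illustrate the reasoning; they say nothing about what Kedro or any other tool actually provides.

```python
# Components from the list above. The last entry is the one that
# the MLOps stack does not cover.
COMPONENTS = [
    "Experiment Tracking",
    "Data Versioning",
    "Code Versioning",
    "Pipeline orchestration",
    "Runtime engine",
    "Artifact tracking",
    "Model Serving",
    "Model Monitoring",
    "Data monitoring/validation",
]


def missing_components(covered, optional=(), required=COMPONENTS):
    """Return required components that are neither covered nor optional.

    The result keeps the order of `required`.
    """
    skip = set(covered) | set(optional)
    return [name for name in required if name not in skip]


# Hypothetical coverage, for illustration only.
example_covered = [
    "Experiment Tracking",
    "Data Versioning",
    "Code Versioning",
    "Pipeline orchestration",
    "Runtime engine",
    "Model Monitoring",
]

# Treat data validation as a nice-to-have rather than a requirement.
gaps = missing_components(
    example_covered, optional={"Data monitoring/validation"}
)
print(gaps)
```

With this assumed coverage, the remaining gaps are artifact tracking and model serving, which is the kind of answer a minimal-scope discussion would need to agree on first.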

astrojuanlu commented 11 months ago

Some insights about Airflow's dominance https://www.linkedin.com/posts/hugo-lu-confirmed_dataorchestration-dataengineering-dataengineers-activity-7094595004576227328-slSU

Pandera + Airflow + Kedro = PAK? 😄

noklam commented 11 months ago

Kedro needs to be in the middle 😀

astrojuanlu commented 8 months ago

Another idea for a Kedro stack: https://linen-slack.kedro.org/t/16014653/hello-very-much-new-to-the-ml-world-i-m-trying-to-setup-a-fr#6546163c-e141-4c07-ae28-71bf31dd25b7

  • kedro for creating training pipelines and overall project structure
  • mlflow for experiment tracking and model registry
  • dvc for dataset versioning
  • TensorFlow as the machine learning framework
  • RayTune for hyperparameter tuning

astrojuanlu commented 8 months ago

For reference, a Spark-centric, fully open source, Kedro-based stack using mymlops.com

image

merelcht commented 6 months ago

Closing as this isn't a priority for now.