Motivation
Currently, our models are trained offline, and we deploy them directly as a prediction service. This manual process, as described below, comes with many challenges and doesn't help us with:
Maintaining code reusability
Managing models and retraining them frequently
Ensuring that model training performance is reproducible in production
Allowing team members and contributors to experiment, incorporate the latest methods, and iterate quickly, while keeping a clear target of a production-ready forecasting pipeline
Implementing and automating retraining strategies and model deployment
There are other challenges, such as model testing and monitoring, but let's deal with them in another issue 😊
Why is it important for us:
Any change to the training dataset, processing pipeline, or modeling approach that improves performance should lead to rerunning our training pipeline
With the project being open source, we need to find a way to scale collaboration between multiple contributors and ensure that no contribution gets lost!
All of that while ensuring:
Training reproducibility
Machine learning system traceability (code + data + config)
Consistency between training and prediction
🚀 Feature
Orchestrating and automating our training pipeline following the workflow described below
Design Proposal
Let's adopt a lightweight version of the pipeline described above:
Public Dataset Registry or Feature Store: Gdrive or Bucket
Private Model Registry, Metadata Registry, Prediction Registry: Gdrive or Bucket (see the access sketch after this list)
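As a rough sketch of what the dataset registry side could look like if we go with DVC (see Alternatives below), a contributor could read a versioned dataset directly from the Gdrive or bucket remote through DVC's Python API. The file path, repository URL, and tag below are hypothetical placeholders, not our actual layout:

```python
# Minimal sketch: read a dataset version from the public dataset registry
# (a DVC remote backed by Gdrive or a bucket). DVC resolves the file from
# whichever remote is configured in the repository.
import dvc.api
import pandas as pd

with dvc.api.open(
    "data/processed/train.csv",              # hypothetical tracked path
    repo="https://github.com/<org>/<repo>",  # hypothetical repository URL
    rev="v1.0",                              # any Git tag, branch, or commit
) as f:
    train_df = pd.read_csv(f)

print(train_df.shape)
```

Because the dataset version is just a Git revision, the same call reproduces the exact data a given model was trained on.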
To keep things tidy, I think we should adopt the following layers and package structure:
DAGs definition layer: operator scripts outside of the package (a sketch of such a script follows below)
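To make the layering concrete, here is a minimal sketch of what an operator script in the DAGs definition layer could look like: it stays thin and only wires together functions exposed by the package. The `forecasting_pipeline` module and its functions are hypothetical names, not our current API:

```python
# Hypothetical operator script living outside the package: it only parses
# arguments and calls into the package, so the same training logic can be
# driven by DVC, Airflow, or a notebook without duplication.
import argparse

from forecasting_pipeline.data import load_training_data        # hypothetical
from forecasting_pipeline.train import train_model, save_model  # hypothetical


def main() -> None:
    parser = argparse.ArgumentParser(description="Training operator")
    parser.add_argument("--data-path", default="data/processed/train.csv")
    parser.add_argument("--model-out", default="models/model.pkl")
    args = parser.parse_args()

    # All reusable logic lives inside the package; the operator just chains it.
    features, target = load_training_data(args.data_path)
    model = train_model(features, target)
    save_model(model, args.model_out)


if __name__ == "__main__":
    main()
```

Keeping operators this thin is what would let us swap the orchestrator later without touching the training code.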
Setting up this workflow would require:
Alternatives
There are many tools available for orchestrating and managing machine learning pipelines (Airflow, Dagster, Prefect, MLflow), and we discussed some of them with @GHCamille. I just discovered DVC, and it is really lightweight compared to the others. Moreover, it's free, and it doesn't require any additional infrastructure!
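Part of why no extra infrastructure is needed is that, with DVC, run metadata is just files versioned next to the code: a pipeline stage can write its metrics to a small JSON file that the pipeline declares as a metrics output and Git history keeps. A minimal sketch, with hypothetical metric names and values:

```python
# Minimal sketch: metrics are plain files, so no tracking server is required.
# The metric names and values below are hypothetical placeholders.
import json

metrics = {"mae": 12.3, "rmse": 18.7}

with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```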
Additional Context
The diagrams are taken from this Google article: MLOps: Continuous delivery and automation pipelines in machine learning
DVC and CML use cases
-> Sharing Data and Model Files -> CML with DVC
As always, let me know what you think and see you in 🚀 production my friends 🚀