Motivation
Currently, our models are trained offline, and we deploy them directly as a prediction service. This manual process, as described below, comes with many challenges and doesn't help us with:
Maintaining code reusability
Managing models and retraining them frequently
Ensuring that model training performance is reproducible in production
Allowing team members and contributors to experiment, incorporate the latest methods, and iterate quickly, while keeping a clear target of a production-ready forecasting pipeline
Implementing and automating retraining strategies and model deployment
There are other challenges, such as model testing and monitoring, but let's deal with them in another issue 😊
Why is it important for us:
Any change to the training dataset, processing pipeline, or modeling approach that improves performance should lead to rerunning our training pipeline
With the project being open source, we need to find a way to scale collaboration between multiple contributors and ensure that no contribution gets lost!
All of that while ensuring:
Training reproducibility
Machine learning system traceability (code + data + config)
Consistency between training and prediction
🚀 Feature
Orchestrating and automating our training pipeline following the workflow described below
Design Proposal
Let's adopt a lightweight version of the pipeline described above:
Public Dataset Registry or Feature Store: Gdrive or Bucket
Private Model Registry, Metadata Registry, Prediction Registry: Gdrive or Bucket (see the access sketch after this list)
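As a rough sketch of what the dataset registry side could look like if we go with DVC (see Alternatives below), a contributor could read a versioned dataset directly from the Gdrive or bucket remote through DVC's Python API. The file path, repository URL, and tag below are hypothetical placeholders, not our actual layout:

```python
# Minimal sketch: read a dataset version from the public dataset registry
# (a DVC remote backed by Gdrive or a bucket). DVC resolves the file from
# whichever remote is configured in the repository.
import dvc.api
import pandas as pd

with dvc.api.open(
    "data/processed/train.csv",              # hypothetical tracked path
    repo="https://github.com/<org>/<repo>",  # hypothetical repository URL
    rev="v1.0",                              # any Git tag, branch, or commit
) as f:
    train_df = pd.read_csv(f)

print(train_df.shape)
```

Because the dataset version is just a Git revision, the same call reproduces the exact data a given model was trained on.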
To keep things tidy, I think we should adopt the following layers and package structure:
DAGs definition layer: operator scripts outside of the package (a sketch of such a script follows below)
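To make the layering concrete, here is a minimal sketch of what an operator script in the DAGs definition layer could look like: it stays thin and only wires together functions exposed by the package. The `forecasting_pipeline` module and its functions are hypothetical names, not our current API:

```python
# Hypothetical operator script living outside the package: it only parses
# arguments and calls into the package, so the same training logic can be
# driven by DVC, Airflow, or a notebook without duplication.
import argparse

from forecasting_pipeline.data import load_training_data        # hypothetical
from forecasting_pipeline.train import train_model, save_model  # hypothetical


def main() -> None:
    parser = argparse.ArgumentParser(description="Training operator")
    parser.add_argument("--data-path", default="data/processed/train.csv")
    parser.add_argument("--model-out", default="models/model.pkl")
    args = parser.parse_args()

    # All reusable logic lives inside the package; the operator just chains it.
    features, target = load_training_data(args.data_path)
    model = train_model(features, target)
    save_model(model, args.model_out)


if __name__ == "__main__":
    main()
```

Keeping operators this thin is what would let us swap the orchestrator later without touching the training code.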
Setting up this workflow would require:
Alternatives
There are many tools available for orchestrating and managing machine learning pipelines (Airflow, Dagster, Prefect, MLflow), and we discussed some of them with @GHCamille. I just discovered DVC, and it is really lightweight compared to the others. Moreover, it's free, and it doesn't require any additional infrastructure!
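Part of why no extra infrastructure is needed is that, with DVC, run metadata is just files versioned next to the code: a pipeline stage can write its metrics to a small JSON file that the pipeline declares as a metrics output and Git history keeps. A minimal sketch, with hypothetical metric names and values:

```python
# Minimal sketch: metrics are plain files, so no tracking server is required.
# The metric names and values below are hypothetical placeholders.
import json

metrics = {"mae": 12.3, "rmse": 18.7}

with open("metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)
```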
Additional Context
The diagrams are taken from this Google article: MLOps: Continuous delivery and automation pipelines in machine learning
DVC and CML use cases
-> Sharing Data and Model Files -> CML with DVC
As always, let me know what you think and see you in 🚀 production my friends 🚀