microsoft / dstoolkit-mlops-base

Support ML teams to accelerate their model deployment to production leveraging Azure
MIT License

Data preparation step in training pipeline #27

Closed mariamedp closed 2 years ago

mariamedp commented 2 years ago

Add data prep as the initial step of the training pipeline, where all feature engineering and the train-test split are done. Producing the train and holdout test datasets by default will enforce good practices and help avoid data leakage, which in turn speeds up model performance analysis and reporting.

The train sub-dataset should be routed to the train step (2nd step in the pipeline), and the test sub-dataset to the evaluation step (3rd step in the pipeline). As a result, the evaluation step should be modified to generate the evaluation metrics, while the comparison with the currently active model should be done later (as part of the register step? or in a separate compare step in between?).
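To make the proposed data flow concrete, here is a minimal sketch in plain Python. It is not the toolkit's code: `prepare_data`, `train`, `evaluate`, the 80/20 split, and the toy "predict the mean" model are all hypothetical names and choices used for illustration only. In the real pipeline each function would be a separate pipeline step.

```python
import random
from statistics import mean


def prepare_data(rows, test_fraction=0.2, seed=42):
    """Data prep step (hypothetical): feature engineering would go here,
    followed by a single seeded train / holdout-test split."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # Return (train, test); the test rows are never shown to the train step.
    return shuffled[n_test:], shuffled[:n_test]


def train(train_rows):
    """Train step (hypothetical): only sees the train sub-dataset.
    The toy model simply predicts the mean target value."""
    return {"prediction": mean(y for _, y in train_rows)}


def evaluate(model, test_rows):
    """Evaluation step (hypothetical): computes metrics on the holdout set.
    Comparison with the active model is deliberately left out of this step."""
    errors = [abs(y - model["prediction"]) for _, y in test_rows]
    return {"mae": mean(errors)}


# Toy dataset of (feature, target) pairs.
rows = [(x, 2.0 * x) for x in range(10)]

train_rows, test_rows = prepare_data(rows)   # step 1: data prep
model = train(train_rows)                    # step 2: train
metrics = evaluate(model, test_rows)         # step 3: evaluation
```

Because the split happens once, up front, the train step cannot accidentally use holdout rows, which is the leakage protection the proposal is after.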

The train step can still have its own internal data-splitting mechanism, to do whatever type of cross-validation is needed to select the best model among the approaches tried.
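As an illustration of model selection inside the train step, the sketch below runs k-fold cross-validation on the train sub-dataset only. Everything in it is an assumption made for the example: the helper names (`k_fold_indices`, `cross_val_mae`, `select_best`), the two toy candidates (predict the mean vs. predict the median), and the use of mean absolute error as the selection metric.

```python
from statistics import mean, median


def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def cross_val_mae(fit, targets, k=3):
    """Average validation MAE of a constant predictor produced by `fit`."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(targets), k):
        prediction = fit([targets[i] for i in train_idx])
        scores.append(mean(abs(targets[i] - prediction) for i in val_idx))
    return mean(scores)


def select_best(candidates, targets, k=3):
    """Pick the candidate with the lowest cross-validated MAE."""
    scores = {name: cross_val_mae(fit, targets, k) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical candidates: each maps training targets to a constant prediction.
candidates = {"mean": mean, "median": median}

# Targets from the train sub-dataset only; the holdout test set is not touched here.
train_targets = [1.0, 1.0, 1.0, 1.0, 1.0, 100.0]
best_name, cv_scores = select_best(candidates, train_targets, k=3)
```

The holdout test set stays reserved for the evaluation step, so the metrics reported there are not biased by the selection done here.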