Data preparation step in training pipeline

Add a data preparation step as the first step in the training pipeline, where all feature engineering and the train-test split are done. Providing train and holdout test datasets by default enforces good practice and helps avoid data leakage, which in turn speeds up model performance analysis and reporting.
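To make the proposal concrete, here is a minimal sketch of the split such a step could perform. It uses only the standard library; the function name `split_train_test` and the `test_fraction` and `seed` parameters are assumptions for illustration, not existing code in this repo.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Row = TypeVar("Row")


def split_train_test(
    rows: Sequence[Row], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[Row], List[Row]]:
    """Hypothetical helper: split rows into train and holdout test subsets."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")

    # Shuffle indices with a fixed seed so the split is reproducible across runs.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    # Keep at least one row on each side of the split.
    n_test = min(len(rows) - 1, max(1, round(len(rows) * test_fraction)))
    test_idx = set(indices[:n_test])

    # Preserve the original row order inside each subset.
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test
```

Fixing the seed keeps the holdout set stable between pipeline runs, so metrics from different runs stay comparable. Any feature engineering that learns parameters from the data (scaling, encoding) would need to be fitted on the train subset only to avoid leakage.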

The train subset should be routed to the train step (2nd step in the pipeline), and the test subset to the evaluation step (3rd step in the pipeline). As a result, the evaluation step should be modified to generate the evaluation metrics, while the comparison with the currently active model should be done later (as part of the register step, or in a separate compare step in between?).
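As a rough illustration of the evaluation step described above, the sketch below computes metrics from the holdout test subset only and returns them as a plain dictionary, leaving the comparison with the active model to a later step. The `evaluate` function, the `predict` callable, and the choice of accuracy as the metric are assumptions for this example.

```python
from typing import Callable, Dict, Sequence


def evaluate(
    predict: Callable[[Sequence[float]], int],
    test_features: Sequence[Sequence[float]],
    test_labels: Sequence[int],
) -> Dict[str, float]:
    """Hypothetical evaluation step: score a model on the holdout test subset."""
    if len(test_features) != len(test_labels):
        raise ValueError("features and labels must have the same length")
    if not test_labels:
        raise ValueError("test subset is empty")

    # Predict on every holdout row and count exact matches with the labels.
    predictions = [predict(x) for x in test_features]
    correct = sum(1 for p, y in zip(predictions, test_labels) if p == y)

    # Return metrics only; comparing against the active model happens elsewhere.
    return {
        "accuracy": correct / len(test_labels),
        "n_samples": float(len(test_labels)),
    }
```

Returning a dictionary of metrics keeps this step independent of the open question above: either the register step or a dedicated compare step could consume the same output.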

The train step can still have its own data-splitting mechanism, to run whatever type of cross-validation is needed to select the best model among the approaches tested.
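One way the train step could split its own data is plain k-fold cross-validation over the train subset, which never touches the holdout test subset. The sketch below is an assumption about how that might look; `k_fold_indices` is a hypothetical helper, and it assumes the rows were already shuffled upstream.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5
) -> Iterator[Tuple[List[int], List[int]]]:
    """Hypothetical helper: yield (train_idx, validation_idx) pairs for k folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and n_samples")

    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]

    start = 0
    for size in fold_sizes:
        stop = start + size
        validation_idx = list(range(start, stop))
        # Every index outside the current fold is used for fitting.
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, validation_idx
        start = stop
```

Each candidate model would be fitted and scored once per fold, and the averaged validation score would drive model selection. The holdout test subset stays reserved for the evaluation step.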

microsoft / dstoolkit-mlops-base

Data preparation step in training pipeline #27