NOTE: This repository is deprecated as of 2022/11/07, and will be removed soon. If you are using MLflow 2.0, please refer to MLflow Recipes Regression Template instead.
The MLflow Regression Pipeline is an MLflow Pipeline for developing high-quality regression models. It is designed for developing models using scikit-learn and frameworks that integrate with scikit-learn, such as the XGBRegressor API from XGBoost.
This repository is a template for developing production-ready regression models with the MLflow Regression Pipeline. It provides a pipeline structure for creating models as well as pointers to configurations and code files that should be filled in to produce a working pipeline.
Code developed with this template should be run with MLflow Pipelines. An example implementation of this template can be found in the MLP Regression Example repo, which targets the NYC taxi dataset for its training problem.
Note: MLflow Pipelines is an experimental feature in MLflow. If you observe any issues, please report them here. For suggestions on improvements, please file a discussion topic here. Your contribution to MLflow Pipelines is greatly appreciated by the community!
Follow the MLflow Pipelines installation guide. You may need to install additional libraries for extra features.
After installing MLflow Pipelines, you can clone this repository to get started. Simply fill in the required values annotated by FIXME::REQUIRED comments in the Pipeline configuration file and in the appropriate profile configuration: local.yaml (if running locally) or databricks.yaml (if running on Databricks).
The Pipeline will then be in a runnable state; when run to completion, it will produce a trained model ready for batch scoring, along with cards containing detailed information about the results of each step.
The model will also be registered to the MLflow Model Registry if it meets registration thresholds.
To iterate and improve your model, follow the MLflow Pipelines usage guide.
Note that iteration will likely involve filling in the optional FIXMEs in the step code files with your own code, in addition to the configuration keys.
The following is an overview of the MLflow Regression Pipeline's information flow.
Model development consists of the following sequential steps:
ingest -> split -> transform -> train -> evaluate -> register
The batch scoring workflow consists of the following sequential steps:
ingest_scoring -> predict
A detailed reference for each step follows.
Each of the steps in the pipeline produces artifacts after completion. These artifacts consist of cards containing detailed execution information, as well as other step-specific information. The Pipeline.inspect() API is used to view step cards, and the get_artifact API is used to load all other step artifacts by name. Per-step artifacts are further detailed in the following step references.
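For illustration, these APIs can be invoked from Python roughly as follows; the "local" profile name here is an assumption (see the profile configuration discussion below):

from mlflow.pipelines import Pipeline

# Load the pipeline defined by pipeline.yaml, running from the repo root
pipeline = Pipeline(profile="local")

# Run all steps, then view the card produced by the train step
pipeline.run()
pipeline.inspect(step="train")

# Load a step artifact by name, e.g. the trained model pipeline
model = pipeline.get_artifact("model")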
The ingest step resolves the dataset specified by the data section in pipeline.yaml and converts it to parquet format, leveraging the custom loader code specified in the data section if necessary.
Note: If you make changes to the dataset referenced by the ingest step (e.g. by adding new records or columns),
you must manually re-run the ingest step in order to use the updated dataset in the pipeline.
The ingest step does not automatically detect changes in the dataset.
The custom loader function allows use of datasets in other formats, such as csv. The function should be defined in steps/ingest.py and should accept two parameters:

- file_path: str. Path to the dataset file.
- file_format: str. The file format string, such as "csv".

It should return a Pandas DataFrame representing the content of the specified file. steps/ingest.py contains an example placeholder function.
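For illustration, a minimal loader might look like the sketch below; the function name follows the template's placeholder, but verify it against your steps/ingest.py:

# steps/ingest.py -- a minimal sketch of a custom loader function
import pandas as pd

def load_file_as_dataframe(file_path: str, file_format: str) -> pd.DataFrame:
    if file_format == "csv":
        # Example: parse a CSV file; adjust parsing options to your data
        return pd.read_csv(file_path, index_col=0)
    raise NotImplementedError(f"Unsupported file format: {file_format}")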
The input dataset is specified by the data section in pipeline.yaml as follows:
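A sketch of this section is shown below; the location and format values are illustrative, and the loader method name follows the template's placeholder:

data:
  # Path or URI of the dataset to ingest (illustrative value)
  location: ./data/sample.csv
  # Dataset format, e.g. parquet or csv
  format: csv
  # Custom loader used for non-parquet formats
  custom_loader_method: steps.ingest.load_file_as_dataframe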
Step artifacts:

- ingested_data: the ingested data as a Pandas DataFrame.

The split step splits the ingested dataset produced by the ingest step into training, validation, and test datasets.
The fraction of records allocated to each dataset is defined by the split_ratios attribute of the split step definition in pipeline.yaml. The split step also preprocesses the datasets using logic defined in steps/split.py.
Subsequent steps use these datasets to develop a model and measure its performance.
The post-split method should be written in steps/split.py and should accept three parameters:

- train_df: DataFrame. The unprocessed training dataset.
- validation_df: DataFrame. The unprocessed validation dataset.
- test_df: DataFrame. The unprocessed test dataset.

It should return a triple representing the processed training, validation, and test datasets. steps/split.py contains an example placeholder function.
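As an illustration, a minimal post-split function might look like the following; the function name and the cleaning logic are assumptions, not part of the template:

# steps/split.py -- a minimal sketch of a post-split processing function
import pandas as pd

def process_splits(train_df: pd.DataFrame,
                   validation_df: pd.DataFrame,
                   test_df: pd.DataFrame):
    # Example preprocessing: drop rows with missing values in each split
    return train_df.dropna(), validation_df.dropna(), test_df.dropna()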
The split step is configured by the steps.split section in pipeline.yaml as follows:
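A sketch of this section, assuming a 75/12.5/12.5 split (the ratios and the method name are illustrative):

steps:
  split:
    # Fractions of records assigned to the train/validation/test datasets
    split_ratios: [0.75, 0.125, 0.125]
    # Post-split preprocessing function defined in steps/split.py
    post_split_method: steps.split.process_splits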
Step artifacts:

- training_data: the training dataset as a Pandas DataFrame.
- validation_data: the validation dataset as a Pandas DataFrame.
- test_data: the test dataset as a Pandas DataFrame.

The transform step uses the training dataset created by the split step to fit a transformer that performs the user-defined transformations. The transformer is then applied to the training dataset and the validation dataset, creating transformed datasets that are used by subsequent steps for estimator training and model performance evaluation.
The user-defined transformation function is not required. If absent, an identity transformer will be used.
The user-defined function should be written in steps/transform.py and should return an unfitted, sklearn-compatible transformer; that is, the returned object should define fit() and transform() methods. steps/transform.py contains an example placeholder function.
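A minimal sketch of such a function, assuming a simple scikit-learn transformer (the function name follows the template's placeholder):

# steps/transform.py -- a minimal sketch of a transformer function
from sklearn.preprocessing import StandardScaler

def transformer_fn():
    # Any unfitted object defining fit() and transform() works here
    return StandardScaler()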
The transform step is configured by the steps.transform section in pipeline.yaml:
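A sketch of this section (the method name follows the template's placeholder):

steps:
  transform:
    # Function in steps/transform.py returning an unfitted transformer
    transformer_method: steps.transform.transformer_fn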
Step artifacts:

- transformed_training_data: the transformed training dataset as a Pandas DataFrame.
- transformed_validation_data: the transformed validation dataset as a Pandas DataFrame.
- transformer: the sklearn transformer.

The train step uses the transformed training dataset output from the transform step to fit a user-defined estimator. The estimator is then joined with the fitted transformer output from the transform step to create a model pipeline. Finally, this model pipeline is evaluated against the transformed training and validation datasets to compute performance metrics.
Custom evaluation metrics are computed according to definitions in steps/custom_metrics.py and the metrics section of pipeline.yaml; see the Custom Metrics section for reference.
The model pipeline and its associated parameters, performance metrics, and lineage information are logged to MLflow Tracking, producing an MLflow Run.
The user-defined estimator function should be written in steps/train.py and should return an unfitted, sklearn-compatible estimator; that is, the returned object should define fit() and predict() methods. steps/train.py contains an example placeholder function.
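A minimal sketch of such a function, assuming a simple scikit-learn regressor (the function name follows the template's placeholder):

# steps/train.py -- a minimal sketch of an estimator function
from sklearn.linear_model import SGDRegressor

def estimator_fn():
    # Any unfitted object defining fit() and predict() works here
    return SGDRegressor(random_state=42)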
The train step is configured by the steps.train section in pipeline.yaml:
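A sketch of this section (the method name follows the template's placeholder):

steps:
  train:
    # Function in steps/train.py returning an unfitted estimator
    estimator_method: steps.train.estimator_fn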
Step artifacts:

- model: the MLflow Model pipeline created in the train step, as a PyFuncModel instance.

The evaluate step evaluates the model pipeline created by the train step on the test dataset output from the split step, computing performance metrics and model explanations.
Performance metrics are compared against configured thresholds to produce a model_validation_status, which indicates whether or not the model is validated for registration to the MLflow Model Registry by the subsequent register step. These model performance thresholds are defined in the validation_criteria section of the evaluate step definition in pipeline.yaml.
Custom evaluation metrics are computed according to definitions in steps/custom_metrics.py and the metrics section of pipeline.yaml; see the Custom Metrics section for reference.
Model performance metrics and explanations are logged to the same MLflow Tracking Run used by the train step.
The evaluate step is configured by the steps.evaluate section in pipeline.yaml:
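A sketch of this section; the metric and threshold shown are illustrative:

steps:
  evaluate:
    validation_criteria:
      # Validate the model only if its root mean squared error
      # on the test dataset is at most 10 (illustrative threshold)
      - metric: root_mean_squared_error
        threshold: 10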
Step artifacts:

- run: the MLflow Tracking Run containing the model pipeline, as well as the performance metrics created during the train and evaluate steps.

The register step checks the model_validation_status output of the preceding evaluate step and, if model validation was successful (model_validation_status is 'VALIDATED'), registers the model pipeline created by the train step to the MLflow Model Registry. If the model did not pass validation checks (model_validation_status is 'REJECTED'), the model pipeline is not registered to the MLflow Model Registry.
If the model pipeline is registered to the MLflow Model Registry, a registered_model_version is produced containing the model name and the model version.
The register step is configured by the steps.register section in pipeline.yaml:
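A sketch of this section; the model name is illustrative, and the allow_non_validated_model key follows the template's placeholder:

steps:
  register:
    # Name under which the model is registered (illustrative value)
    model_name: regression_model
    # Whether a model that failed validation may still be registered
    allow_non_validated_model: false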
Step artifacts:

- registered_model_version: the MLflow Model Registry ModelVersion registered in this step.

After model training, the regression pipeline provides the capability to score new data with the trained model.
The ingest scoring step, defined in the data_scoring section in pipeline.yaml, specifies the dataset used for batch scoring and has the same API as the ingest step.
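A sketch of this section, with illustrative values:

data_scoring:
  # Path or URI of the dataset to score (illustrative value)
  location: ./data/to_score.parquet
  format: parquet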
Step artifacts:

- ingested_scoring_data: the ingested scoring data as a Pandas DataFrame.

The predict step uses the model registered by the register step to score the ingested dataset produced by the ingest scoring step, and writes the resulting dataset to the specified output format and location. To pin a specific model for use in the predict step, provide its model URI as the model_uri attribute of the predict step definition in pipeline.yaml.
The predict step is configured by the steps.predict section in pipeline.yaml:
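A sketch of this section, with illustrative output values and an optional pinned model URI:

steps:
  predict:
    # Where and how to write the scored dataset (illustrative values)
    output_format: parquet
    output_location: ./outputs/scored.parquet
    # Optional: pin a specific registered model version
    model_uri: models:/regression_model/1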
Step artifacts:

- scored_data: the scored dataset, with model predictions under the prediction column, as a Pandas DataFrame.

MLflow runs can be logged to a specific MLflow Tracking server. Tracking information is specified in the profile configuration files: profiles/local.yaml if running locally and profiles/databricks.yaml if running on Databricks.
Configuring a tracking server is optional. If this configuration is absent, the default experiment will be used.
Tracking information is configured with the experiment section in the profile configuration:
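A sketch of this section for a local profile; the experiment name, tracking URI, and artifact location are illustrative:

experiment:
  # Experiment name and tracking server URI (illustrative values)
  name: sklearn_regression_experiment
  tracking_uri: sqlite:///metadata/mlflow/mlruns.db
  artifact_location: ./metadata/mlflow/mlartifacts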
To register trained models to the MLflow Model Registry, further configuration may be required. If unspecified, models will be registered to the same server specified by the tracking URI. To register models to a different server, specify the desired server in the model_registry section in the profile configuration:
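A sketch of this section; both the key name and the URI value here are assumptions, so verify them against your profile configuration:

model_registry:
  # URI of the model registry server (illustrative value)
  uri: sqlite:///metadata/mlflow/registry.db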
Evaluation metrics measure model performance against different datasets. The metrics defined in the pipeline are calculated as part of the train and evaluate steps, and the calculated values are recorded in each step's information card.
This regression pipeline features a set of built-in metrics, and supports user-defined metrics as well.
The primary evaluation metric is the one used to select the best performing model in the MLflow UI, as well as in the train and evaluate steps; models are ranked by this primary metric. It can be either a built-in metric or a custom metric (see below).
Metrics are configured under the metrics section of pipeline.yaml, according to the following specification:
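A sketch of this section; the primary metric shown is illustrative:

metrics:
  # The primary metric used to rank and select models
  primary: root_mean_squared_error
  # custom: optional list of user-defined metrics (see Custom Metrics below)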
Note that each metric specifies a boolean value greater_is_better, which indicates whether a higher value for that metric is associated with better model performance.
The following metrics are built-in. Note that greater_is_better = False for all of these metrics:

- mean_absolute_error
- mean_squared_error
- root_mean_squared_error
- max_error
- mean_absolute_percentage_error
Custom evaluation metrics define how trained models should be evaluated against custom criteria not captured by the built-in sklearn evaluation metrics.
Custom evaluation metric functions should be defined in steps/custom_metrics.py. Each should accept two parameters:

- eval_df: DataFrame. A Pandas DataFrame containing two columns:
  - prediction: predictions produced by submitting input data to the model.
  - target: corresponding ground-truth target values.
- builtin_metrics: Dict[str, float]. The built-in metrics calculated during model evaluation. Maps metric names to corresponding scalar values.

The custom metric function should return a Dict[str, float], mapping custom metric names to corresponding scalar metric values.
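As an illustration, a custom metric might be implemented as follows; the function name and the weighting scheme are assumptions, and the function key in pipeline.yaml must reference whatever function you actually define:

# steps/custom_metrics.py -- a minimal sketch of a custom metric function
from typing import Dict
import pandas as pd

def weighted_mean_square_error(eval_df: pd.DataFrame,
                               builtin_metrics: Dict[str, float]) -> Dict[str, float]:
    # Weight each squared error by the magnitude of the target (example scheme)
    errors = eval_df["prediction"] - eval_df["target"]
    weights = eval_df["target"].abs() + 1
    wmse = float((weights * errors ** 2).sum() / weights.sum())
    return {"weighted_mean_square_error": wmse}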
Custom metrics are specified as a list under the metrics.custom key in pipeline.yaml, as follows:

- name: string. Required. Name of the custom metric. This will be the name by which you refer to this metric when including it in model evaluation or model training.
- function: string. Required. Specifies the function this custom metric refers to.
- greater_is_better: boolean. Required. Indicates whether a higher metric value indicates better model performance.
An example custom metric configuration is as follows:
custom:
  - name: weighted_mean_square_error
    function: steps.custom_metrics.get_custom_metrics
    greater_is_better: False