lara-hash / Optimizing-ML-Pipeline

updated project-optimizing ml pipeline #1

Open lara-hash opened 3 years ago

Optimizing an ML Pipeline in Azure

Overview

This project is part of the Udacity Azure ML Nanodegree. In this project, I built and optimized an Azure ML pipeline using the Python SDK and a provided Scikit-learn model; logistic regression came out as my best model across the AutoML and HyperDrive runs. Both models were compared using their metrics and then analyzed.

Summary

In 1-2 sentences, explain the problem statement: e.g. "This dataset contains data about... we seek to predict..."

I used the bank_marketing dataset, which contains 21 columns and 10,000 rows. My focus was on the age group, training models through hyperparameter tuning and then comparing them.
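As a sketch, the dataset could be loaded with the Azure ML SDK's `TabularDatasetFactory`; the CSV URL below is a placeholder for the one provided in the starter code, not the real location:

```python
# Sketch only: load the bank_marketing CSV as an Azure ML tabular dataset.
# The URL is a placeholder for the one provided in the starter code.
from azureml.data.dataset_factory import TabularDatasetFactory

ds = TabularDatasetFactory.from_delimited_files(path="<bankmarketing-csv-url>")
df = ds.to_pandas_dataframe()  # expected shape: 10,000 rows x 21 columns
```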

In 1-2 sentences, explain the solution: e.g. "The best performing model was a ..."

The best model was chosen after training with AutoML and with train.py. I set up the HyperDrive config as well as the compute cluster, and then compared both models for accuracy.
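A minimal sketch of that setup with the Azure ML Python SDK (v1) follows; the cluster name, VM size, and search-space values are assumptions for illustration, not the exact settings from my runs:

```python
# Sketch only (Azure ML SDK v1); cluster name, VM size, and the search
# space values are assumptions, not the exact settings from the runs.
from azureml.core import Workspace, ScriptRunConfig
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.train.hyperdrive import (HyperDriveConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, choice)

ws = Workspace.from_config()

# Provision a CPU compute cluster for the runs.
compute_config = AmlCompute.provisioning_configuration(
    vm_size="Standard_D2_V2", max_nodes=4)
cluster = ComputeTarget.create(ws, "cpu-cluster", compute_config)
cluster.wait_for_completion(show_output=True)

# Tune the arguments that train.py passes to LogisticRegression.
sampling = RandomParameterSampling({
    "--C": choice(0.01, 0.1, 1, 10),
    "--max_iter": choice(50, 100, 200),
})
src = ScriptRunConfig(source_directory=".", script="train.py",
                      compute_target=cluster)
hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=sampling,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=25,
)
```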

Scikit-learn Pipeline

Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm.

I decided to choose random sampling because it identifies hyperparameter values by selecting them at random from the search space, which explores the space quickly and at low compute cost.
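To illustrate the mechanism outside Azure, here is a small pure-Python sketch of random sampling over a hypothetical search space for train.py's two arguments (the value lists are illustrative, not the ones used in my runs):

```python
import random

# Hypothetical search space for the logistic-regression hyperparameters
# tuned in train.py: inverse regularization strength C and max_iter.
search_space = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "max_iter": [50, 100, 150, 200],
}

def random_sample(space, n_runs, seed=0):
    """Randomly pick one value per hyperparameter for each run."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_runs)]

samples = random_sample(search_space, n_runs=4)
for s in samples:
    print(s)  # each run gets an independent random configuration
```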

What are the benefits of the parameter sampler you chose?

What are the benefits of the early stopping policy you chose?

The runs were set up for 3 hours, so I adopted an early stopping policy to terminate poorly performing runs and save time and cost.
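Azure ML's BanditPolicy is one such early stopping policy: it terminates any run whose primary metric falls outside a slack factor of the best run so far. A minimal pure-Python sketch of that rule, with an illustrative slack value:

```python
def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Bandit-style check: stop a run whose primary metric (higher is
    better, e.g. accuracy) falls outside slack_factor of the best run."""
    return run_metric < best_metric / (1 + slack_factor)

# A run at 0.80 accuracy survives when the best so far is 0.85 ...
print(should_terminate(0.80, 0.85))  # False
# ... but a run at 0.70 is terminated early.
print(should_terminate(0.70, 0.85))  # True
```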

AutoML

In 1-2 sentences, describe the model and hyperparameters generated by AutoML.

The VotingEnsemble classifier was the best model AutoML produced during my runs, with an accuracy of about 90% across the 25 models trained by AutoML and HyperDrive.
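AutoML's VotingEnsemble combines its base models by weighted soft voting, i.e. averaging their predicted class probabilities and taking the argmax. A small pure-Python sketch of that idea (the probabilities and weights below are made up for illustration):

```python
def soft_vote(probas, weights):
    """Weighted soft vote: average the class-probability vectors from
    each base model, then predict the class with the highest average."""
    n_classes = len(probas[0])
    total = sum(weights)
    avg = [sum(w * p[c] for w, p in zip(weights, probas)) / total
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Three hypothetical base models' [P(no), P(yes)] outputs:
probas = [[0.6, 0.4], [0.3, 0.7], [0.45, 0.55]]
print(soft_vote(probas, weights=[1, 1, 1]))  # 1 (average favors "yes")
```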

Pipeline comparison

Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?

Future work

An improvement could be achieved by choosing a different set of hyperparameters and a different primary metric. This would change the outcome on metrics such as weighted AUC, precision, and recall.
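For example, switching AutoML's primary metric from accuracy to weighted AUC is a small config change. A sketch, where `train_ds` and `cluster` are placeholders for the registered dataset and compute target:

```python
# Sketch only: re-run AutoML optimizing weighted AUC instead of accuracy.
# train_ds and cluster are placeholders for the dataset and compute target.
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="AUC_weighted",   # instead of "accuracy"
    training_data=train_ds,
    label_column_name="y",
    experiment_timeout_minutes=30,
    compute_target=cluster,
)
```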

Proof of cluster clean up

I did not delete my cluster and left it as it is in the workspace.
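For reference, a sketch of how the cluster could be cleaned up through the SDK; "cpu-cluster" is a placeholder name:

```python
# Sketch only: delete the compute cluster via the Azure ML SDK.
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget

ws = Workspace.from_config()
cluster = ComputeTarget(workspace=ws, name="cpu-cluster")  # placeholder name
cluster.delete()
```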
