lara-hash / Optimizing-ML-Pipeline

updated project-optimizing ml pipeline #1

Open lara-hash opened 3 years ago

Optimizing an ML Pipeline in Azure

Overview

This project is part of the Udacity Azure ML Nanodegree. In this project, I built and optimized an Azure ML pipeline using the Python SDK and a provided Scikit-learn model; logistic regression came out as my best model across the AutoML and HyperDrive runs. Both models were compared using their metrics and then analyzed.

Summary

In 1-2 sentences, explain the problem statement: e.g. "This dataset contains data about... we seek to predict..."

I used the bank_marketing dataset, which contains 21 columns and 10,000 rows. My focus was on the age group, training models through hyperparameter tuning and then comparing them.
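As a sketch, the dataset could be loaded with the Azure ML SDK's `TabularDatasetFactory`; the CSV URL below is a placeholder for the one provided in the starter code, not the real location:

```python
# Sketch only: load the bank_marketing CSV as an Azure ML tabular dataset.
# The URL is a placeholder for the one provided in the starter code.
from azureml.data.dataset_factory import TabularDatasetFactory

ds = TabularDatasetFactory.from_delimited_files(path="<bankmarketing-csv-url>")
df = ds.to_pandas_dataframe()  # expected shape: 10,000 rows x 21 columns
```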

In 1-2 sentences, explain the solution: e.g. "The best performing model was a ..."

The best model was chosen after training with AutoML and with train.py. I set up the HyperDrive config as well as the compute cluster, and then compared both models for accuracy.
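A minimal sketch of that setup with the Azure ML Python SDK (v1) follows; the cluster name, VM size, and search-space values are assumptions for illustration, not the exact settings from my runs:

```python
# Sketch only (Azure ML SDK v1); cluster name, VM size, and the search
# space values are assumptions, not the exact settings from the runs.
from azureml.core import Workspace, ScriptRunConfig
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.train.hyperdrive import (HyperDriveConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, choice)

ws = Workspace.from_config()

# Provision a CPU compute cluster for the runs.
compute_config = AmlCompute.provisioning_configuration(
    vm_size="Standard_D2_V2", max_nodes=4)
cluster = ComputeTarget.create(ws, "cpu-cluster", compute_config)
cluster.wait_for_completion(show_output=True)

# Tune the arguments that train.py passes to LogisticRegression.
sampling = RandomParameterSampling({
    "--C": choice(0.01, 0.1, 1, 10),
    "--max_iter": choice(50, 100, 200),
})
src = ScriptRunConfig(source_directory=".", script="train.py",
                      compute_target=cluster)
hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=sampling,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=25,
)
```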

Scikit-learn Pipeline

Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm.

I decided to choose random sampling because it identifies hyperparameter values by selecting them at random from the search space, which explores the space quickly and at low compute cost.
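To illustrate the mechanism outside Azure, here is a small pure-Python sketch of random sampling over a hypothetical search space for train.py's two arguments (the value lists are illustrative, not the ones used in my runs):

```python
import random

# Hypothetical search space for the logistic-regression hyperparameters
# tuned in train.py: inverse regularization strength C and max_iter.
search_space = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],
    "max_iter": [50, 100, 150, 200],
}

def random_sample(space, n_runs, seed=0):
    """Randomly pick one value per hyperparameter for each run."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_runs)]

samples = random_sample(search_space, n_runs=4)
for s in samples:
    print(s)  # each run gets an independent random configuration
```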

What are the benefits of the parameter sampler you chose?

What are the benefits of the early stopping policy you chose?

The runs were set up for 3 hours, so I adopted an early stopping policy to terminate poorly performing runs and save time and cost.
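Azure ML's BanditPolicy is one such early stopping policy: it terminates any run whose primary metric falls outside a slack factor of the best run so far. A minimal pure-Python sketch of that rule, with an illustrative slack value:

```python
def should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Bandit-style check: stop a run whose primary metric (higher is
    better, e.g. accuracy) falls outside slack_factor of the best run."""
    return run_metric < best_metric / (1 + slack_factor)

# A run at 0.80 accuracy survives when the best so far is 0.85 ...
print(should_terminate(0.80, 0.85))  # False
# ... but a run at 0.70 is terminated early.
print(should_terminate(0.70, 0.85))  # True
```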

AutoML

In 1-2 sentences, describe the model and hyperparameters generated by AutoML.

The VotingEnsemble classifier was the best model AutoML produced during my runs, with an accuracy of about 90% across the 25 models trained by AutoML and HyperDrive.
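AutoML's VotingEnsemble combines its base models by weighted soft voting, i.e. averaging their predicted class probabilities and taking the argmax. A small pure-Python sketch of that idea (the probabilities and weights below are made up for illustration):

```python
def soft_vote(probas, weights):
    """Weighted soft vote: average the class-probability vectors from
    each base model, then predict the class with the highest average."""
    n_classes = len(probas[0])
    total = sum(weights)
    avg = [sum(w * p[c] for w, p in zip(weights, probas)) / total
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Three hypothetical base models' [P(no), P(yes)] outputs:
probas = [[0.6, 0.4], [0.3, 0.7], [0.45, 0.55]]
print(soft_vote(probas, weights=[1, 1, 1]))  # 1 (average favors "yes")
```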

Pipeline comparison

Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?

Future work

An improvement could be achieved by choosing a different set of hyperparameters and a different primary metric. This would change the outcome on metrics such as weighted AUC, precision, and recall.
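For example, switching AutoML's primary metric from accuracy to weighted AUC is a small config change. A sketch, where `train_ds` and `cluster` are placeholders for the registered dataset and compute target:

```python
# Sketch only: re-run AutoML optimizing weighted AUC instead of accuracy.
# train_ds and cluster are placeholders for the dataset and compute target.
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="AUC_weighted",   # instead of "accuracy"
    training_data=train_ds,
    label_column_name="y",
    experiment_timeout_minutes=30,
    compute_target=cluster,
)
```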

Proof of cluster clean up

I did not delete my cluster and left it as it is in the workspace.
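For reference, a sketch of how the cluster could be cleaned up through the SDK; "cpu-cluster" is a placeholder name:

```python
# Sketch only: delete the compute cluster via the Azure ML SDK.
from azureml.core import Workspace
from azureml.core.compute import ComputeTarget

ws = Workspace.from_config()
cluster = ComputeTarget(workspace=ws, name="cpu-cluster")  # placeholder name
cluster.delete()
```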
