Open lara-hash opened 3 years ago
This project is part of the Udacity Azure ML Nanodegree. In this project, I built and optimized an Azure ML pipeline using the Python SDK and a provided Scikit-learn logistic regression model, which was tuned with HyperDrive and compared against an AutoML run. Both models were compared using their metrics and then analyzed.
In 1-2 sentences, explain the problem statement: e.g "This dataset contains data about... we seek to predict..."
I used the bank_marketing dataset, which contains 21 columns and 10,000 rows. My focus was on the age group; I trained models through hyperparameter tuning and then compared them.
In 1-2 sentences, explain the solution: e.g. "The best performing model was a ..."
The best model was chosen after training both the Scikit-learn model from train.py (tuned with HyperDrive) and an AutoML run. I set up the HyperDrive configuration as well as the compute cluster, then compared both models on accuracy.
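The setup described above can be sketched with the v1 azureml-sdk as follows. This is a minimal illustration, not my exact configuration: the cluster name, VM size, search ranges, and run limits are assumed values.

```python
# Sketch of the HyperDrive setup (azureml-sdk v1). The cluster name,
# VM size, hyperparameter ranges, and run counts are illustrative.
from azureml.core import Workspace, Experiment, ScriptRunConfig
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.train.hyperdrive import (
    HyperDriveConfig, RandomParameterSampling, BanditPolicy,
    PrimaryMetricGoal, choice, uniform,
)

ws = Workspace.from_config()

# Provision a compute cluster for the runs (assumed name/size)
compute_config = AmlCompute.provisioning_configuration(
    vm_size="STANDARD_D2_V2", max_nodes=4)
cluster = ComputeTarget.create(ws, "cpu-cluster", compute_config)
cluster.wait_for_completion(show_output=True)

# train.py trains the Scikit-learn logistic regression model
src = ScriptRunConfig(source_directory=".", script="train.py",
                      compute_target=cluster)

hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=RandomParameterSampling({
        "--C": uniform(0.01, 10.0),          # inverse regularization strength
        "--max_iter": choice(50, 100, 200),  # iteration cap
    }),
    policy=BanditPolicy(evaluation_interval=2, slack_factor=0.1),
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=25,
    max_concurrent_runs=4,
)

run = Experiment(ws, "hyperdrive-experiment").submit(hyperdrive_config)
```

Submitting this config launches child runs on the cluster, each with a randomly drawn hyperparameter combination, and the best run is selected by the primary metric.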
Explain the pipeline architecture, including data, hyperparameter tuning, and classification algorithm.
I decided to use the random sampler (RandomParameterSampling) because it identifies hyperparameter values by drawing them at random from the defined search space, exploring the space quickly without the cost of an exhaustive grid search.
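The idea behind random sampling can be shown with a small stand-alone sketch (the parameter names and ranges are illustrative, not my actual search space):

```python
import random

# Toy illustration of random hyperparameter sampling: each trial draws
# values independently instead of walking an exhaustive grid.
search_space = {
    "C": lambda: random.uniform(0.01, 10.0),            # continuous range
    "max_iter": lambda: random.choice([50, 100, 200]),  # discrete choices
}

def sample_config(space):
    """Draw one random hyperparameter configuration."""
    return {name: draw() for name, draw in space.items()}

random.seed(0)  # for reproducibility of this demo
configs = [sample_config(search_space) for _ in range(5)]
print(f"sampled {len(configs)} configurations")
```

Each sampled configuration would correspond to one HyperDrive child run.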
What are the benefits of the parameter sampler you chose?
What are the benefits of the early stopping policy you chose?
The runs were allotted 3 hours, so I adopted an early stopping policy to terminate poorly performing runs and save time and compute cost.
In 1-2 sentences, describe the model and hyperparameters generated by AutoML.
Attached below is the VotingEnsemble classifier that AutoML selected as the best of the 25 models it trained, with an accuracy of about 90%; this was the model compared against the HyperDrive result.
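The principle behind a voting ensemble can be shown with a toy majority vote. Note that AutoML's real VotingEnsemble uses weighted soft voting over trained models; the three "models" below are hypothetical stand-in functions:

```python
from collections import Counter

# Toy majority-vote ensemble: each base model casts a prediction
# and the most common label wins. The "models" are stand-ins.
def model_a(x): return 1 if x > 3 else 0
def model_b(x): return 1 if x > 5 else 0
def model_c(x): return 1 if x > 4 else 0

def vote(models, x):
    """Return the majority prediction across the base models."""
    preds = [m(x) for m in models]
    return Counter(preds).most_common(1)[0][0]

models = [model_a, model_b, model_c]
print(vote(models, 6))  # all three predict 1 -> 1
print(vote(models, 4))  # predictions [1, 0, 0] -> 0
```

Combining diverse base models this way often beats any single one of them, which is why ensembles frequently top AutoML leaderboards.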
Compare the two models and their performance. What are the differences in accuracy? In architecture? If there was a difference, why do you think there was one?
An improvement could be achieved by choosing a different set of hyperparameters and a different primary metric; this would change the outcome on metrics such as weighted AUC, precision, and recall. Below is the result from my runs.
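To illustrate why the choice of metric matters, the sketch below computes accuracy, precision, and recall by hand on toy labels (the label vectors are made up, not from my runs); a model can look better or worse depending on which of these is the primary metric:

```python
# Toy confusion-matrix metrics; y_true/y_pred are illustrative only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))      # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))      # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)  # 0.625 0.6 0.75
```

Here the same predictions score differently on each metric, so ranking models by weighted AUC or recall instead of accuracy could pick a different winner.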
I did not delete my compute cluster; it was left as-is in the workspace. Image attached below.