learn-co-curriculum / dsc-2-13-15-linalg-regression-lab

Other
0 stars 5 forks source link

Regression with Linear Algebra - Lab

Introduction

In this lab, we shall apply regression analysis using simple matrix manipulations to fit a model to given data, and then predict new values for previously unseen data. We shall follow the approach highlighted in previous lesson where we used numpy to build the appropriate matrices and vectors and solve for the $\beta$ (unknown variables) vector. The beta vector will be used with test data to make new predictions. We shall also evaluate how good our model fit was.

In order to make this experiment interesting. We shall use NumPy at every single stage of this experiment i.e. loading data, creating matrices, performing test train split, model fitting and evaluations.

Objectives

You will be able to:

First let's import necessary libraries

import csv # for reading csv file
import numpy as np

Dataset

The dataset we will use for this experiment is "Sales Prices in the City of Windsor, Canada", something very similar to the Boston Housing dataset. This dataset contains a number of input (independent) variables, including area, number of bedrooms/bathrooms, facilities(AC/garage) etc. and an output (dependent) variable, price. We shall formulate a linear algebra problem to find linear mappings from input to out features using the equation provided in the previous lesson.

This will allow us to find a relationship between house features and house price for the given data, allowing us to find unknown prices for houses, given the input features.

A description of dataset and included features is available at THIS LINK

In your repo, the dataset is available as windsor_housing.csv containing following variables:

there are 11 input features (first 11 columns):

lotsize bedrooms    bathrms stories driveway    recroom fullbase    gashw   airco   garagepl    prefarea

and 1 output feature i.e. price (12th column).

The focus of this lab is not really answering a preset analytical question, but to learn how we can perform a regression experiment, similar to one we performed in statsmodels, using mathematical manipulations. So we we wont be using any Pandas or statsmodels goodness here. The key objectives here are to a) understand regression with matrix algebra, and b) Mastery in NumPy scientific computation.

Stage 1: Prepare Data for Modeling

Let's give you a head start by importing the dataset. We shall perform following steps to get the data ready for analysis:

NOTE: read.csv() would read the csv as a text file, so we must convert the contents to float at some stage.

# Your Code here

# First 5 rows of raw data 

# array([[1.00e+00, 5.85e+03, 3.00e+00, 1.00e+00, 2.00e+00, 1.00e+00,
#         0.00e+00, 1.00e+00, 0.00e+00, 0.00e+00, 1.00e+00, 0.00e+00,
#         4.20e+04],
#        [1.00e+00, 4.00e+03, 2.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
#         0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
#         3.85e+04],
#        [1.00e+00, 3.06e+03, 3.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
#         0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
#         4.95e+04],
#        [1.00e+00, 6.65e+03, 3.00e+00, 1.00e+00, 2.00e+00, 1.00e+00,
#         1.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
#         6.05e+04],
#        [1.00e+00, 6.36e+03, 2.00e+00, 1.00e+00, 1.00e+00, 1.00e+00,
#         0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00,
#         6.10e+04]])

Step 2: Perform a 80/20 test train Split

Explore NumPy's official documentation to manually split a dataset using numpy.random.shuffle(), numpy.random.permutations() or using simple resampling method.

# Your code here 

# Split results

# Raw data Shape:  (546, 13)
# Train/Test Split: (437, 13) (109, 13)
# x_train, y_train, x_test, y_test: (437, 12) (437,) (109, 12) (109,)

Step 3: Calculate the beta

With our X and y in place, We can now compute our beta values with x_train and y_train as:

$\beta$ = (x_trainT . x_train)-1 . x_trainT . y_train

# Your code here 

# Calculated beta values

# [-3.07118956e+03  2.13543921e+00  4.04283395e+03  1.33559881e+04
#   5.75279185e+03  7.82810082e+03  3.73584043e+03  6.51098935e+03
#   1.28802060e+04  1.09853850e+04  6.14947126e+03  1.05813305e+04]
[7.13503312e+03 3.14678249e+00 3.28225971e+02 1.28839251e+04
 6.68332755e+03 3.44682885e+03 3.19440776e+03 5.07824499e+03
 1.32822228e+04 1.10098716e+04 2.88253206e+03 9.16916600e+03]

Step 4: Make Predictions

Great , we now have a set of coefficients that describe the linear mappings between X and y. We can now use the calculated beta values with the test datasets that we left out to calculate y predictions. For this we need to perform the following tasks:

Now we shall all features in each row in turn and multiply it with the beta computed above. The result will give a prediction for each row which we can append to a new array of predictions.

$\hat{y}$ = x.$\beta$ = $\beta$0 + $\beta$1 . x1 + $\beta$2 . x2 + .. + $\beta$m . xm

# Your code here 

Step 5: Evaluate Model

Visualize Actual vs. Predicted

This is exciting, so now our model can use the beta value to predict the price of houses given the input features. Let's plot these predictions against the actual values in y_test to see how much our model deviates.

# Plot predicted and actual values as line plots

This doesn't look so bad, does it ? Our model, although isn't perfect at this stage, is making a good attempt to predict house prices although a few prediction seem a bit out. There could a number of reasons for this. Let's try to dig a bit deeper to check model's predictive abilities by comparing these prediction with actual values of y_test individually. That will help us calculate the RMSE value (Root Mean Squared Error) for our model.

Root Mean Squared Error

Here is the formula for this again.

# Calculate RMSE

# Due to random split, your answers may vary 

# RMSE = 16401.913562758735

Normalized Root Mean Squared Error

The above error is clearly in terms of the dependant variable i.e. the final house price. We can also use a normlized mean squared error in case of multiple regression which can be calculated from RMSE using following formula:

# Calculate NRMSE

# Due to random split, your answers may vary 

# 0.09940553674399233
0.09940553674399233

SO there it is. A complete multiple regression analysis using nothing but numpy. Having good programming skills in numpy would allow to dig deeper into analytical algorithms in machine learning and deep learning. Using matrix multiplication techniques we saw here, we can easily build a whole neural network from scratch.

Level up - Optional

Summary

So there we have it. A predictive model for predicting house prices in a given dataset. Remember this is a very naive implementation of regression modeling. The purpose here was to get an introduction to the applications of linear algebra into machine learning and predictive analysis. We still have a number of shortcomings in our modeling approach and we can further apply a number of data modeling techniques to improve this model.