Assumptions of linear regression

oldoc63 / learningDS

Learning DS with Codecademy and Books


Assumptions of linear regression #471

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

There are a number of assumptions of simple linear regression, which are important to check if you are fitting a linear model. The first assumption is that the relationship between the outcome variable and predictor is linear (can be described by a line). We can check this before fitting the regression by simply looking at a plot of the two variables.
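A minimal sketch of that visual check, assuming matplotlib is available. The numbers below are made up for illustration and are not the dataset used in this lesson; only the column names `height` and `weight` follow the text.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for illustration only; not the lesson's dataset.
body_measurements = pd.DataFrame({
    "height": [150, 155, 160, 165, 170, 175, 180, 185],
    "weight": [52, 55, 59, 62, 66, 70, 73, 78],
})

# Scatter plot of outcome (weight) against predictor (height).
fig, ax = plt.subplots()
ax.scatter(body_measurements["height"], body_measurements["weight"])
ax.set_xlabel("height")
ax.set_ylabel("weight")
ax.set_title("weight vs. height")

# Write the figure to a file; open it to inspect the shape of the relationship.
fig.savefig("weight_vs_height.png")
```

If the points fall roughly along a straight line, the linearity assumption looks reasonable; a clear curve would suggest it does not hold.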

The next two assumptions (normality and homoscedasticity) are easier to check after fitting the regression. But first, we need to calculate two things: fitted values and residuals.

Again consider our regression model to predict weight based on height (model formula 'weight ~ height'). The fitted values are the predicted weights for each person in the dataset that was used to fit the model, while the residuals are the differences between the true weight and the predicted weight for each person (true minus predicted).
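In symbols, the two quantities can be written as follows. The notation is an assumption introduced here, not taken from the lesson: $b_0$ and $b_1$ are the fitted intercept and slope, $x_i$ is the height of person $i$, $y_i$ is their true weight, $\hat{y}_i$ is the fitted value, and $e_i$ is the residual.

```latex
\hat{y}_i = b_0 + b_1 x_i, \qquad e_i = y_i - \hat{y}_i
```

With this sign convention, $e_i > 0$ means the true weight is above the prediction and $e_i < 0$ means it is below.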

oldoc63 commented 1 year ago

We can calculate the fitted values by calling .predict() on the fitted model and passing in the original data. The result is a pandas Series containing the predicted value for each person in the original dataset:

# results is the fitted model; body_measurements is the data it was fit on
fitted_values = results.predict(body_measurements)
print(fitted_values.head())

0    66.673077
1    59.100962
2    71.721154
3    70.711538
4    65.158654
dtype: float64

The residuals are the differences between the true values of the outcome variable and these fitted values. They are calculated by subtracting the fitted values from the actual values. In Python, subtracting one pandas Series from another performs this subtraction element-wise, as shown below:

# residual = actual weight - fitted (predicted) weight, row by row
residuals = body_measurements.weight - fitted_values
print(residuals.head())

0   -2.673077
1   -1.100962
2    3.278846
3   -3.711538
4    2.841346
dtype: float64

oldoc63 commented 1 year ago

script.py already contains the code to fit a model on the students dataset that predicts test score using hours_studied as a predictor. Calculate the fitted values for this model and save them as fitted_values.

oldoc63 commented 1 year ago

Calculate the residuals for this model and save the result as residuals.

oldoc63 commented 1 year ago

Print out the first 5 values in residuals and inspect them. What is the difference between a positive and negative residual? These numbers tell us how far the true test scores are from the test scores predicted by the model. A positive residual means that the student scored higher than the model predicted; a negative residual means that the student scored lower than predicted.
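A small sketch of how the sign of a residual can be read. The scores and predictions below are hypothetical numbers chosen for illustration, and the column names `score` and `predicted` and the helper `describe` are assumptions; none of them come from the students dataset or script.py.

```python
import pandas as pd

# Hypothetical numbers for illustration; not from the students dataset.
students = pd.DataFrame({
    "score": [82.0, 67.0, 90.0, 74.0],
    "predicted": [78.5, 70.0, 90.0, 71.25],
})

# Residual = true score - predicted score.
residuals = students["score"] - students["predicted"]


def describe(residual):
    """Translate the sign of a residual into words (hypothetical helper)."""
    if residual > 0:
        return "scored higher than predicted"
    if residual < 0:
        return "scored lower than predicted"
    return "scored exactly as predicted"


summary = pd.DataFrame({
    "residual": residuals,
    "meaning": residuals.apply(describe),
})
print(summary)
```

Here the first student's residual is 3.5 (above the prediction) and the second student's is -3.0 (below it).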