
Categorical Predictors #473

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

In the previous exercises, we used a quantitative predictor in our linear regression, but it's important to note that we can also use categorical predictors. The simplest case of a categorical predictor is a binary variable (only two categories).

For example, suppose we surveyed 100 adults and asked them to report their height in cm and whether or not they play basketball. We've coded the variable play_bball so that it is equal to $1$ if the person plays basketball and $0$ if they do not.

[Figure: scatter plot of height (cm) by play_bball (0 = no, 1 = yes)]
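For concreteness, here is a minimal sketch of how a dataset like this could be simulated; the group means and spreads below are assumptions chosen for illustration, not the actual survey values:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 100 adults: play_bball is 1 for basketball players, 0 otherwise
play_bball = rng.integers(0, 2, size=100)

# Simulated heights in cm: players drawn around 184, non-players around 169
height = np.where(play_bball == 1,
                  rng.normal(184, 7, size=100),
                  rng.normal(169, 7, size=100))

data = pd.DataFrame({'play_bball': play_bball, 'height': height})
print(data.head())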

oldoc63 commented 1 year ago

We see that people who play basketball tend to be taller than people who do not. Just like before, we can draw a line to fit these points. The best line for this plot is the one that goes through the mean height for each group. To recreate the scatter plot with the best fit line, we could use the following code:

import matplotlib.pyplot as plt

# Calculate group means
print(data.groupby('play_bball').mean().height)

Output:

play_bball
0    169.016
1    183.644
Name: height, dtype: float64

# Create scatter plot
plt.scatter(data.play_bball, data.height)

# Add the line using the calculated group means
plt.plot([0, 1], [169.016, 183.644])

# Show the plot
plt.show()

[Figure: scatter plot with the best-fit line through the two group means]
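As a side note, the group means don't need to be hard-coded; the same plot can be produced by computing them first (a sketch, assuming the same data DataFrame):

# Compute the group means once and reuse them for the line
group_means = data.groupby('play_bball')['height'].mean()

plt.scatter(data.play_bball, data.height)
plt.plot([0, 1], [group_means[0], group_means[1]])
plt.show()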

oldoc63 commented 1 year ago
  1. Using the dataset students, create a scatter plot of score (y-axis) against breakfast (x-axis) to see scores for students who did and did not eat breakfast. One approach is sketched below.
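A possible solution sketch, assuming students is a pandas DataFrame with a binary breakfast column (0 = no breakfast, 1 = breakfast) and a numeric score column:

import matplotlib.pyplot as plt

# Scatter plot of score against the binary breakfast indicator
plt.scatter(students.breakfast, students.score)
plt.xlabel('breakfast')
plt.ylabel('score')
plt.show()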
oldoc63 commented 1 year ago
  1. Calculate the mean test score for students who ate breakfast and the mean score for students who did not eat breakfast. Use these numbers to plot the best-fit line on top of the scatter plot (see the sketch below).
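One way to do this, under the same assumptions about students:

# Mean score for each breakfast group
means = students.groupby('breakfast')['score'].mean()
print(means)

# Scatter plot with the best-fit line through the two group means
plt.scatter(students.breakfast, students.score)
plt.plot([0, 1], [means[0], means[1]])
plt.show()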
oldoc63 commented 1 year ago

Fit and Interpretation

Now that we've seen what a regression model with a binary predictor looks like visually, we can actually fit the model using statsmodels.api.OLS.from_formula(), the same way we did for a quantitative predictor:

import statsmodels.api as sm

model = sm.OLS.from_formula('height ~ play_bball', data)
results = model.fit()
print(results.params)

Output:

Intercept     169.016
play_bball     14.628
dtype: float64

Note that this will work if the play_bball variable is coded with 0 and 1, but it will also work if it is coded with True and False, or even if it is coded with strings like 'yes' and 'no' (in this case, the coefficient label will look something like play_bball[T.yes] in the params output, indicating that 'yes' corresponds to a 1).
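For instance, a sketch of the string-coded case (the recoded play_bball_str column is added here purely for illustration):

import numpy as np
import statsmodels.api as sm

# Recode the 0/1 indicator as 'yes'/'no' strings
data['play_bball_str'] = np.where(data.play_bball == 1, 'yes', 'no')

model = sm.OLS.from_formula('height ~ play_bball_str', data)
results = model.fit()
print(results.params)
# Expected labels: Intercept and play_bball_str[T.yes];
# 'no' is the reference level, so the coefficients match the 0/1 fit.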

To interpret this output, we first need to remember that the intercept is the expected value of the outcome variable when the predictor is equal to zero. In this case, the intercept is therefore the mean height of non-basketball players.

The slope is the expected difference in the outcome variable for a one-unit difference in the predictor variable. In this case, a one-unit difference in play_bball is the difference between not being a basketball player and being a basketball player. Therefore, the slope is the difference in mean heights between basketball players and non-basketball players.
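A quick numerical check of this interpretation: the intercept plus the slope, 169.016 + 14.628 = 183.644, reproduces the basketball players' mean height from earlier.

# Assumes results from the fit above
print(results.params['Intercept'])                                 # 169.016, non-players' mean
print(results.params['Intercept'] + results.params['play_bball'])  # 183.644, players' mean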

oldoc63 commented 1 year ago
  1. Create and fit a regression model of score predicted by breakfast using sm.OLS.from_formula() and print out the coefficients.
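A possible solution sketch, again assuming the students DataFrame:

import statsmodels.api as sm

# Fit score ~ breakfast and inspect the coefficients
model = sm.OLS.from_formula('score ~ breakfast', students)
results = model.fit()
print(results.params)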
oldoc63 commented 1 year ago
  1. Calculate the mean test score for students who ate breakfast and the mean score for students who did not eat breakfast. Calculate and print the difference in mean scores. How does this number relate to the regression output? (A sketch follows.)
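One way to check, under the same assumptions; the difference in group means should equal the breakfast coefficient from the fitted model:

# Mean scores by breakfast status
mean_with = students.score[students.breakfast == 1].mean()
mean_without = students.score[students.breakfast == 0].mean()

# This difference matches the slope on breakfast in the regression
print(mean_with - mean_without)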