Open oldoc63 opened 1 year ago
We see that people who play basketball tend to be taller than people who do not. Just like before, we can draw a line to fit these points. The best line for this plot is the one that goes through the mean height for each group. To recreate the scatter plot with the best fit line, we could use the following code:
# Calculate group means (select height first so non-numeric columns are ignored)
print(data.groupby('play_bball').height.mean())
play_bball
--
0 | 169.016
1 | 183.644
import matplotlib.pyplot as plt

# Create scatter plot
plt.scatter(data.play_bball, data.height)
# Add the line using calculated group means
plt.plot([0, 1], [169.016, 183.644])
# Show the plot
plt.show()
Now that we've seen what a regression model with a binary predictor looks like visually, we can fit the model using statsmodels.api.OLS.from_formula(), the same way we did for a quantitative predictor:
import statsmodels.api as sm

model = sm.OLS.from_formula('height ~ play_bball', data)
results = model.fit()
print(results.params)
Intercept 169.016
play_bball 14.628
dtype: float64
Note that this will work if the play_bball variable is coded with 0 and 1, but it will also work if it is coded with True and False, or even with strings like 'yes' and 'no'. In the string case, the coefficient label will look something like play_bball[T.yes] in the params output, indicating that 'yes' corresponds to a 1.
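As a rough illustration of what that label means, the sketch below dummy-codes a string variable by hand. The responses are hypothetical, and the choice of the alphabetically first level as the reference category is an assumption made for this example; it is not a description of how statsmodels is implemented internally.

```python
# Hypothetical string-coded survey answers (illustrative only)
responses = ['no', 'yes', 'yes', 'no', 'yes']

# Dummy coding: pick a reference level and mark every other level with 1.
# Assumption: the reference is the alphabetically first level, so
# 'no' -> 0 and 'yes' -> 1.
levels = sorted(set(responses))
reference = levels[0]
indicator = [0 if r == reference else 1 for r in responses]

# Label in the style of the params output quoted above
label = f"play_bball[T.{levels[1]}]"

print(reference)
print(indicator)
print(label)
```

With this coding, the coefficient attached to the label is the expected change in height when moving from the reference level to the other level.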
To interpret this output, we first need to remember that the intercept is the expected value of the outcome variable when the predictor is equal to zero. In this case, the intercept is therefore the mean height of non-basketball players.
The slope is the expected difference in the outcome variable for a one unit difference in the predictor variable. In this case, a one unit difference in play_bball is the difference between not being a basketball player and being a basketball player. Therefore, the slope is the difference in mean heights between basketball players and non-basketball players: 183.644 - 169.016 = 14.628.
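The interpretation above can be checked without any regression library. The following is a minimal sketch using the closed-form least-squares formulas on a small made-up data set (the six heights are hypothetical, not the survey data); `ols_fit` is a helper defined here for illustration only.

```python
from statistics import mean

# Hypothetical toy data: 1 = plays basketball, 0 = does not
play_bball = [0, 0, 0, 1, 1, 1]
height = [165.0, 170.0, 172.0, 180.0, 185.0, 187.0]

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

intercept, slope = ols_fit(play_bball, height)

# Group means computed directly from the data
mean_non_players = mean([h for p, h in zip(play_bball, height) if p == 0])
mean_players = mean([h for p, h in zip(play_bball, height) if p == 1])

print(intercept, mean_non_players)             # intercept vs. mean of group 0
print(slope, mean_players - mean_non_players)  # slope vs. difference in means
```

Each printed pair should agree up to floating-point rounding, which is the same relationship seen between the group means and results.params above.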
In the previous exercises, we used a quantitative predictor in our linear regression, but it's important to note that we can also use categorical predictors. The simplest case of a categorical predictor is a binary variable (only two categories).
For example, suppose we surveyed 100 adults and asked them to report their height in cm and whether or not they play basketball. We've coded the variable play_bball so that it is equal to $1$ if the person plays basketball and $0$ if they do not.