oldoc63 / learningDS

Learning DS with Codecademy and Books

Linear Regression at Codecademy Project #475

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

For this project, you'll get to work as a data analyst alongside the curriculum team at Codecademy to help us improve the learner experience. While this data is simulated, it is similar to real data that we might want to investigate as Codecademy team members.

oldoc63 commented 1 year ago
  1. A dataset has been loaded for you in script.py and saved as a dataframe named codecademy. We're imagining that this data was collected as part of an experiment to understand factors that contribute to learner performance on a quiz. The data contains three columns:
    • score: student score on a quiz
    • completed: the number of other content items on Codecademy that the learner has completed prior to this quiz
    • lesson: indicates which lesson the learner took directly before the quiz (Lesson A or Lesson B)

Take a look at this dataset by printing the first five rows.
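As a minimal sketch of this step: the real data is loaded in script.py and is not reproduced here, so the block below builds a hypothetical stand-in dataframe with the same three columns (the simulated values and the random seed are assumptions for illustration only) and prints its first five rows with `.head()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `codecademy` dataframe loaded in script.py.
# The values are simulated purely for illustration.
rng = np.random.default_rng(0)
completed = rng.integers(0, 50, size=100)
codecademy = pd.DataFrame({
    "score": 60 + 0.5 * completed + rng.normal(0, 5, size=100),
    "completed": completed,
    "lesson": rng.choice(["Lesson A", "Lesson B"], size=100),
})

# .head() returns the first five rows by default
first_five = codecademy.head()
print(first_five)
```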

oldoc63 commented 1 year ago

Model the relationship between quiz score and number of completed content items

  1. Plot a scatter plot of score (y-axis) against completed (x-axis) to see the relationship between quiz score and number of completed content items. Make sure to show, then clear the plot. Is there a relationship between these two variables, and does it appear to be linear?

    Use plt.scatter() to create a scatter plot. The first argument is the x-variable (codecademy.completed) and the second argument is the y-variable (codecademy.score). After calling plt.scatter(), use two lines of code to show, then clear the plot.
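The hint above might look like the following sketch. It assumes that the "two lines of code" are `plt.show()` and `plt.clf()`, and it uses a hypothetical simulated stand-in for the `codecademy` dataframe, since the real data lives in script.py.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit this line to see the plot window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `codecademy` dataframe (simulated values)
rng = np.random.default_rng(0)
completed = rng.integers(0, 50, size=100)
codecademy = pd.DataFrame({
    "score": 60 + 0.5 * completed + rng.normal(0, 5, size=100),
    "completed": completed,
})

# x-variable first, y-variable second
points = plt.scatter(codecademy.completed, codecademy.score)
plt.xlabel("completed")
plt.ylabel("score")
plt.show()  # show the plot
plt.clf()   # clear the current figure so later plots start fresh
```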

oldoc63 commented 1 year ago
  1. Create and fit a linear regression model that predicts score using completed as the predictor. Print out the regression coefficients.
oldoc63 commented 1 year ago
  1. Write a one-sentence interpretation of the slope and a one-sentence interpretation of the intercept that you printed out in the previous step. Make sure to comment out the interpretations so your code still runs.

    The intercept is the expected value of the outcome variable when the predictor variable is equal to 0. The slope is the expected difference of the outcome variable for a one unit increase in the predictor variable.

oldoc63 commented 1 year ago
  1. Plot the same scatter plot that you made earlier (with score on the y-axis and completed on the x-axis), but this time add the regression line on top of the plot. Make sure to show, then clear the plot. Do you think this line fits the data well?

    There are a few different ways to accomplish this, but one option is to use plt.plot() to create the line, using the completed column from the original data as the x-coordinates (first argument) and the predicted values of score (based on the model) as the y-coordinates (second argument).

oldoc63 commented 1 year ago
  1. Use your model to calculate the predicted quiz score for a learner who has previously completed 20 other content items.

    One option is to use the .predict() method on your fitted model and pass in a new dataset with completed = 20: newdata = {"completed": [20]}

    Another option is to use your equation of a line along with the intercept and slope you calculated when you fit the model. The formula looks something like: slope * 20 + intercept.

oldoc63 commented 1 year ago
  1. Calculate the fitted values for your model and save them as fitted_values.

    Use the .predict() method on your fitted model and pass in the data that was used to fit the model.

oldoc63 commented 1 year ago
  1. Calculate the residuals for the model and save the result as residuals.

    Subtract the fitted_values that you calculated in the previous step from the true student quiz scores (codecademy.score).

oldoc63 commented 1 year ago
  1. Check the normality assumption for linear regression by plotting a histogram of the residuals.
oldoc63 commented 1 year ago
  1. Check the homoscedasticity assumption for linear regression by plotting the residuals (y-axis) against the fitted values (x-axis). Do you see any patterns or is the homoscedasticity assumption met?

    Use plt.scatter() to create the scatter plot and pass in fitted_values as the first argument (x-variable) and residuals as the second argument (y-variable).

oldoc63 commented 1 year ago

Do learners who take lesson A or B perform better on the quiz?

  1. Let's turn our attention to the lesson column to see if learners who took different lessons scored differently on the quiz. Use sns.boxplot to create a boxplot of score (y-axis variable) for each lesson (x-variable) to see the relationship between quiz score and which lesson the learner completed immediately before taking the quiz. Make sure to show, then clear the plot. Does one lesson appear to do a better job than the other of preparing students for this quiz? If so, which one?
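A sketch of the boxplot call, using a hypothetical simulated stand-in for `codecademy`. The 5-point gap between lessons in the simulation is an arbitrary assumption and says nothing about the real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit this line to see the plot window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the `codecademy` dataframe (simulated values)
rng = np.random.default_rng(0)
completed = rng.integers(0, 50, size=100)
lesson = rng.choice(["Lesson A", "Lesson B"], size=100)
score = 60 + 0.5 * completed - 5 * (lesson == "Lesson B") + rng.normal(0, 5, size=100)
codecademy = pd.DataFrame({"score": score, "completed": completed, "lesson": lesson})

# One box per lesson, with quiz score on the y-axis
sns.boxplot(data=codecademy, x="lesson", y="score")
plt.show()
plt.clf()
```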
oldoc63 commented 1 year ago
  1. Create and fit a linear regression model that predicts score using lesson as the predictor. Print out the regression coefficients.
oldoc63 commented 1 year ago
  1. Calculate and print out the mean quiz scores for learners who took lesson A and lesson B. Calculate and print out the mean difference. Can you see how these numbers relate to the intercept and slope that you printed out in the linear regression output?

    To calculate and print the mean quiz score for learners who took lesson A, you can use the following code:

    print(np.mean(codecademy.score[codecademy.lesson == 'Lesson A']))

    You should find that the intercept from the regression output is equal to the mean score for learners who took lesson A, and the slope is equal to the mean difference.

oldoc63 commented 1 year ago
  1. You've used a simple linear model to understand how quiz scores are related to other learner actions. In this project, we've focused on modeling the relationship between quiz score and one other variable at a time (first we looked at completed, then we looked at lesson separately). The next step in linear regression is to model quiz scores as a function of multiple other variables at once. To get a preview of what this may look like visually, let's try using seaborn's lmplot() function to plot a scatter plot of score vs. completed, colored by lesson. For context, the lm in lmplot stands for "linear model". This function will automatically plot a linear regression model on top of the scatter plot. We'll include a third variable in our plot using the hue parameter (which controls the color of each point in the scatter plot). All of a sudden, we end up with multiple regression lines.
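A sketch of the lmplot() call described above, using a hypothetical simulated stand-in for `codecademy` (the lesson effect in the simulation is an arbitrary assumption). Passing `hue="lesson"` colors the points by lesson and fits one regression line per lesson.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit this line to see the plot window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the `codecademy` dataframe (simulated values)
rng = np.random.default_rng(0)
completed = rng.integers(0, 50, size=100)
lesson = rng.choice(["Lesson A", "Lesson B"], size=100)
score = 60 + 0.5 * completed - 5 * (lesson == "Lesson B") + rng.normal(0, 5, size=100)
codecademy = pd.DataFrame({"score": score, "completed": completed, "lesson": lesson})

# Scatter plot of score vs. completed with a separate fitted line for each lesson
grid = sns.lmplot(x="completed", y="score", hue="lesson", data=codecademy)
plt.show()
plt.clf()
```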