oldoc63 / learningDS

Learning DS with Codecademy and Books

Relationship between variables #400

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

Examining the relationship between variables can give us key insight into our data. We will cover ways of assessing the association between a quantitative variable and a categorical variable.

We'll explore a dataset that contains the following information about students at two Portuguese schools:

Suppose we want to know: Is a student's score (G3) associated with their school (school)? If so, then knowing what school a student attends gives us information about what their score is likely to be. For example, maybe students at one of the schools consistently score higher than students at the other school.

oldoc63 commented 1 year ago

The dataset described above has been saved for you in the workspace as a pandas DataFrame named students. Inspect the first five rows of students using the .head() method. Take a look at the other columns. Which are categorical and which are quantitative?

oldoc63 commented 1 year ago

Suppose that we want to know whether there is an association between student math scores (G3) and the student's address (urban or rural). Separate out G3 scores into two separate lists: one for students who live in an urban location ('U') and one for students who live in a rural location ('R'). Name these lists scores_urban and scores_rural.
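One possible way to do this split is sketched below. The small `students` DataFrame is a made-up stand-in for the real dataset (its values are assumptions, not the actual scores); only the column names `G3` and `address` come from the exercise.

```python
import pandas as pd

# Toy stand-in for the real `students` DataFrame (values are made up).
students = pd.DataFrame({
    "address": ["U", "U", "R", "U", "R", "R", "U"],
    "G3": [12, 10, 9, 15, 8, 11, 10],
})

# A boolean mask on `address` selects the G3 scores for each category;
# .tolist() converts the resulting Series into a plain Python list.
scores_urban = students.G3[students.address == "U"].tolist()
scores_rural = students.G3[students.address == "R"].tolist()

print(scores_urban)
print(scores_rural)
```

The same masks work without `.tolist()` if a pandas Series is acceptable instead of a list.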

oldoc63 commented 1 year ago

Mean and Median Differences

We began investigating whether or not there is an association between math scores and the school a student attends. We can begin quantifying this association by using two common summary statistics, mean and median differences. To calculate the difference in mean G3 scores for the two schools, we can start by finding the mean math score for students at each school. We can then find the difference between them:
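The original snippet is not preserved in this thread, so here is a minimal sketch of the calculation. The toy `students` DataFrame is an assumption with made-up values, so its numbers will not match the results quoted for the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real `students` DataFrame (values are made up).
students = pd.DataFrame({
    "school": ["GP", "GP", "GP", "GP", "MS", "MS", "MS"],
    "G3": [12, 10, 9, 15, 8, 11, 10],
})

# Separate the scores for each school.
scores_GP = students.G3[students.school == "GP"]
scores_MS = students.G3[students.school == "MS"]

# Mean score per school, then the difference between the two means.
mean_GP = np.mean(scores_GP)
mean_MS = np.mean(scores_MS)
mean_diff = mean_GP - mean_MS

print(mean_GP, mean_MS, mean_diff)
```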

oldoc63 commented 1 year ago

We see that the mean math score for students at GP is 10.49, while the mean score for students at MS is 9.85. The mean difference is 0.64. We can follow a similar process to calculate a median difference:
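A median difference can be sketched the same way. This version uses `groupby` as one idiomatic alternative to building a mask per school; the toy DataFrame is again a made-up assumption, not the real data.

```python
import pandas as pd

# Toy stand-in for the real `students` DataFrame (values are made up).
students = pd.DataFrame({
    "school": ["GP", "GP", "GP", "GP", "MS", "MS", "MS"],
    "G3": [12, 10, 9, 15, 8, 11, 10],
})

# Median G3 score for each school, indexed by school name.
medians = students.groupby("school")["G3"].median()

# Difference between the two group medians.
median_diff = medians["GP"] - medians["MS"]

print(medians)
print(median_diff)
```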

oldoc63 commented 1 year ago

GP students also have a higher median score, by one point. Highly associated variables tend to have a large mean or median difference.

oldoc63 commented 1 year ago

Side by Side Box Plots

The difference in mean math scores for students at GP and MS was 0.64. How do we know whether this difference is considered small or large? To answer this question we need to know something about the spread of the data.

One way to get a better sense of spread is by looking at a visual representation of the data. Side by side box plots are useful in visualizing mean and median differences because they allow us to visually estimate the variation in the data. They can help us to determine if mean or median differences are "large" or "small".

Let's take a look at side by side boxplots of math scores at each school:
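The plotting code from the lesson is not preserved here, so the sketch below uses plain matplotlib, which is an assumption about tooling rather than the lesson's own choice. The toy data is made up, and the `Agg` backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the real `students` DataFrame (values are made up).
students = pd.DataFrame({
    "school": ["GP", "GP", "GP", "GP", "GP", "MS", "MS", "MS", "MS"],
    "G3": [12, 10, 9, 15, 11, 8, 11, 10, 13],
})

# One list of scores per school, in a fixed order.
schools = ["GP", "MS"]
scores_by_school = [students.G3[students.school == s].tolist() for s in schools]

# Draw one box per school on a shared y-axis so the boxes can be compared.
fig, ax = plt.subplots()
ax.boxplot(scores_by_school)
ax.set_xticklabels(schools)
ax.set_xlabel("school")
ax.set_ylabel("G3")

# Call fig.savefig("boxplots.png") or plt.show() to view the result.
```

Each box spans the middle 50% of one group's scores, so the amount of vertical overlap between the two boxes gives a rough sense of whether the difference is large relative to the spread.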

oldoc63 commented 1 year ago

Looking at the plot, we can clearly see that there is a lot of overlap between the boxes (i.e., the middle 50% of the data). Therefore, we can be more confident that there is not much difference between the math scores of the two groups.

oldoc63 commented 1 year ago

Generate side by side boxplots for students' scores (G3) by address. Is there any overlap between the boxes? Do you think the variables are associated?

oldoc63 commented 1 year ago

Inspecting Overlapping Histograms

Another way to explore the relationship between a quantitative and a categorical variable in more detail is by inspecting overlapping histograms. In the code below, setting alpha=0.5 makes the histograms see-through enough that we can see both of them at once. We have also used density=True to make sure that the y-axis is a density rather than a frequency (note: older versions of matplotlib called this parameter normed; it has since been renamed density):
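A minimal sketch of such a plot follows. The two score lists are made-up toy values with deliberately unequal sizes, and the colors are arbitrary choices; the `Agg` backend line only makes the example runnable without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window
import matplotlib.pyplot as plt

# Toy score lists (made up); the groups have different sizes on purpose.
scores_GP = [12, 10, 9, 15, 11, 13, 8, 14]
scores_MS = [8, 11, 10]

fig, ax = plt.subplots()

# density=True rescales each histogram so its bars have a total area of 1,
# which keeps groups of different sizes comparable.
# alpha=0.5 makes the bars semi-transparent so both histograms stay visible.
n_GP, bins_GP, _ = ax.hist(scores_GP, color="blue", label="GP",
                           density=True, alpha=0.5)
n_MS, bins_MS, _ = ax.hist(scores_MS, color="red", label="MS",
                           density=True, alpha=0.5)
ax.legend()

# Call fig.savefig("histograms.png") or plt.show() to view the result.
```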

oldoc63 commented 1 year ago

By inspecting this histogram, we can clearly see that the entire distribution of scores at GP (not just the mean or median) appears slightly shifted to the right, compared to the scores at MS. However, there is also still a lot of overlap between the scores, suggesting that the association is relatively weak. Note that there are only 46 students at MS, but there are 349 students at GP. If we hadn't used density=True, a fair comparison between these two populations would be impossible.

Image

oldoc63 commented 1 year ago

While overlapping histograms and side by side boxplots can convey similar information, histograms give us more detail and can be useful in spotting patterns that were not visible in a box plot (e.g., a bimodal distribution). For example, the following set of box plots and overlapping histograms illustrates the same hypothetical data:

Image

oldoc63 commented 1 year ago

While the box plots and means/medians appear similar, the overlapping histograms illuminate the differences between these two distributions of scores.

oldoc63 commented 1 year ago

Your lists from the previous exercise (scores_urban and scores_rural) have been created for you in script.py. Use them to create an overlaid histogram of scores for students who live in urban and rural areas. Remember to use different colors for each histogram, set density=True, alpha=0.5, and use the labels 'Urban' and 'Rural', respectively.

oldoc63 commented 1 year ago

twoFeatures.pdf

oldoc63 commented 1 year ago

Exploring Non-Binary Categorical Variables

In each of the previous exercises, we assessed whether there was an association between a quantitative variable (math scores) and a BINARY categorical variable (school). The categorical variable is considered binary because there are only two available options, either MS or GP. However, sometimes we are interested in an association between a quantitative variable and a non-binary categorical variable. Non-binary categorical variables have more than two categories.

When looking at an association between a quantitative variable and a non-binary categorical variable, we must examine all pairwise differences. For example, suppose we want to know whether or not an association exists between math scores (G3) and (Mjob), a categorical variable representing the mother's job. This variable has five possible categories: at_home, health, services, teacher, or other. There are actually 10 different comparisons that we can make, one for each pair of categories (5 × 4 / 2 = 10). For example, we can compare scores for students whose mothers work at home or in health; at_home or other; etc. The easiest way to quickly visualize these comparisons is with side-by-side box plots:
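The sketch below first enumerates the pairwise comparisons and then draws one box per category. It uses plain matplotlib and a made-up toy DataFrame, both of which are assumptions, since the lesson's own code is not preserved in this thread.

```python
from itertools import combinations

import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the real `students` DataFrame (values are made up).
students = pd.DataFrame({
    "Mjob": ["at_home", "health", "services", "teacher", "other",
             "at_home", "health", "services", "teacher", "other"],
    "G3": [9, 14, 11, 12, 10, 8, 13, 10, 11, 9],
})

jobs = ["at_home", "health", "services", "teacher", "other"]

# Every unordered pair of categories is one possible comparison.
pairs = list(combinations(jobs, 2))
print(len(pairs))  # 10

# One list of scores per job category, in the same order as `jobs`.
scores_by_job = [students.G3[students.Mjob == job].tolist() for job in jobs]

# Draw the five boxes side by side on a shared y-axis.
fig, ax = plt.subplots()
ax.boxplot(scores_by_job)
ax.set_xticklabels(jobs)
ax.set_xlabel("Mjob")
ax.set_ylabel("G3")

# Call fig.savefig("boxplots_mjob.png") or plt.show() to view the result.
```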

oldoc63 commented 1 year ago

Visually, we need to compare each box to every other box. While most of these boxes overlap with each other, there are some pairs with apparent differences. For example, scores appear to be higher among students with mothers working in health than among students with mothers working at home or in an 'other' job. If there are any pairwise differences, we can say that the variables are associated; however, it is more useful to specifically report which groups are different.

oldoc63 commented 1 year ago

Create a side-by-side boxplot to assess whether there is an association between students' math score (G3) and their fathers' job (Fjob). Do you think there is an association between these variables? For which pairs of groups do you see differences?

oldoc63 commented 1 year ago

Review

We use summary statistics and data visualization tools to examine an association between a quantitative and a categorical variable. More specifically:

After calculating a mean or median difference and visually comparing distributions, the next step might be to run a hypothesis test to look for evidence of population-level differences (will a similar difference in scores be observed for ALL students who ever attend these schools?). Now that you know how to investigate whether variables are associated, you can use these techniques to explore associations on more datasets.