oldoc63 / learningDS

Learning DS with Codecademy and Books

Correlation #406

Open oldoc63 opened 2 years ago

oldoc63 commented 2 years ago

Pearson correlation (often referred to simply as "correlation") is a scaled form of covariance. Like covariance, it measures the strength of a linear relationship, but it ranges from -1 to +1, which makes it easier to interpret.

Highly associated variables with a positive linear relationship will have a correlation close to 1. Highly associated variables with a negative linear relationship will have a correlation close to -1. Variables that do not have a linear association (or a linear association with a slope of zero) will have correlations close to 0.
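To make the "scaled form of covariance" idea concrete, here is a minimal sketch that divides the sample covariance by the product of the two sample standard deviations. The data and the helper name `correlation` are made up for illustration; they are not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)  # made-up data with a positive linear trend


def correlation(a, b):
    """Sample covariance divided by the product of the sample standard deviations."""
    cov_ab = np.cov(a, b)[0, 1]
    return cov_ab / (np.std(a, ddof=1) * np.std(b, ddof=1))


# Rescaling x (for example, changing its units) multiplies the covariance
# by the same factor but leaves the correlation unchanged.
cov_original = np.cov(x, y)[0, 1]
cov_rescaled = np.cov(100 * x, y)[0, 1]
r_original = correlation(x, y)
r_rescaled = correlation(100 * x, y)

print("covariance: ", cov_original, "->", cov_rescaled)
print("correlation:", r_original, "->", r_rescaled)
```

The covariance changes with the units of the data, while the correlation stays the same, which is why correlation is the easier number to compare across pairs of variables.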

Image

oldoc63 commented 2 years ago

The pearsonr() function from scipy.stats can be used to calculate correlation. It returns the correlation coefficient along with a p-value.
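A minimal sketch of the call, using two short made-up arrays (the values are placeholders, not data from the lesson):

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up values that rise together almost linearly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# pearsonr returns the correlation coefficient first and a p-value second.
corr, p_value = pearsonr(x, y)
print(corr)
```

Only the first returned value is the correlation itself; the p-value can be ignored when the goal is just to describe the strength of the linear association.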

oldoc63 commented 2 years ago

Generally, a correlation larger than about 0.3 in absolute value indicates a linear association. A correlation larger than about 0.6 in absolute value suggests a strong linear association.
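As a rough sketch, the rule of thumb above could be written as a small helper. The function name `describe_strength` and its labels are hypothetical, and the cutoffs are only approximate guides, not hard boundaries.

```python
def describe_strength(r: float) -> str:
    """Label a correlation using the approximate 0.3 / 0.6 rule of thumb."""
    size = abs(r)  # the sign gives the direction, the magnitude gives the strength
    if size > 0.6:
        return "strong linear association"
    if size > 0.3:
        return "linear association"
    return "little or no linear association"


for r in (0.85, -0.45, 0.1):
    print(r, "->", describe_strength(r))
```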

oldoc63 commented 2 years ago

Use the pearsonr function from scipy.stats to calculate the correlation between sqfeet and beds. Store the result in a variable named corr_sqfeet_beds and print out the result. How strong is the linear association between these variables?

oldoc63 commented 2 years ago

It's important to note that there are some limitations to using correlation or covariance as a way of assessing whether there is an association between two variables. Because correlation and covariance both measure the strength of linear relationships with non-zero slopes, but no other kinds of relationships, correlation can be misleading.

The four scatter plots below all show pairs of variables with near-zero correlations. The bottom left image shows an example of a perfect linear association where the slope is zero (the line is horizontal). Meanwhile, the other three plots show non-linear relationships: if we drew a line through any of these sets of points, that line would need to be curved, not straight!

Image
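The same effect can be reproduced numerically. This sketch uses two shapes chosen for illustration, a parabola and a circle; they are assumptions and may not match the shapes in the plots above.

```python
import numpy as np

# A parabola on a range that is symmetric around zero.
x = np.linspace(-3, 3, 101)
y_parabola = x ** 2

# Points spaced evenly around a circle.
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
circle_x, circle_y = np.cos(t), np.sin(t)

# In both cases y is closely tied to x, yet the correlation is essentially zero.
r_parabola = np.corrcoef(x, y_parabola)[0, 1]
r_circle = np.corrcoef(circle_x, circle_y)[0, 1]

print("parabola:", r_parabola)
print("circle:  ", r_circle)
```

Both values come out at zero up to floating-point error, even though knowing x tells you a great deal about y in each case.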

oldoc63 commented 2 years ago

A simulated dataset named sleep has been loaded. The hypothetical data contains two columns: hours_sleep and performance.

Create a scatter plot of hours_sleep (on the x axis) and performance (on the y axis).
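A possible sketch with matplotlib follows. Since the sleep data is not included in this thread, the arrays below are simulated stand-ins with an assumed curved shape, and `plot_sleep` is a hypothetical helper name. In the exercise you would pass the two columns of sleep instead.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_sleep(hours_sleep, performance):
    """Scatter plot with hours_sleep on the x axis and performance on the y axis."""
    fig, ax = plt.subplots()
    ax.scatter(hours_sleep, performance)
    ax.set_xlabel("hours_sleep")
    ax.set_ylabel("performance")
    return ax


# Stand-in data (an assumption): performance peaks at a moderate amount of sleep.
rng = np.random.default_rng(1)
hours = rng.uniform(2, 11, size=300)
perf = -(hours - 7) ** 2 + rng.normal(scale=1.0, size=300)

ax = plot_sleep(hours, perf)
# Call plt.show() here to display the figure when running interactively.
```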

oldoc63 commented 2 years ago

Calculate the correlation for hours_sleep and performance and save the result as corr_sleep_performance. Print it out. Does the correlation accurately reflect the strength of the relationship between these variables?

The correlation is only 0.28 (a relatively small correlation), even though the variables seem to be clearly associated (there is a very clear pattern in the scatter plot).
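The sketch below shows the same effect on simulated stand-in data, because the actual sleep dataset is not included here. The curved shape and the peak at 7 hours are assumptions, so the printed value will not match 0.28 exactly.

```python
import numpy as np
from scipy.stats import pearsonr

# Stand-in data (an assumption): performance rises, peaks, then falls with more sleep.
rng = np.random.default_rng(1)
hours_sleep = rng.uniform(2, 11, size=300)
performance = -(hours_sleep - 7) ** 2 + rng.normal(scale=1.0, size=300)

# Correlation between the raw variables: weak, because the pattern is curved.
corr_sleep_performance, p_value = pearsonr(hours_sleep, performance)
print(corr_sleep_performance)

# Correlation with the curved term itself: close to 1 on this simulated data,
# which shows the variables are strongly related, just not linearly.
corr_curve, _ = pearsonr(-(hours_sleep - 7) ** 2, performance)
print(corr_curve)
```

The gap between the two numbers is the point: a modest Pearson correlation says the relationship is not well described by a straight line, not that the relationship is weak.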