oldoc63 / learningDS

Learning DS with Codecademy and Books

The Chi-Square Statistic #413

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

In the previous exercise, we calculated a contingency table of expected frequencies if there were no association between the leader and influence questions. We then compared this to the observed contingency table. Because the tables looked somewhat different, we concluded that responses to these questions are probably associated.

While we can inspect these values visually, many data scientists use the Chi-Square statistic to summarize how different these two tables are. To calculate the Chi-Square statistic, we find the squared difference between each value in the observed table and its corresponding value in the expected table, divide that number by the expected value, and then add up the results:

$$ \text{ChiSquare} = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} $$
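As a small worked sketch of this calculation, the snippet below computes the statistic by hand with NumPy: each cell's squared difference is divided by its expected count, and the results are summed. The 2x2 counts are hypothetical and are not the survey data from the lesson; the expected table is derived from the row and column totals under the assumption of no association.

```python
import numpy as np

# Hypothetical 2x2 table of observed counts (NOT the lesson's data).
observed = np.array([[30, 10],
                     [20, 40]])

# Expected counts if there were no association:
# (row total * column total) / grand total, for every cell.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Squared difference per cell, scaled by the expected count, then summed.
chi_square = ((observed - expected) ** 2 / expected).sum()
print(round(chi_square, 2))
```

Dividing by the expected count keeps cells with large expected frequencies from dominating the total simply because their raw differences are larger.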

The Chi-Square statistic is also the first output of the SciPy function chi2_contingency():
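A minimal sketch of that call, assuming a hypothetical table of counts rather than the lesson's data. `chi2_contingency()` returns the statistic, a p-value, the degrees of freedom, and the expected table, in that order. Note that for 2x2 tables SciPy applies Yates' continuity correction by default, so the first output only matches a hand calculation when `correction=False` is passed.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of observed counts (NOT the lesson's data).
observed = np.array([[30, 10],
                     [20, 40]])

# Default call: Yates' continuity correction is applied for 2x2 tables.
chi2_default, pval_default, dof, expected = chi2_contingency(observed)

# Uncorrected call: matches the plain sum of (observed - expected)^2 / expected.
chi2_plain, pval_plain, _, _ = chi2_contingency(observed, correction=False)

print(chi2_default, chi2_plain, dof)
```

The corrected statistic is slightly smaller than the uncorrected one; with large counts the difference is usually negligible.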

oldoc63 commented 1 year ago

The interpretation of the Chi-Square statistic depends on the size of the contingency table. For a 2x2 table (like the one we've been investigating), a Chi-Square statistic larger than around 4 would strongly suggest an association between the variables (the 5% critical value for one degree of freedom is about 3.84). In this example, our Chi-Square statistic is much larger than that: 1307.88! This adds to our evidence that the variables are highly associated.
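To illustrate why the threshold depends on table size, the sketch below looks up the 5% critical value of the chi-square distribution for a given table shape. The helper name `critical_value` and the 5% level are assumptions chosen for illustration; the degrees of freedom for an r x c table are (r - 1)(c - 1).

```python
from scipy import stats

def critical_value(n_rows, n_cols, alpha=0.05):
    """Chi-square value exceeded with probability alpha under no association.

    Hypothetical helper for illustration; not part of the lesson.
    """
    dof = (n_rows - 1) * (n_cols - 1)
    # ppf is the inverse CDF of the chi-square distribution.
    return stats.chi2.ppf(1 - alpha, dof)

print(round(critical_value(2, 2), 2))  # threshold for a 2x2 table
print(round(critical_value(3, 3), 2))  # larger tables need a larger statistic
```

A statistic of 1307.88 on a 2x2 table is far beyond the roughly 3.84 threshold, which is why it counts as strong evidence of association.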

oldoc63 commented 1 year ago

Use the chi2_contingency() function to calculate the Chi-Square statistic for the special_authority_freq table. Save the result as chi2 and print it out. Do these variables appear to be associated?

oldoc63 commented 1 year ago

Review

We used a few different methods to assess whether there is an association between two categorical variables. Although we used binary variables (only 2 options per category), the same techniques can be used for non-binary categorical variables. The methods we used included: