oldoc63 / learningDS

Learning DS with Codecademy and Books

Expected Contingency Tables #412

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

We calculated the marginal proportions for the leader and influence questions. To understand whether these questions are associated, we can use the marginal proportions to build a contingency table of the proportions we would expect if there were no association between the variables. Under no association, each cell's expected proportion is the product of the marginal proportions for its row and column:

                  leader = no             leader = yes
influence = no    0.484*0.388 = 0.188     0.516*0.388 = 0.200
influence = yes   0.484*0.612 = 0.296     0.516*0.612 = 0.315
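
As a minimal sketch of the multiplication above, the whole table can be produced in one step with an outer product of the two sets of marginal proportions. The variable names are illustrative, and the inputs are the rounded proportions quoted in this thread, so the last digit of a cell can differ slightly from the table (the bottom-right cell comes out as 0.316 here).

```python
import numpy as np

# Marginal proportions quoted above (rounded to three decimals).
influence_marginals = np.array([0.388, 0.612])  # influence = no, yes
leader_marginals = np.array([0.484, 0.516])     # leader = no, yes

# Under no association, each cell is (row marginal) * (column marginal).
expected_props = np.outer(influence_marginals, leader_marginals)

print(np.round(expected_props, 3))
```

Rows correspond to the influence answers and columns to the leader answers, matching the layout of the table.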
oldoc63 commented 1 year ago

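These proportions can then be converted to frequencies by multiplying each one by the sample size (11097 for this data). The values shown are rounded for display, while the frequencies appear to be computed from the unrounded proportions, so redoing the arithmetic on the displayed values can be off by a few counts (for example, 0.188*11097 is about 2086, not 2087):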

                  leader = no              leader = yes
influence = no    0.188*11097 = 2087       0.200*11097 = 2221
influence = yes   0.296*11097 = 3288       0.315*11097 = 3501

This table tells us that if there were no association between the leader and influence questions, we would expect 2087 people to answer no to both.
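
A minimal sketch of this conversion, with illustrative variable names. Because it starts from the rounded proportions shown in the thread, the results land within a few counts of the frequency table rather than matching it exactly.

```python
import numpy as np

n = 11097  # sample size stated above

# Expected proportions as displayed in the thread (rounded values).
expected_props = np.array([
    [0.188, 0.200],  # influence = no
    [0.296, 0.315],  # influence = yes
])

# Scale proportions to expected counts.
expected_freq = expected_props * n

print(np.round(expected_freq))
```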

oldoc63 commented 1 year ago

In Python, we can calculate this table with the chi2_contingency() function from SciPy (scipy.stats) by passing in the observed frequency table. The function returns four outputs: the chi-square statistic, the p-value, the degrees of freedom, and the table of expected frequencies. For now, we'll only look at the fourth one.
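
The call might look like the sketch below. The observed table is not shown in this thread, so the counts here are reconstructed from the expected table, the sample size, and the observed value of 3015 quoted in a later comment. Treat them, and the variable name `influence_leader_freq`, as assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts (assumed: reconstructed from figures in this thread).
# Rows: influence = no, yes. Columns: leader = no, yes.
influence_leader_freq = np.array([
    [3015, 1293],
    [2360, 4429],
])

# Four outputs: test statistic, p-value, degrees of freedom, expected table.
chi2, pval, dof, expected = chi2_contingency(influence_leader_freq)

# Round to whole numbers for easier comparison with the observed table.
print(np.round(expected))
```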

oldoc63 commented 1 year ago

Note that the SciPy function returned the same expected frequencies as we calculated "by hand" above! Now that we have the expected contingency table if there's no association, we can compare it to our observed contingency table. Use np.round() to print out the expected contingency table, with values rounded to the nearest whole number. Compare this to the observed frequency table. How much do the numbers in these tables differ?

oldoc63 commented 1 year ago

The more the observed table differs from the expected table, the stronger the evidence that the variables are associated. In this example, we see some large differences (e.g., 3015 in the observed table compared to 2087 in the expected table for the same cell). This provides additional evidence that these variables are associated.
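
One simple way to see the gap is to subtract the expected table from the observed one. This sketch is a by-hand comparison, not part of the SciPy output. The observed counts other than 3015 are assumed, reconstructed from the row and column totals implied by the expected table.

```python
import numpy as np

# Observed counts (assumed except 3015) and expected counts from the thread.
# Rows: influence = no, yes. Columns: leader = no, yes.
observed = np.array([[3015, 1293], [2360, 4429]])
expected = np.array([[2087, 2221], [3288, 3501]])

# Positive values mean more people than expected under no association.
difference = observed - expected

print(difference)
```

In a 2x2 table with fixed row and column totals, every cell differs by the same absolute amount and only the sign changes, so a single cell is enough to gauge the size of the gap.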

oldoc63 commented 1 year ago

The contingency table of frequencies for the special and authority questions is saved in the special_authority_freq variable. Use the chi2_contingency() function to calculate the expected frequency table for these two questions if there were no association. Save the result as expected.

oldoc63 commented 1 year ago

np.round() was used to print out the expected contingency table, with values rounded to the nearest whole number. Compare this to the observed frequency table. How much do the numbers in these tables differ?