moderndive / ModernDive_book

Statistical Inference via Data Science: A ModernDive into R and the Tidyverse
https://www.moderndive.com/

Learning Check 9.1 - truly no solution? #392

Closed wjhopper closed 3 years ago

wjhopper commented 4 years ago

I'll be embarrassed if I'm wrong, but I don't think it's possible to answer Learning Check 9.1 (or at least, it's not possible with infer and there isn't really a meaningful answer in general).

Learning Check 9.1 asks you to "Conduct the same hypothesis test and confidence interval analysis comparing male and female promotion rates using the median rating instead of the mean rating. What was different and what was the same?"

So, I simply changed the requested statistic in calculate() to "diff in medians" instead of "diff in means", and hit the run button:

library(infer)
library(moderndive)

null_distribution <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in medians", order = c("male", "female"))

which gives

Error: The response variable of `decision` is not appropriate
since 'diff in medians' is expecting the response variable to be numeric.

On reflection, that seems like a sensible error. Even if calculate() were set up to take the median of a binary variable, the median would almost always be 0 or 1 (it would be 0.5 only with an even sample size and an equal number of observations in each category), so the null distribution of this statistic would not be particularly useful to us. I also noticed there was no solution for 9.1 in the appendix, which made me think I was not crazy.
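A quick base R sketch of the point about medians, using made-up 0/1 vectors rather than the real promotions data (the vector names and counts here are hypothetical). It shows that the median of a binary variable lands on 0, 1, or occasionally 0.5, so a permuted difference in medians can take only a handful of values:

```r
# Hypothetical 0/1 coding of a binary outcome (1 = "promoted", 0 = "not").
# These vectors are made up for illustration; they are not the promotions data.
mostly_promoted <- c(1, 1, 1, 0, 1, 0, 1)  # odd n, majority 1
mostly_rejected <- c(0, 0, 1, 0, 1, 0, 0)  # odd n, majority 0
evenly_split    <- c(1, 0, 1, 0, 1, 0)     # even n, exactly half 1

median(mostly_promoted)  # 1
median(mostly_rejected)  # 0
median(evenly_split)     # 0.5

# Permute a made-up binary response across two equal-sized groups and record
# the difference in medians each time.
set.seed(2020)
decision <- rep(c(1, 0), times = c(30, 18))
group    <- rep(c("a", "b"), each = 24)
perm_diffs <- replicate(1000, {
  shuffled <- sample(decision)
  median(shuffled[group == "a"]) - median(shuffled[group == "b"])
})

# Every permuted difference is one of -1, -0.5, 0, 0.5, or 1.
sort(unique(perm_diffs))
```

With so few possible values, a histogram of this "null distribution" tells you very little, which is consistent with infer refusing the request outright.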

I think maybe this question got mixed up with Learning Check 9.9, which asks you to "Conduct the same analysis comparing action movies versus romantic movies using the median rating instead of the mean rating. What was different and what was the same?" In this case, ratings is a numeric variable (or at least, we can abuse it like one :laughing:), so "diff in medians" is a sensible statistic to build a null distribution for.
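For comparison, the Learning Check 9.9 version might look like the sketch below. It assumes the movies_sample data frame that ships with moderndive, with a numeric rating column and a genre column taking the values "Action" and "Romance"; none of those names are stated in this thread, so treat them as assumptions and check them against your installed version.

```r
library(infer)
library(moderndive)

# Assumption: movies_sample has a numeric `rating` and a two-level `genre`.
null_distribution_movies <- movies_sample %>%
  specify(formula = rating ~ genre) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  # A numeric response is what "diff in medians" expects.
  calculate(stat = "diff in medians", order = c("Action", "Romance"))

null_distribution_movies
```

Here the response is numeric, so the same pipeline that errors on decision ~ gender should run through to a table of 1000 permuted differences.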

ismayc commented 4 years ago

Hey @wjhopper. This question is in fact impossible and is a remnant from before we had this proportion example in there. In fact, you'll get the same error as the one you have shown if you try to use "diff in means" instead of "diff in medians". Though now you have me thinking that, since the mean of a binary variable is just the rate at which a success occurs, maybe that error shouldn't be there and could be tweaked... Nah, I think it's better to just have this be a "diff in props" problem.

library(infer)
library(moderndive)

null_distribution_mean <- promotions %>%
  specify(formula = decision ~ gender, success = "promoted") %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in means", order = c("male", "female"))
#> Error: The response variable of `decision` is not appropriate
#> since 'diff in means' is expecting the response variable to be numeric.

Created on 2020-07-15 by the reprex package (v0.3.0)
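The remark that the mean of a binary variable is just the success rate can be checked in base R. This is a minimal sketch on a made-up data frame (toy, promoted01, and the counts are all hypothetical, not the promotions data): once the response is coded 0/1, the difference in means and the difference in proportions are the same number.

```r
# Hypothetical stand-in for a binary response; not the real promotions data.
toy <- data.frame(
  gender   = rep(c("male", "female"), each = 10),
  decision = c(rep(c("promoted", "not"), times = c(8, 2)),
               rep(c("promoted", "not"), times = c(5, 5)))
)

# Proportion promoted in each group, computed as count / n.
prop_promoted  <- function(x) sum(x == "promoted") / length(x)
prop_by_gender <- tapply(toy$decision, toy$gender, prop_promoted)

# Mean of the 0/1 indicator in each group.
toy$promoted01 <- as.numeric(toy$decision == "promoted")
mean_by_gender <- tapply(toy$promoted01, toy$gender, mean)

diff_in_props <- prop_by_gender[["male"]] - prop_by_gender[["female"]]
diff_in_means <- mean_by_gender[["male"]] - mean_by_gender[["female"]]

all.equal(diff_in_props, diff_in_means)  # TRUE
```

So nothing is lost by treating this as a "diff in props" problem: it is the same quantity, stated in the form infer accepts for a categorical response.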

I'll change this question around to be a "challenge" question to have people explain why "diff in medians" or "diff in means" won't work for this problem if @rudeboybert approves.

rudeboybert commented 4 years ago

Having this be a challenge question sounds good to me

rudeboybert commented 3 years ago

Fixed in #397