Specifying the direction of the difference in a two-proportion test (theoretical normal model)

VectorPosse commented 2 years ago

Feature

In a two-proportion test, using a theoretical normal distribution, there seems to be no way to specify the order of the factors for the explanatory variable (i.e., the direction of the difference).

Here is the sample code from the tutorial:

null_dist_theory <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  assume("z")

It's not clear in the code whether the z score obtained would be for males - females or females - males.

simonpcouch commented 2 years ago

Thanks for the issue!

I’ve been away from thinking about this package regularly for a good bit, so would especially welcome others’ thoughts here.

My first impression:

library(infer)

You’re correct that, to define the theoretical null distribution, we don’t have an interface for supplying an order argument. To carry out the test, though, which in this package is loosely defined by the juxtaposition of a distribution with a test statistic, the user will get a warning if they don’t supply the order.

To calculate that test statistic:

# without order
z_hat <- gss %>% 
   specify(college ~ sex, success = "no degree") %>%
   hypothesize(null = "independence") %>%
   calculate(stat = "z")
#> Warning: The statistic is based on a difference or ratio; by default, for
#> difference-based statistics, the explanatory variable is subtracted in the order
#> "male" - "female", or divided in the order "male" / "female" for ratio-based
#> statistics. To specify this order yourself, supply `order = c("male", "female")`
#> to the calculate() function.

# with order
z_hat <- gss %>% 
   specify(college ~ sex, success = "no degree") %>%
   hypothesize(null = "independence") %>%
   calculate(stat = "z", order = c("female", "male"))

So the order is indeed made explicit in the testing pipeline.

I’m not sure that we otherwise need an interface to supply order here, as the portion of the distribution that assume() specifies doesn’t depend on order with null = "independence".

i.e. we should probably write, in the documentation you’ve pointed to:

# this way
null_dist_theory <- gss %>%
   specify(college ~ sex, success = "no degree") %>%
   hypothesize(null = "independence") %>%
   assume("z")

# rather than this way
null_dist_theory <- gss %>%
   specify(college ~ sex, success = "no degree") %>%
   assume("z")

…to make it explicit that the null hypothesis is that e.g. college and sex are independent, so p₁ − p₂ = 0 = p₂ − p₁. The same isn’t necessarily true for p̂₁ and p̂₂, though, and the order for their subtraction is specified where it’s needed.

Created on 2022-09-19 by the reprex package (v2.0.1)
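
To illustrate the point about the observed statistic, here is a minimal sketch, assuming a recent version of infer and its bundled gss data. It computes the z statistic under both orders; the names z_fm and z_mf are illustrative only and are not from the thread. Because the standard error is the same for either order, the two statistics should differ only in sign.

```r
library(infer)

# Observed z statistic for the difference female - male
z_fm <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "z", order = c("female", "male"))

# Same statistic with the order reversed: male - female
z_mf <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "z", order = c("male", "female"))

# Check that reversing the order only flips the sign of the statistic
all.equal(z_fm$stat, -z_mf$stat)
```

The final comparison is expected to be TRUE, which is why the theoretical null distribution itself does not need an order: only the observed statistic carries the direction.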

VectorPosse commented 2 years ago

Those are all good points. I was concerned about students interpreting one-sided P-values. But you correctly point out that when they calculate the test statistic, they specified an order, and the P-value will reflect that choice through the obs_stat argument.

Thanks!
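
For readers concerned with one-sided P-values, the following sketch shows how the order chosen in calculate() pairs with the direction argument of get_p_value(). It assumes a version of infer that provides assume() (1.0.0 or later) and the bundled gss data; the object names p_fm and p_mf are hypothetical. The same alternative hypothesis can be written as "female - male is greater than 0" or "male - female is less than 0", and both should give the same P-value.

```r
library(infer)

# Theoretical null distribution; it does not depend on the order
null_dist_theory <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  assume("z")

# Observed statistic as female - male, tested against "greater"
z_fm <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "z", order = c("female", "male"))
p_fm <- get_p_value(null_dist_theory, obs_stat = z_fm, direction = "greater")

# Observed statistic as male - female, tested against "less"
z_mf <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "z", order = c("male", "female"))
p_mf <- get_p_value(null_dist_theory, obs_stat = z_mf, direction = "less")

# Both formulations describe the same one-sided test
all.equal(p_fm$p_value, p_mf$p_value)
```

In practice, this means students need to keep the order in calculate() and the direction in get_p_value() consistent with the alternative hypothesis they have in mind.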


simonpcouch commented 2 years ago

Awesome! Will go ahead and close this, then. Thanks again for the issue. 🙂

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

tidymodels / infer