Open andrewpbray opened 1 year ago
I dig it! If folks would find this pedagogically useful, I think this is surely within scope and would have a low maintenance burden. :)
I think I can see the value, but I'm having a rough time picturing what procedures would look like based on @andrewpbray's description.
@andrewpbray -- Could you write up a couple of examples as though `rep_shuffle_col()` existed? Also, I think the name would need to be something else -- `shuffle` is to `sample` and `slice` is to `col` here (though obviously `row` would have been better).
`rep_slice_sample()` vs. `rep_col_shuffle()`
`rep_slice_sample()` vs. `rep_mutate_shuffle()` -- I don't love this at all, but it seems closer to parity

Here's an example of a permutation test using a difference in means, starting with the existing implementation from the full pipeline examples docs.
library(infer)

# existing implementation
null_dist <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))

# new approach (to get through the generate step)
gss %>%
  rep_col_shuffle(age, reps = 1000)
where the output of the second pipeline would be a data frame with `nrow(gss) * reps` rows and `ncol(gss) + 1` columns, the new column being `replicate`. In that data frame, `age` will now be `sample(age)`.
The syntax would be the same for a permutation test for a difference in proportions, the coefficient of a linear model, etc.
If we did a close port of `rep_slice_sample()`, then that output data frame wouldn't have any of the metadata normally appended by `specify()` and `hypothesize()` that is used by `calculate()`, so the user would have to use `dplyr` to `group_by(replicate)` and calculate their statistics. I think that's ok.
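To make the shape of that workflow concrete, here is a rough sketch. `rep_col_shuffle()` does not exist in infer; the version below is a hypothetical stand-in written only for illustration (it is not the package's internal permutation code), and it assumes `dplyr` is installed alongside infer. It returns an ungrouped data frame, so the `group_by(replicate)` step is left to the user as described above.

```r
library(infer)  # provides the gss data and re-exports the pipe
library(dplyr)

# Hypothetical helper, NOT part of infer: stack `reps` copies of `.data`,
# permuting one column independently within each copy.
rep_col_shuffle <- function(.data, col, reps = 1) {
  col_name <- deparse(substitute(col))  # bare column name, e.g. age
  copies <- lapply(seq_len(reps), function(i) {
    copy <- .data
    copy[[col_name]] <- sample(copy[[col_name]])  # shuffle this column only
    copy$replicate <- i                           # label the copy
    copy
  })
  bind_rows(copies)
}

set.seed(1)
shuffled <- gss %>%
  rep_col_shuffle(age, reps = 1000)

# No specify()/hypothesize() metadata, so compute the statistic with dplyr:
# one difference in means (degree minus no degree) per replicate.
null_dist <- shuffled %>%
  group_by(replicate) %>%
  summarize(
    stat = mean(age[college == "degree"]) - mean(age[college == "no degree"])
  )
```

The result has one row per replicate, which is the same shape `calculate()` produces in the existing pipeline, so it could be plotted or compared with an observed statistic in the usual way.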
This semester I've been seeing how far I can get in terms of simulation-based inference without using the main part of the infer package. `rep_slice_sample()` is all you need to do bootstrapping (and it's also very handy for simulation). I'm curious what y'all think about an analogous function like `rep_col_shuffle()` (or `rep_shuffle_col()`)?

The motivation here is that the default API for infer is based around the formalism of an NHST. These two functions - `rep_slice_sample()` and `rep_shuffle_col()` - would allow users (and teachers) to get through the generate step without the formalism. This is helpful for creating a more porous boundary with other forms of simulation; there would be just two fairly generic, mechanistically named functions instead of five functions laser-focused on the NHST framework.

In terms of implementation, it looks like `generate()` takes two paths: `rep_slice_sample()` for bootstrapping and `permute()` > `permute_once()` > `permute_col()` > `sample()` for permutations. Seems like the easiest approach would be to just wrap `permute()`.

Thoughts?
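For comparison, the bootstrap half of this idea already works today. A minimal sketch, assuming a recent infer release that exports `rep_slice_sample()` with the `prop`, `replace`, and `reps` arguments, and using the mean of `age` purely as an example statistic:

```r
library(infer)  # provides gss, rep_slice_sample(), and the pipe
library(dplyr)

set.seed(1)
boot_dist <- gss %>%
  # resample all rows with replacement, 1000 times
  rep_slice_sample(prop = 1, replace = TRUE, reps = 1000) %>%
  # explicit for clarity; one summary row per replicate
  group_by(replicate) %>%
  summarize(stat = mean(age))
```

A column-shuffling counterpart would slot into the same pattern, with the resampling line swapped for the permutation step.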