tidymodels / infer

An R package for tidyverse-friendly statistical inference
https://infer.tidymodels.org
Other
726 stars 80 forks source link

`rep_slice_sample()` doesn't work with integer sampling weights #480

Closed andrewpbray closed 1 year ago

andrewpbray commented 1 year ago

I'm trying my hand at teaching an intro course using purely rep_slice_sample() and an as yet to be written rep_shuffle_col() or something of the sort. This morning I ran into an unexpected behavior of rep_slice_sample(), particular since it differs from slice_sample(). I will have time later today to look into a fix but just thought I'd put up an issue to track it.

Shoutout to @mine-cetinkaya-rundel . That function that you wrote is still going strong like 8 years later!

library(tidyverse)
library(infer)

df <- tibble(id = 1:5, 
             letter = factor(c("a", "b", "c", "d", "e")))

# works when weight vec sums to 1
rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = c(.5, .4, .3, .2, .1))
#> # A tibble: 10 × 3
#> # Groups:   replicate [5]
#>    replicate    id letter
#>        <int> <int> <fct> 
#>  1         1     1 a     
#>  2         1     3 c     
#>  3         2     3 c     
#>  4         2     5 e     
#>  5         3     5 e     
#>  6         3     2 b     
#>  7         4     1 a     
#>  8         4     3 c     
#>  9         5     1 a     
#> 10         5     2 b

# works when weight vec doesn't sum to 1
rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = c(.5, .5, .5, .5, .5))
#> # A tibble: 10 × 3
#> # Groups:   replicate [5]
#>    replicate    id letter
#>        <int> <int> <fct> 
#>  1         1     5 e     
#>  2         1     4 d     
#>  3         2     2 b     
#>  4         2     3 c     
#>  5         3     2 b     
#>  6         3     4 d     
#>  7         4     1 a     
#>  8         4     4 d     
#>  9         5     3 c     
#> 10         5     5 e

# but doesn't work when weight vec is integer
rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = id)
#> Error: `weight_by` must be 'numeric vector with length `nrow(.data)` = 5', not 'closure'.

# however integer weight vec works for slice_sample()
slice_sample(df, n = 2, 
             weight_by = id)
#> # A tibble: 2 × 2
#>      id letter
#>   <int> <fct> 
#> 1     5 e     
#> 2     3 c

Created on 2023-03-07 by the reprex package (v2.0.1)

simonpcouch commented 1 year ago

Hey Andrew!

Note that integers are numerics:

is.numeric(1L)
#> [1] TRUE

so this isn’t actually an issue with integers vs floats.

It looks like the error you saw here is:

Error: weight_by must be ‘numeric vector with length nrow(.data) = 5’, not ‘closure’.

This is because rep_slice_sample doesn’t support tidyselect, so needs the actual weights vector df$id rather than the tidyselect spec id. Passing the actual weights works fine:

library(tidyverse)
library(infer)

df <- tibble(id = 1:5, 
             letter = factor(c("a", "b", "c", "d", "e")))

rep_slice_sample(df, n = 2, reps = 5, 
                 weight_by = df$id)
#> # A tibble: 10 × 3
#> # Groups:   replicate [5]
#>    replicate    id letter
#>        <int> <int> <fct> 
#>  1         1     4 d     
#>  2         1     5 e     
#>  3         2     5 e     
#>  4         2     4 d     
#>  5         3     3 c     
#>  6         3     5 e     
#>  7         4     5 e     
#>  8         4     1 a     
#>  9         5     4 d     
#> 10         5     5 e

If we feel it's in scope for the package to support tidy selection here as well, I'm up for it!

Created on 2023-03-08 with reprex v2.0.2

andrewpbray commented 1 year ago

Oh, I'm so glad you reviewed this! That's news to me that integers are numerics.

Since this function was adapted (from rep_sample_n()) to mirror changes to dplyr, to me makes sense to keep that parallel going in terms of tidyselect. Do you think it could just be implemented here or if we should really back propagate to any function where tidyselect could play a role? Seeing as how this is just one of the ancillary functions, maybe we could restrict the implementation of tidyselect to just this function?

simonpcouch commented 1 year ago

It'd feel fine to me to just implement that change only to weight_by in this function!

(Ope, I wrote tidyselect when I meant data-masking. Woopsies.😆) We already use something data-masking-esque in specify() and now generate() (via the variables argument). generate() uses a check_cols() helper that makes this happen, so we can piggyback off of that machinery.

andrewpbray commented 1 year ago

Sounds good to me!

Do you have time to do this? If not, I can take a swipe but it'd likely be awhile before I'd have the brain-space to reacquaint myself with the codebase.

simonpcouch commented 1 year ago

Yup, sure thing!🥑

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.