tidymodels / rsample

Classes and functions to create and summarize resampling objects
https://rsample.tidymodels.org

Identical seeds produce different results across sessions #342

Closed mikemahoney218 closed 2 years ago

mikemahoney218 commented 2 years ago

@mikemahoney218 this is awesome and just what I need, thank you! One thing I noticed, though, is that set.seed() doesn't seem to be applied to the splits.

Is it possible to have set.seed() make group_initial_split() separate things the same way each time?

Here is an example. The split stays the same when you repeat the process in the same session, but if you restart RStudio you'll see different colors (train vs. test) each time.

library(tidyverse)
library(rsample)

df <- starwars %>% 
  mutate(name = factor(name))

set.seed(3332)
group_split <- group_initial_split(df, group = name)
group_train <- training(group_split)
group_test <- testing(group_split)

group_train %>% select(mass, name) %>% mutate(group = "train") %>% 
  bind_rows(group_test %>% select(mass, name) %>% mutate(group = "test")) %>% 
  ggplot(aes(mass, name, color = group))+
  geom_point()

Thanks

Originally posted by @Jeffrothschild in https://github.com/tidymodels/rsample/issues/207#issuecomment-1182754852

mikemahoney218 commented 2 years ago

Believe the issue is here: https://github.com/tidymodels/rsample/blob/main/R/make_groups.R#L118

vec_count() apparently isn't deterministic across sessions when sort = "count" and several groups are tied on the same count (which is why we haven't noticed with things like Ames, where there are few to no groups with the same counts).
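To illustrate the idea of making tied counts order reproducibly, here is a minimal base-R sketch. It is not the code in make_groups.R and not necessarily how the fix was implemented; count_groups_stable() is a hypothetical helper name. It counts group sizes, sorts by descending count, and breaks ties on the group key so the order no longer depends on session state.

```r
# Hypothetical helper (not rsample's implementation): count how often each
# group occurs and return the groups ordered by descending count, with ties
# broken by the group key so that the order is reproducible.
count_groups_stable <- function(groups) {
  counts <- table(groups, useNA = "no")
  out <- data.frame(
    key = names(counts),
    count = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # method = "radix" sorts character keys in the C locale, so the
  # tie-breaking order does not depend on the session's locale either.
  out[order(-out$count, out$key, method = "radix"), , drop = FALSE]
}

# "a" and "b" are tied on 2, "c" and "d" are tied on 1.
groups <- c("b", "a", "c", "a", "b", "d")
res <- count_groups_stable(groups)
print(res)
```

With a fixed ordering of the groups, any seed-based assignment of groups to the training and testing sets starts from the same input every time, which is what set.seed() needs in order to reproduce a split.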

mikemahoney218 commented 2 years ago

Similar usage here, so this probably impacts vfold as well: https://github.com/tidymodels/rsample/blob/main/R/make_groups.R#L80

mikemahoney218 commented 2 years ago

Thanks a lot for the report, @Jeffrothschild! This should now be fixed in the development version of rsample. The following is now stable across sessions:

library(rsample)

set.seed(3332)
assessment(group_initial_split(dplyr::starwars, name))
#> # A tibble: 22 × 14
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 C-3PO       167  75   <NA>       gold       yellow         112   none  mascu…
#>  2 Darth V…    202 136   none       white      yellow          41.9 male  mascu…
#>  3 Obi-Wan…    182  77   auburn, w… fair       blue-gray       57   male  mascu…
#>  4 Palpati…    170  75   grey       pale       yellow          82   male  mascu…
#>  5 Boba Fe…    183  78.2 black      fair       brown           31.5 male  mascu…
#>  6 Nien Nu…    160  68   none       grey       black           NA   male  mascu…
#>  7 Nute Gu…    191  90   none       mottled g… red             NA   male  mascu…
#>  8 Finis V…    170  NA   blond      fair       blue            91   male  mascu…
#>  9 Watto       137  NA   black      blue, grey yellow          NA   male  mascu…
#> 10 Gasgano     122  NA   none       white, bl… black           NA   male  mascu…
#> # … with 12 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Created on 2022-07-13 by the reprex package (v2.0.1)

library(rsample)

set.seed(3332)
assessment(group_initial_split(dplyr::starwars, name))
#> # A tibble: 22 × 14
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 C-3PO       167  75   <NA>       gold       yellow         112   none  mascu…
#>  2 Darth V…    202 136   none       white      yellow          41.9 male  mascu…
#>  3 Obi-Wan…    182  77   auburn, w… fair       blue-gray       57   male  mascu…
#>  4 Palpati…    170  75   grey       pale       yellow          82   male  mascu…
#>  5 Boba Fe…    183  78.2 black      fair       brown           31.5 male  mascu…
#>  6 Nien Nu…    160  68   none       grey       black           NA   male  mascu…
#>  7 Nute Gu…    191  90   none       mottled g… red             NA   male  mascu…
#>  8 Finis V…    170  NA   blond      fair       blue            91   male  mascu…
#>  9 Watto       137  NA   black      blue, grey yellow          NA   male  mascu…
#> 10 Gasgano     122  NA   none       white, bl… black           NA   male  mascu…
#> # … with 12 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Created on 2022-07-13 by the reprex package (v2.0.1)

Jeffrothschild commented 2 years ago

Awesome, thanks so much for the quick fix, @mikemahoney218! This will save me so much time.

github-actions[bot] commented 2 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.