Closed mikemahoney218 closed 2 years ago
I believe the issue is here: https://github.com/tidymodels/rsample/blob/main/R/make_groups.R#L118
vec_count() apparently isn't deterministic across sessions when sort = "count" and several groups are tied on the same count (which is why we haven't noticed with things like Ames, where there are few to no groups with the same counts).
There is similar usage here, so this probably impacts vfold as well: https://github.com/tidymodels/rsample/blob/main/R/make_groups.R#L80
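To illustrate the tie problem described above, here is a minimal base R sketch of counting groups and ordering them so that ties are always broken the same way. The helper name stable_group_counts() is hypothetical and this is not the code rsample uses; it only shows the general idea of adding the group key as a secondary sort so that equal counts cannot come back in a session-dependent order.

```r
# Hypothetical helper for illustration only; not rsample's implementation.
# Counts each group, then sorts by count (descending) and breaks ties by
# the group key, so the row order never depends on hashing or input order.
stable_group_counts <- function(groups) {
  tab <- table(groups)
  counts <- data.frame(
    key = names(tab),
    count = as.integer(tab),
    stringsAsFactors = FALSE
  )
  # method = "radix" sorts character keys in the C locale, which keeps the
  # tie-break independent of the session's collation settings.
  ord <- order(-counts$count, counts$key, method = "radix")
  counts <- counts[ord, , drop = FALSE]
  rownames(counts) <- NULL
  counts
}

groups <- c("b", "a", "c", "a", "b", "d")

# "a" and "b" are tied at 2, "c" and "d" are tied at 1.
res_forward <- stable_group_counts(groups)
res_reversed <- stable_group_counts(rev(groups))

# The same ordering comes back regardless of the input order.
identical(res_forward, res_reversed)
```

With every group appearing exactly once, as with name in dplyr::starwars, all counts are tied, so an explicit tie-break like this is what would make the ordering repeatable.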
Thanks a lot for the report, @Jeffrothschild! This should now be fixed in the development version of rsample. The following is now stable across sessions:
library(rsample)
set.seed(3332)
assessment(group_initial_split(dplyr::starwars, name))
#> # A tibble: 22 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 2 Darth V… 202 136 none white yellow 41.9 male mascu…
#> 3 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
#> 4 Palpati… 170 75 grey pale yellow 82 male mascu…
#> 5 Boba Fe… 183 78.2 black fair brown 31.5 male mascu…
#> 6 Nien Nu… 160 68 none grey black NA male mascu…
#> 7 Nute Gu… 191 90 none mottled g… red NA male mascu…
#> 8 Finis V… 170 NA blond fair blue 91 male mascu…
#> 9 Watto 137 NA black blue, grey yellow NA male mascu…
#> 10 Gasgano 122 NA none white, bl… black NA male mascu…
#> # … with 12 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
Created on 2022-07-13 by the reprex package (v2.0.1)
library(rsample)
set.seed(3332)
assessment(group_initial_split(dplyr::starwars, name))
#> # A tibble: 22 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 2 Darth V… 202 136 none white yellow 41.9 male mascu…
#> 3 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
#> 4 Palpati… 170 75 grey pale yellow 82 male mascu…
#> 5 Boba Fe… 183 78.2 black fair brown 31.5 male mascu…
#> 6 Nien Nu… 160 68 none grey black NA male mascu…
#> 7 Nute Gu… 191 90 none mottled g… red NA male mascu…
#> 8 Finis V… 170 NA blond fair blue 91 male mascu…
#> 9 Watto 137 NA black blue, grey yellow NA male mascu…
#> 10 Gasgano 122 NA none white, bl… black NA male mascu…
#> # … with 12 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> # films <list>, vehicles <list>, starships <list>
Created on 2022-07-13 by the reprex package (v2.0.1)
Awesome, thanks so much for the quick fix, @mikemahoney218! This will save me so much time.
@mikemahoney218 this is awesome and just what I need, thank you! One thing I noticed, though, is that set.seed() doesn't seem to be applied to the splits.
Is it possible to have set.seed() make group_initial_split() separate things in the same way each time?
Here is an example. It does seem to stay the same when repeating the process in the same session, but if you restart RStudio you'll see different colors each time.
Thanks
Originally posted by @Jeffrothschild in https://github.com/tidymodels/rsample/issues/207#issuecomment-1182754852