Closed ClaytonJY closed 6 years ago
Yes, this seems to be the same as #23. caret has a similar function called `groupKFold`.
> To do that now I have to k-fold on just `distinct(mtcars, cyl)`, and then do something hacky to "expand" those folds.
Yep! I'll get the unique values of the variable used for splitting and perform V-fold CV on those values, then translate that to rows of the original data. Doing it this way means that, if you have a large number of values in the splitting variable, you don't have to have a separate split for each value.
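The approach described above can be sketched in base R. This is only an illustration of the idea (fold the unique values, then map each fold back to rows), not the package's implementation; `group_folds` is a hypothetical helper name, and the returned list of index vectors is an assumed format.

```r
# Sketch: V-fold CV over the unique values of a grouping column, then
# translate each fold back to row indices of the original data.
# `group_folds` is a hypothetical helper, not part of any package.
group_folds <- function(data, group, v) {
  values <- unique(data[[group]])
  stopifnot(v >= 2, v <= length(values))
  # Deal the unique values out to v folds as evenly as possible, in random order.
  fold_id <- sample(rep(seq_len(v), length.out = length(values)))
  lapply(seq_len(v), function(k) {
    held_out <- values[fold_id == k]
    in_assessment <- data[[group]] %in% held_out
    list(
      analysis   = which(!in_assessment),  # rows used for fitting
      assessment = which(in_assessment)    # rows whose group is held out
    )
  })
}

set.seed(1)
folds <- group_folds(mtcars, "cyl", v = 3)
# With v equal to the number of unique values, each assessment set
# contains a single value of cyl.
lapply(folds, function(f) unique(mtcars$cyl[f$assessment]))
```

Because the folds are built on the unique values rather than on rows, `v` can be smaller than the number of groups, in which case several groups share one assessment set.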
I'll close this. If you disagree with the equivalence of the two issues, please reopen.
Give the function in devel (`group_vfold_cv`) a try and see if it does what you need.
@topepo that does it!
bonus question: do the longer-term plans for this package include using tidyeval, so that quoting of the group variable is optional?
extra bonus: would it make sense to allow for multiple grouping variables, e.g. `group_vfold_cv(mtcars, c("cyl", "vs"), 4)`? That would let the user avoid fusing multiple group-defining vars into a new singular group variable.
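For reference, the "fusing" workaround mentioned here might look like the following in base R. The column name `cyl_vs` is made up for the example, and the commented call assumes the quoted-name interface used elsewhere in this thread.

```r
# Fuse cyl and vs into a single grouping column. drop = TRUE keeps only
# the combinations that actually occur in the data.
cars <- mtcars
cars$cyl_vs <- interaction(cars$cyl, cars$vs, drop = TRUE)

# Number of distinct (cyl, vs) groups available for folding.
nlevels(cars$cyl_vs)

# The fused column can then serve as the single group variable, e.g.
# (assuming the quoted-name interface shown in this thread):
# group_vfold_cv(cars, "cyl_vs", 4)
```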
I don't know what to call this, but I'll try to explain my use-case.
I've got a set of data I want to split up for cross-validation (assume v-folding). These observations have a grouping variable, and I want to ensure all groups are kept together, and never split up, when sampling here. Almost an opposite of `strata`.
As an example, we could use `mtcars` and the `cyl` variable; there are 3 unique values (4, 6, 8), so a 3-fold of this type should produce one fold where the assessment is only `cyl = 4`, another where `cyl = 6`, etc.
To do that now I have to k-fold on just `distinct(mtcars, cyl)`, and then do something hacky to "expand" those folds.
Would it be possible to combine `nested_cv` with #23 to achieve this? If not, and this is worthy of inclusion, I'd be happy to help code it up.
Here's my hack:
Also available in this gist.
Tried to make a multi-variable version, but it's a lot harder to get "indices of rows in tibble x that match tibble y" than I expected; `dplyr` pushes that all down into Rcpp-land for the `*_join` functions :(
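One possible way to get those row indices without going through a join is to build a composite key per row and use `%in%`. This is a rough base R sketch under stated assumptions, not the code from the gist: `rows_matching` is a hypothetical helper, and it assumes the separator character never appears in the key columns.

```r
# Hypothetical helper: indices of rows in `x` whose key columns match
# any row of `y`. Works on data frames and tibbles alike.
rows_matching <- function(x, y, by) {
  # One composite key per row; "\r" is assumed not to occur in the data.
  key <- function(d) do.call(paste, c(unname(as.list(d[by])), sep = "\r"))
  which(key(x) %in% key(y))
}

# Hold out the first two distinct (cyl, vs) combinations and find
# every row of mtcars that belongs to them.
held_out <- unique(mtcars[c("cyl", "vs")])[1:2, ]
idx <- rows_matching(mtcars, held_out, by = c("cyl", "vs"))
mtcars[idx, c("cyl", "vs")]
```

The keys are compared as strings, so this suits discrete grouping columns (integers, factors, characters) rather than arbitrary floating-point values.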