grouped data

With grouped data it is important that if one row of a group is in the training set, then the other rows in that group cannot be in the test set. That is, instead of sampling individual rows, sample groups. This link shows an example, and there is another example further down here.

https://stackoverflow.com/questions/71087864/how-to-keep-grouped-variables-together-in-training-and-test-data

Perhaps allow the holdout= argument to be a vector of row indexes, or provide a group= argument. The first option would also allow other sampling schemes, whereas the second is easier for the user in this situation but does not cover unanticipated sampling schemes. It would be possible to have both, of course.

I am currently kludging it with the code below. The example uses iris, assuming each successive 10 rows form a group.

# iris where each successive 10 rows forms a group
library(qeML)
set.seed(123)

# create grouping variable 
grp <- rep(1:15, each = 10)

# pick 3 of the 15 groups at random; every row of a chosen group goes to the holdout (test) set
holdout <- which(grp %in% sample(15, 3))

# kludge: use trace() to define a local sample() inside qeKNN that returns its second argument, i.e. the indexes we want
trace(qeKNN, quote(sample <- function(x, holdout) holdout))
qeKNN(iris, "Species", holdout = holdout)
untrace(qeKNN)
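
As a rough sketch of what a group= argument might do internally, the base R helper below picks whole groups at random until the holdout set reaches a requested number of rows. The name group_holdout and its arguments are hypothetical and not part of qeML; it only illustrates the sampling step and does not call any qeML function.

```r
# Hypothetical helper (not part of qeML): select whole groups for the holdout set.
#   grp:     vector with one group label per data row
#   holdout: approximate number of rows wanted in the holdout set
group_holdout <- function(grp, holdout) {
  grp <- as.character(grp)
  # number of rows in each group, named by group label
  sizes <- lengths(split(seq_along(grp), grp))
  # visit the groups in random order
  shuffled <- sample(names(sizes))
  # running row count as groups are added one by one
  running <- cumsum(sizes[shuffled])
  # stop at the first group that brings the total up to `holdout`
  k <- which(running >= holdout)[1]
  if (is.na(k)) k <- length(shuffled)  # request exceeds the data: take all groups
  # row indexes of every row belonging to a selected group
  which(grp %in% shuffled[seq_len(k)])
}

# same setup as above: 15 groups of 10 successive rows
set.seed(123)
grp <- rep(1:15, each = 10)
idx <- group_holdout(grp, holdout = 30)

# no group appears on both sides of the split
length(intersect(grp[idx], grp[-idx])) == 0
```

If holdout= accepted a vector of row indexes as proposed, idx could be passed to it directly. With unequal group sizes the holdout set may overshoot the requested row count, since groups are never split.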

matloff / qeML

grouped data #11