matloff / qeML


grouped data #11

Open ggrothendieck opened 10 months ago

ggrothendieck commented 10 months ago

With grouped data it is important that if one row of a group is in the training set, then the other rows in that group cannot be in the test set. That is, instead of sampling individual rows, sample groups. This link shows an example, and there is another example further down in this comment.

https://stackoverflow.com/questions/71087864/how-to-keep-grouped-variables-together-in-training-and-test-data

Perhaps allow the holdout= argument to be a vector of indexes, or provide a group= argument. The first possibility would allow other schemes as well, whereas the second is easier for the user in this situation but does not allow for unanticipated sampling schemes. It would be possible to have both, of course.
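To make the group= idea concrete, here is a minimal base-R sketch of group-wise sampling. The helper name `groupHoldoutIdxs` and its arguments are hypothetical and are not part of qeML; it only illustrates how holdout row indexes could be derived from a grouping vector.

```r
# Hypothetical helper (not part of qeML): hold out whole groups, not single rows.
# grp:         vector with one group label per row of the data
# nHoldGroups: number of groups to place in the holdout set
groupHoldoutIdxs <- function(grp, nHoldGroups) {
  groups <- unique(grp)
  stopifnot(nHoldGroups >= 1, nHoldGroups < length(groups))
  # sample group labels, then collect every row belonging to those groups
  heldOut <- groups[sample.int(length(groups), nHoldGroups)]
  which(grp %in% heldOut)
}

set.seed(123)
grp <- rep(1:15, each = 10)   # iris, treating each successive 10 rows as a group
idxs <- groupHoldoutIdxs(grp, 3)
tst <- iris[idxs, ]           # all rows of the 3 held-out groups
trn <- iris[-idxs, ]          # all rows of the remaining 12 groups

# check that no group appears on both sides of the split
length(intersect(grp[idxs], grp[-idxs])) == 0
```

A vector-valued holdout= would let the caller pass `idxs` straight through, while a group= argument could do this sampling internally.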

I am currently kludging it as shown here, using iris as the example and assuming each successive 10 rows forms a group.

# iris where each successive 10 rows forms a group
library(qeML)
set.seed(123)

# create grouping variable 
grp <- rep(1:15, each = 10)

# set holdout indexes so that all rows of a group land together, either in test or in train
holdout <- which(grp %in% sample(15, 3))

# kludge it by redefining sample within qeKNN to return the indexes we want
trace(qeKNN, quote(sample <- function(x, holdout) holdout))
qeKNN(iris, "Species", holdout = holdout)
untrace(qeKNN)

matloff commented 10 months ago

I will add a function to v.1.2, and then blog about it. Will post a link here, all probably later this week.