lindesaysh / MRSea

MRSea package construction
http://lindesaysh.github.io/MRSea/

Cross validation IDs in getCVids #10

Open grwhumphries opened 3 years ago

grwhumphries commented 3 years ago

Heya! I came across some funny behaviour in the getCVids() function with some data I was working with.

In getCVids, this bit of code is being used to generate the CV fold ID column:

if (length(block) == 1) {
  blocks <- unique(data[, block])
} else {
  blocks <- unique(block)
}
nBlocks <- length(blocks)
n_cv <- ceiling(nBlocks/folds)
set.seed(seed)
id_block_cv <- sample(rep(1:folds, n_cv), n_cv * folds)
id_block_cv <- id_block_cv[1:length(blocks)]
id_cv <- numeric(length(block))
for (xi in 1:nBlocks) {
  rows <- which(block == blocks[xi])
  id_cv[rows] <- id_block_cv[xi]
}

For a dataset I was working with, which has 13 unique blocks, asking for K = 10 cross-validation folds did not give me 10 CV folds:

> unique(id_cv)
[1]  5  4  2  8  7 10  3  6

As you can see in the example above, there are only 8 unique IDs. However, if I comment out the set.seed() call and run this several times (letting sample() draw a different permutation each time), I can eventually get a set of IDs that covers all 10 CV folds:

## After 4 runs
> unique(id_cv)
 [1]  3  4  1  8  6  5 10  7  9  2
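
For anyone wanting to reproduce this without my data, here is a minimal sketch using 13 made-up block IDs and the same sampling logic quoted above (block names are hypothetical):

## Minimal sketch reproducing the behaviour with 13 hypothetical blocks
blocks <- paste0("T", 1:13)   # stand-in for 13 unique transect/block IDs
folds  <- 10
n_cv   <- ceiling(length(blocks) / folds)                 # = 2
set.seed(1234)
id_block_cv <- sample(rep(1:folds, n_cv), n_cv * folds)   # 20 shuffled fold labels
id_block_cv <- id_block_cv[1:length(blocks)]              # keep only the first 13
length(unique(id_block_cv))   # frequently < 10: some folds never get a block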

The problem that arises here is that the fold IDs need to contain every value of the sequence (i.e. if K = 10, there need to be CV ids 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10). When you then run the cv.gamMRsea() function, it breaks down, because at line 45 of that function you iterate through seq_len(ms), where ms is the maximum value of unique(id_cv) (here 1:10). So if, for example, id = 2 is missing, you get an error when the cost function is calculated on line 69.
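
To illustrate the failure mode, here is a simplified sketch (not the actual cv.gamMRsea() code) of what happens when you loop over every fold index but some folds were never assigned:

## Simplified sketch of the downstream failure; cv.gamMRsea() itself does more work
id_cv <- c(5, 4, 2, 8, 7, 10, 3, 6)   # toy vector of assigned fold ids (1 and 9 missing)
ms <- max(unique(id_cv))              # 10
for (i in seq_len(ms)) {
  test.rows <- which(id_cv == i)
  if (length(test.rows) == 0) {
    ## an empty test set at this point is what breaks the cost-function call
    message("Fold ", i, " is empty; the cost function has nothing to evaluate")
  }
}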

A few ideas:
1) Add an error check that tells the user to change the seed when too few folds are generated:

id_cv <- numeric(length(block))
for (xi in 1:nBlocks) {
  rows <- which(block == blocks[xi])
  id_cv[rows] <- id_block_cv[xi]
}
if (length(unique(id_cv)) != folds) {
  stop("Not enough folds generated, change seed and try again")
}

2) Use a sampling scheme that adjusts the sampling probabilities based on which IDs have already been drawn. I've tested this out and it seems to work well; it nearly always creates exactly the requested number of folds each time:

set.seed(1)
kfold <- 10
probs <- rep(0.1, kfold)
id_cv <- numeric(length(block))
for (i in 1:nBlocks) {
  idval <- sample(1:kfold, size = 1, prob = probs)
  print(idval)
  rows <- which(block == blocks[i])
  id_cv[rows] <- idval

  ### Check which idvals/folds already exist and boost the selection probability of the others
  ## Reset probs
  probs <- rep(0.1, kfold)
  ## Increase the probability of folds that have not yet been sampled
  probinc <- length(unique(id_cv[id_cv != 0])) / kfold
  probs[-unique(id_cv[id_cv != 0])] <- probs[-unique(id_cv[id_cv != 0])] + probinc
  probs[unique(id_cv[id_cv != 0])] <- 1/kfold

  print(probs)
}
table(id_cv)

The final function would look like this:

testfun <- function (data, folds, block = NULL, seed = 1234) 
{
  if (is.null(block) == T) {
    N <- 1:nrow(data)
    n_cv <- ceiling(length(N)/folds)
    set.seed(seed)
    id_cv <- sample(rep(1:folds, n_cv), n_cv * folds)
    id_cv <- id_cv[1:length(N)]
  }
  else {
    if (length(block) == 1) {
      blocks <- unique(data[, block])
    }
    else {
      blocks <- unique(block)
    }
    nBlocks <- length(blocks)
    n_cv <- ceiling(nBlocks/folds)

    set.seed(seed)

    probs <- rep(0.1, folds)
    id_cv <- numeric(length(block))
    for (i in 1:nBlocks) {
      idval <- sample(1:folds, size = 1, prob = probs)
      rows <- which(block == blocks[i])
      id_cv[rows] <- idval

      ### Check which idvals/folds already exist and boost the selection probability of the others
      ## Reset probs
      probs <- rep(0.1, folds)
      ## Increase the probability of folds that have not yet been sampled
      probinc <- length(unique(id_cv[id_cv != 0])) / folds
      probs[-unique(id_cv[id_cv != 0])] <- probs[-unique(id_cv[id_cv != 0])] + probinc
      probs[unique(id_cv[id_cv != 0])] <- 1/folds
    }
  }
  return(id_cv)
}

> output <- testfun(data=DF,folds=10,block=DF$transectId,seed=2)
> unique(output)
 [1]  3  8  6  9 10  4  5  1  2  7
> output <- testfun(data=DF,folds=10,block=DF$transectId,seed=3)
> unique(output)
 [1]  3  9  5  2  6 10  8  4  1  7
> output <- testfun(data=DF,folds=10,block=DF$transectId,seed=4)
> unique(output)
 [1]  7  2  4  1  3  8  5  6 10  9

You can see from the output above that the correct number of folds is now being assigned.

I hope that's useful.

lindesaysh commented 3 years ago

Thanks for this Grant. I'll take a look soon. The issue arises when you don't have very many blocks. The code is designed to weight each block equally, with roughly the same number of data points (for evenness). When you have 10 folds and 13 blocks, depending on their size it can be pretty hard to get even numbers. I would use either 5-fold validation or just use all 13 blocks as folds (effectively, leave one correlated unit out at a time). I haven't had a chance to look at your suggestion just yet, but I'll see if it makes sense. At the very least a warning message should be added, so thanks for spotting this.
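
For reference, the "all 13 blocks as folds" option is just one fold per block; a minimal sketch (DF and transectId are placeholders for your data):

## One fold per block: leave-one-block-out cross-validation
blocks <- unique(DF$transectId)
id_cv  <- match(DF$transectId, blocks)   # each block becomes its own fold
length(unique(id_cv)) == length(blocks)  # every fold id is guaranteed to appear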