lindesaysh / MRSea

MRSea package construction
http://lindesaysh.github.io/MRSea/

Cross validation IDs in getCVids #10

Open grwhumphries opened 3 years ago

grwhumphries commented 3 years ago

Heya! I came across some funny behaviour in the getCVids() function with some data I was working with.

In getCVids, this bit of code is being used to generate the CV fold ID column:

if (length(block) == 1) {
  blocks <- unique(data[, block])
} else {
  blocks <- unique(block)
}
nBlocks <- length(blocks)
n_cv <- ceiling(nBlocks/folds)
set.seed(seed)
id_block_cv <- sample(rep(1:folds, n_cv), n_cv * folds)
id_block_cv <- id_block_cv[1:length(blocks)]
id_cv <- numeric(length(block))
for (xi in 1:nBlocks) {
  rows <- which(block == blocks[xi])
  id_cv[rows] <- id_block_cv[xi]
}

For a dataset I was working with, which has 13 unique blocks, asking for K = 10 cross-validation folds did not give me 10 CV folds:

> unique(id_cv)
[1]  5  4  2  8  7 10  3  6

As you can see in the example above, there are only 8 unique IDs. However, if I comment out the set.seed() call and run this several times (letting sample() draw a different permutation each time), I can eventually get a set of IDs that covers all 10 CV folds:

## After 4 runs
> unique(id_cv)
 [1]  3  4  1  8  6  5 10  7  9  2
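
For anyone wanting to reproduce this without my data, here is a minimal sketch using 13 made-up block IDs and the same sampling logic quoted above (block names are hypothetical):

## Minimal sketch reproducing the behaviour with 13 hypothetical blocks
blocks <- paste0("T", 1:13)   # stand-in for 13 unique transect/block IDs
folds  <- 10
n_cv   <- ceiling(length(blocks) / folds)                 # = 2
set.seed(1234)
id_block_cv <- sample(rep(1:folds, n_cv), n_cv * folds)   # 20 shuffled fold labels
id_block_cv <- id_block_cv[1:length(blocks)]              # keep only the first 13
length(unique(id_block_cv))   # frequently < 10: some folds never get a block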

The problem that arises here is that the fold IDs need to contain every value of the sequence (i.e. if K = 10, there need to be CV ids 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10). When you then run the cv.gamMRsea() function, it breaks down, because at line 45 of that function you iterate through seq_len(ms), where ms is the maximum value of unique(id_cv) (here 1:10). So if, for example, id = 2 is missing, you get an error when the cost function is calculated on line 69.
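
To illustrate the failure mode, here is a simplified sketch (not the actual cv.gamMRsea() code) of what happens when you loop over every fold index but some folds were never assigned:

## Simplified sketch of the downstream failure; cv.gamMRsea() itself does more work
id_cv <- c(5, 4, 2, 8, 7, 10, 3, 6)   # toy vector of assigned fold ids (1 and 9 missing)
ms <- max(unique(id_cv))              # 10
for (i in seq_len(ms)) {
  test.rows <- which(id_cv == i)
  if (length(test.rows) == 0) {
    ## an empty test set at this point is what breaks the cost-function call
    message("Fold ", i, " is empty; the cost function has nothing to evaluate")
  }
}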

A few ideas:
1) Add an error check that tells the user to change the seed when too few folds are generated:

id_cv <- numeric(length(block))
for (xi in 1:nBlocks) {
  rows <- which(block == blocks[xi])
  id_cv[rows] <- id_block_cv[xi]
}
if (length(unique(id_cv)) != folds) {
  stop("Not enough folds generated, change seed and try again")
}

2) Use a sampling scheme that adjusts the sampling probabilities based on which IDs have already been drawn. I've tested this out and it seems to work well; it nearly always creates exactly the requested number of folds each time:

set.seed(1)
kfold <- 10
probs <- rep(0.1, kfold)
id_cv <- numeric(length(block))
for (i in 1:nBlocks) {
  idval <- sample(1:kfold, size = 1, prob = probs)
  print(idval)
  rows <- which(block == blocks[i])
  id_cv[rows] <- idval

  ### Check which idvals/folds already exist and boost the selection probability of the others
  ## Reset probs
  probs <- rep(0.1, kfold)
  ## Increase the probability of folds that have not yet been sampled
  probinc <- length(unique(id_cv[id_cv != 0])) / kfold
  probs[-unique(id_cv[id_cv != 0])] <- probs[-unique(id_cv[id_cv != 0])] + probinc
  probs[unique(id_cv[id_cv != 0])] <- 1/kfold

  print(probs)
}
table(id_cv)

The final function would look like this:

testfun <- function (data, folds, block = NULL, seed = 1234) 
{
  if (is.null(block) == T) {
    N <- 1:nrow(data)
    n_cv <- ceiling(length(N)/folds)
    set.seed(seed)
    id_cv <- sample(rep(1:folds, n_cv), n_cv * folds)
    id_cv <- id_cv[1:length(N)]
  }
  else {
    if (length(block) == 1) {
      blocks <- unique(data[, block])
    }
    else {
      blocks <- unique(block)
    }
    nBlocks <- length(blocks)
    n_cv <- ceiling(nBlocks/folds)

    set.seed(seed)

    probs <- rep(0.1, folds)
    id_cv <- numeric(length(block))
    for (i in 1:nBlocks) {
      idval <- sample(1:folds, size = 1, prob = probs)
      rows <- which(block == blocks[i])
      id_cv[rows] <- idval

      ### Check which idvals/folds already exist and boost the selection probability of the others
      ## Reset probs
      probs <- rep(0.1, folds)
      ## Increase the probability of folds that have not yet been sampled
      probinc <- length(unique(id_cv[id_cv != 0])) / folds
      probs[-unique(id_cv[id_cv != 0])] <- probs[-unique(id_cv[id_cv != 0])] + probinc
      probs[unique(id_cv[id_cv != 0])] <- 1/folds
    }
  }
  return(id_cv)
}

> output <- testfun(data=DF,folds=10,block=DF$transectId,seed=2)
> unique(output)
 [1]  3  8  6  9 10  4  5  1  2  7
> output <- testfun(data=DF,folds=10,block=DF$transectId,seed=3)
> unique(output)
 [1]  3  9  5  2  6 10  8  4  1  7
> output <- testfun(data=DF,folds=10,block=DF$transectId,seed=4)
> unique(output)
 [1]  7  2  4  1  3  8  5  6 10  9

You can see from the output above that the correct number of folds is now being assigned.

I hope that's useful.

lindesaysh commented 3 years ago

Thanks for this Grant. I'll take a look soon. The issue arises when you don't have very many blocks. The code is designed to weight each block equally, with roughly the same number of data points (for evenness). When you have 10 folds and 13 blocks, depending on their size it can be pretty hard to get even numbers. I would use either 5-fold validation or just use all 13 blocks as folds (effectively, leave one correlated unit out at a time). I haven't had a chance to look at your suggestion just yet, but I'll see if it makes sense. At the very least a warning message should be added, so thanks for spotting this.
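
For reference, the "all 13 blocks as folds" option is just one fold per block; a minimal sketch (DF and transectId are placeholders for your data):

## One fold per block: leave-one-block-out cross-validation
blocks <- unique(DF$transectId)
id_cv  <- match(DF$transectId, blocks)   # each block becomes its own fold
length(unique(id_cv)) == length(blocks)  # every fold id is guaranteed to appear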