privefl / bigstatsr

R package for statistical tools with big matrices stored on disk.
https://privefl.github.io/bigstatsr/

Extracting scores from each inner validation #182

Closed (leocob closed 6 months ago)

leocob commented 6 months ago

In the explanation of CMSA, I read:

"A 'regularization path' of models is trained on the inner training set and the corresponding predictions (scores) for the inner validation set are computed"

Is there a way to extract these scores for each inner validation set?

Exploring the test object from the big_spLogReg tutorial, I can't seem to find them:

N <- 230
M <- 730
X <- FBM(N, M, init = rnorm(N * M, sd = 5))
y01 <- as.numeric((rowSums(X[, 1:10]) + 2 * rnorm(N)) > 0)
covar <- matrix(rnorm(N * 3), N)

ind.train <- sort(sample(nrow(X), 150))
ind.test <- setdiff(rows_along(X), ind.train)

# fitting model for multiple lambdas and alphas
test <- big_spLogReg(X, y01[ind.train], ind.train = ind.train,
                     covar.train = covar[ind.train, ],
                     alphas = c(1), K = 2, warn = FALSE)

# EXPLORING test OBJECT 
str(test)
List of 1  # only 1 element, since alphas has only 1 element
 $ :List of 2  # results of the 2 validation folds, from K = 2
  ..$ :List of 14
  .. ..$ intercept     : num -0.173
  .. ..$ beta          : num [1:733] 0.0419 0.0275 0 0 0 ...
  .. ..$ iter          : int [1:51] 0 3 2 2 2 2 2 2 2 3 ...
  .. ..$ lambda        : num [1:51] 0.235 0.227 0.219 0.212 0.205 ...
  .. ..$ alpha         : num 1
  .. ..$ loss          : num [1:51] 0.691 0.683 0.677 0.67 0.664 ...
  .. ..$ loss.val      : num [1:51] 0.686 0.684 0.682 0.679 0.677 ...
  .. ..$ message       : chr "No more improvement"
  .. ..$ nb_active     : int [1:51] 0 1 1 1 1 1 1 1 1 2 ...
  .. ..$ nb_candidate  : int [1:51] 0 1 1 1 1 1 1 1 1 2 ...
  .. ..$ ind.train     : int [1:75] 3 7 13 14 17 30 31 37 41 47 ...
  .. ..$ power_scale   : num 1
  .. ..$ power_adaptive: num 0
  .. ..$ time          : Named num 0.004
  .. .. ..- attr(*, "names")= chr "elapsed"
  .. ..- attr(*, "class")= chr "big_sp"
  ..$ :List of 14
  .. ..$ intercept     : num -0.193
  .. ..$ beta          : num [1:733] 0.0398 0 0 0 0.0713 ...
  .. ..$ iter          : int [1:51] 0 3 3 2 2 2 4 3 5 3 ...
  .. ..$ lambda        : num [1:51] 0.174 0.168 0.163 0.157 0.152 ...
  .. ..$ alpha         : num 1
  .. ..$ loss          : num [1:51] 0.686 0.68 0.672 0.658 0.646 ...
  .. ..$ loss.val      : num [1:51] 0.691 0.688 0.683 0.676 0.67 ...
  .. ..$ message       : chr "No more improvement"
  .. ..$ nb_active     : int [1:51] 0 2 4 4 4 4 6 6 8 8 ...
  .. ..$ nb_candidate  : int [1:51] 0 3 4 4 4 4 6 6 8 8 ...
  .. ..$ ind.train     : int [1:75] 5 6 8 9 10 18 20 21 24 25 ...
  .. ..$ power_scale   : num 1
  .. ..$ power_adaptive: num 0
  .. ..$ time          : Named num 0.006
  .. .. ..- attr(*, "names")= chr "elapsed"
  .. ..- attr(*, "class")= chr "big_sp"
 - attr(*, "class")= chr "big_sp_list"
 - attr(*, "family")= chr "binomial"
 - attr(*, "alphas")= num 1
 - attr(*, "ind.col")= int [1:730] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "ind.sets")= int [1:150] 2 1 1 2 1 1 1 2 2 2 ...
 - attr(*, "pf")= num [1:733] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "base")= num [1:150] 0 0 0 0 0 0 0 0 0 0 ...

privefl commented 6 months ago

For one fold, something like this should work:

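# Model for the first alpha, first fold
obj_big_sp <- test[[1]][[1]]
# Rows of X used to train this fold's model
ind_train_fold <- obj_big_sp$ind.train
# The remaining training rows form this fold's inner validation set
ind_val_fold <- setdiff(ind.train, ind_train_fold)

# Scores for the inner validation set of this fold
predict(obj_big_sp, X, ind.row = ind_val_fold, ind.col = attr(test, "ind.col"),
        covar.row = covar[ind_val_fold, ])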
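
To get the scores for every fold rather than just one, the same call can probably be wrapped in a loop over test[[1]]. This is only a sketch: it assumes each fold's ind.train holds row indices of X (which the setdiff() above also relies on). The set.seed(1) call and the names val_scores, ind and score are illustrative and not part of the original example. Judging from the str() output, each fold stores a single beta vector, so these are the scores of that one stored model per fold, not of every lambda on the regularization path.

```r
library(bigstatsr)

set.seed(1)  # assumption: added only to make the sketch reproducible

# Same simulated data as in the question
N <- 230
M <- 730
X <- FBM(N, M, init = rnorm(N * M, sd = 5))
y01 <- as.numeric((rowSums(X[, 1:10]) + 2 * rnorm(N)) > 0)
covar <- matrix(rnorm(N * 3), N)
ind.train <- sort(sample(nrow(X), 150))

test <- big_spLogReg(X, y01[ind.train], ind.train = ind.train,
                     covar.train = covar[ind.train, ],
                     alphas = c(1), K = 2, warn = FALSE)

# One data frame per fold of the first (here, only) alpha:
# the validation row indices and the corresponding scores
val_scores <- lapply(test[[1]], function(obj_big_sp) {
  # training rows not used to fit this fold = its inner validation set
  ind_val_fold <- setdiff(ind.train, obj_big_sp$ind.train)
  data.frame(
    ind   = ind_val_fold,
    score = as.numeric(
      predict(obj_big_sp, X, ind.row = ind_val_fold,
              ind.col = attr(test, "ind.col"),
              covar.row = covar[ind_val_fold, , drop = FALSE])
    )
  )
})

str(val_scores)
```

With several alphas, the same lapply() could be nested over test itself, since test has one list of folds per alpha.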

leocob commented 6 months ago

Makes sense, thank you! :)