privefl / bigstatsr

R package for statistical tools with big matrices stored on disk.
https://privefl.github.io/bigstatsr/

How does dfmax work? #151

Closed biona001 closed 2 years ago

biona001 commented 2 years ago

I think this is more a question than an issue.

I ran a sparse linear regression with dfmax=10000, which triggers the "Too many variables" message, yet the optimal beta I extract has 23309 non-zero entries. When I then inspect what I presume is the sparsity level for each lambda, it never goes much above 10000.

# fit lasso and check fit
lasso.fit <- big_spLinReg(G$genotypes, y, covar.train=Z, dfmax=10000)
summary(lasso.fit)$message
[[1]]
 [1] "Too many variables" "Too many variables" "Too many variables"
 [4] "Too many variables" "Too many variables" "Too many variables"
 [7] "Too many variables" "Too many variables" "Too many variables"
[10] "Too many variables"
# extract best beta and count non-zero entries
result <- summary(lasso.fit, best.only = TRUE)
lasso_beta <- result$beta[[1]]
sum(lasso_beta != 0)
[1] 23309
# check one CV fold (index 10) and its number of active variables along the lambda path
r <- lasso.fit[[1]][10]
r[[1]]$nb_active
  [1]     0     1     1     1     1     1     1     1     1     1     1     1
 [13]     1     1     1     1     1     1     1     1     1     1     1     1
 [25]     1     1     1     1     1     1     1     1     1     1     1     1
 [37]     1     1     1     1     1     1     1     1     1     1     1     1
 [49]     1     1     1     1     1     2     2     2     2     2     2     2
 [61]     2     2     2     2     2     2     2     2     2     2     2     3
 [73]     3     3     3     3     3     3     3     3     4     4     8     8
 [85]     8     8     8     8     8    10    10    15    18    20    24    28
 [97]    34    37    40    47    52    57    63    72    78    87    98   107
[109]   122   138   155   174   196   210   228   257   277   316   337   378
[121]   414   462   515   561   619   670   734   813   880   971  1068  1165
[133]  1276  1400  1538  1689  1860  2035  2251  2433  2663  2947  3250  3565
[145]  3913  4294  4736  5161  5665  6210  6815  7414  8059  8805  9656 10517

I don't understand where 23309 comes from. It also seems unexpected that specifying dfmax=10000 still gave me a model with many more non-zero entries.

privefl commented 2 years ago

I guess if you check each model individually (one per CMSA split), you will find slightly more than 10K non-zero variables in each. But the final model averages all these models, so it can have many more than 10K non-zero variables if the K models do not all use the same variables.

biona001 commented 2 years ago

I see. You are taking each of the models with slightly more than 10k variables, and literally averaging their beta values. Thanks!
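
For anyone landing here later, the averaging effect described above can be reproduced with a small base-R simulation. This is only a sketch under made-up assumptions, not bigstatsr's actual code: the sizes `p`, `K`, `n_shared` and `n_extra` are hypothetical, and each fold's coefficients are random numbers rather than lasso estimates. It only shows that averaging K sparse vectors with partly different supports gives a vector whose support is the union of the individual supports.

```r
set.seed(1)

p        <- 100000L  # hypothetical number of variables
K        <- 10L      # hypothetical number of folds (models to average)
n_shared <- 8000L    # variables assumed to be selected in every fold
n_extra  <- 2500L    # variables assumed to be selected in one fold only by chance

# One column per fold: a sparse coefficient vector with its own support.
betas <- sapply(seq_len(K), function(k) {
  beta  <- numeric(p)
  extra <- sample((n_shared + 1L):p, n_extra)  # fold-specific variables
  ind   <- c(seq_len(n_shared), extra)
  beta[ind] <- rnorm(length(ind))              # stand-in for fitted coefficients
  beta
})

# Non-zero count in each individual model (just above the nominal limit).
per_fold <- colSums(betas != 0)

# Averaging the K vectors keeps any variable that is non-zero in at least one fold.
beta_avg <- rowMeans(betas)
n_avg    <- sum(beta_avg != 0)
n_union  <- sum(rowSums(betas != 0) > 0)

print(per_fold)
print(c(averaged = n_avg, union = n_union))
```

Each column has `n_shared + n_extra` non-zero entries, while the averaged vector has as many non-zero entries as the union of all supports, which is larger whenever the folds disagree on some variables.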