privefl / bigstatsr

R package for statistical tools with big matrices stored on disk.
https://privefl.github.io/bigstatsr/

Estimation of model performance #163

Closed arnonl closed 1 year ago

arnonl commented 1 year ago

Hi, I am using big_spLogReg to fit a logistic model on a small set of 85 SNPs. Each time I fit the model and evaluate it on the test data with predict() and then AUC(), I get quite a different value. I also checked which variables are selected in each run by fitting the model 10 times (keeping those with an OR above 1.01 or below 0.99). A few SNPs are selected every time (in all 10 runs), while the others are selected with varying frequency. What is the strategy going forward? Can I set a criterion based on the repeatedly selected SNPs? If I want a rough estimate of the classification performance, should I just run it several times and average the AUC, reporting its SE?

Thank you for your help.

privefl commented 1 year ago

Hi,

  1. The implementation provided is not suited to only a few variables (SNPs); it is designed for hundreds of thousands of variables. You should probably run {glmnet} directly instead, or even just standard logistic regression (without penalization).
  2. LASSO is known not to be particularly stable in its selection of variables.
  3. Have a look at the {glmnet} vignette to find more suitable models. Discussing variable selection is outside the scope of the help I provide here, sorry.
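
For the "rough estimate of performance" part of the question, a minimal sketch of the repeated-split idea with standard (unpenalized) logistic regression could look like the following. It uses base R only; the simulated genotypes, the effect sizes, the 80/20 split, the number of repetitions, and the helper `auc_rank()` are all assumptions made for illustration and are not taken from the thread. With {bigstatsr} loaded, its `AUC()` function could presumably replace the helper.

```r
# Hypothetical simulated data standing in for the 85 SNPs (genotypes coded 0/1/2).
set.seed(1)
n <- 1000
p <- 85
X <- matrix(rbinom(n * p, size = 2, prob = 0.3), n, p)
# Assumption for the simulation only: the first 5 SNPs have an effect.
beta <- c(rep(0.5, 5), rep(0, p - 5))
y <- rbinom(n, size = 1, prob = plogis(-1.5 + drop(X %*% beta)))
dat <- data.frame(y = y, X)

# AUC from the rank-sum (Mann-Whitney) statistic; ties receive average ranks.
auc_rank <- function(score, y) {
  r <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Repeat a random 80/20 split: fit on the training part, score the held-out part.
n_rep <- 20
aucs <- replicate(n_rep, {
  ind_test <- sample(n, size = n / 5)
  fit <- glm(y ~ ., data = dat[-ind_test, ], family = binomial)
  pred <- predict(fit, newdata = dat[ind_test, ], type = "response")
  auc_rank(pred, dat$y[ind_test])
})

# Average AUC and its spread across the repeated splits.
c(mean = mean(aucs), sd = sd(aucs))
```

Because the repeated splits reuse the same individuals, the AUC values are not independent; the standard deviation across splits is better read as a rough indication of variability than as a formal standard error.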