privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

Getting started (FBM) #459

Closed DobraVila closed 5 months ago

DobraVila commented 8 months ago

A follow-up from the email.

The original question was:

  1. Why is the same object of class FBM passed both when a sparse linear regression model is fitted (using the function big_spLinReg) and when predictions are made (using the function predict)?

    X <- FBM(N, M, init = rnorm(N * M, sd = 5))
    mod <- big_spLinReg(X, y[ind.train], ind.train = ind.train, K = 4)
    pred <- predict(mod, X, ind.test)

I understand that the original data is split into training and test sets, but I don't understand why X is required for both fitting the model and testing it. Could you please clarify this a bit more?

privefl commented 8 months ago

Basically, X[ind.train, ] is used for training, and X[ind.test, ] for testing. But because X is data stored on disk that we don't want to subset (subsetting would copy those rows into memory as a regular R matrix), I pass ind.train = ind.train to big_spLinReg() so that it accesses only those rows of X, and I pass ind.test to predict() so that it uses only the test rows.
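
For reference, here is a minimal sketch of the full pattern, assuming the bigstatsr package is installed. The sizes N and M, the simulated response y, and the 150-row training split are made up for illustration and are not from the original question. The same X is passed to both calls; only the row indices differ.

```r
library(bigstatsr)  # provides FBM(), big_spLinReg(), rows_along()

set.seed(1)
N <- 230; M <- 730  # illustrative sizes (assumption)

# X is file-backed: the values live on disk, not in an R matrix
X <- FBM(N, M, init = rnorm(N * M, sd = 5))

# Simulated response for ALL N rows (assumption, for illustration only)
y <- rowSums(X[, 1:10]) + rnorm(N)

# Split the row indices, not the data itself
ind.train <- sort(sample(nrow(X), 150))
ind.test  <- setdiff(rows_along(X), ind.train)

# Fit on the training rows of X; y is subset, X is not
mod <- big_spLinReg(X, y[ind.train], ind.train = ind.train, K = 4)

# Predict for the test rows of the same X
pred <- predict(mod, X, ind.row = ind.test)

# One prediction per test row
length(pred) == length(ind.test)
```

Note that y is an ordinary in-memory vector, so it is subset directly with y[ind.train], whereas X is only ever indexed through ind.train and ind.row.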