stephenslab / fsusieR

Draft package for "Sum of Single Functions" method.
https://stephenslab.github.io/fsusieR
Other
2 stars 1 forks source link

Intercept, prediction function and fitted values #11

Open pcarbo opened 3 days ago

pcarbo commented 3 days ago

@william-denault It is important to output the intercept for several reasons, but one is that it is often helpful to be able to use the model to predict Y (e.g., to compare the observed Y with the predicted Y). This also provides a simple "sanity check" to make sure that the model fitting is working. You will notice that susieR has two related features: (i) it returns the "fitted" values (in fit$fitted) and (ii) it has a "predict" method. Here's an example of what I mean:

# library(susieR)
set.seed(1)
n <- 400
p <- 1000
beta <- rep(0,p)
beta[1:4] <- 1
X <- matrix(rnorm(n*p),nrow = n,ncol = p)
X <- scale(X,center = TRUE,scale = TRUE)
y <- drop(X %*% beta + rnorm(n))
fit <- susie(X,y,L = 10)
ypred <- predict(fit,X)
plot(y,fit$fitted,pch = 20,xlab = "true",ylab = "estimated")
abline(a = 0,b = 1,lty = "dotted",col = "magenta")
plot(y,ypred,pch = 20,xlab = "true",ylab = "estimated")
abline(a = 0,b = 1,lty = "dotted",col = "magenta")

(In this case, fit$fitted and ypred give the same result.)

I suggest implementing (i) and (ii) for fsusieR, which should be straightforward once you have the intercept.

Note that the intercept is not simply the the mean of Y, but the mean after removing the effects of the SNPs. See line 460 of susie.R for how this is implemented in susieR. The general formula looks something like this:

$$ (Z^TZ)^{-1} Z\bar{r}, \quad \bar{r} = y - X\bar{b} $$

For an intercept, Z is simply a column of ones.

william-denault commented 2 days ago

Hi @pcarbo ,

It is less straightforward in this context. The main problem about fitted values/prediction is that to run fsusie, the Y matrix is expected to have 2^J columns, and when it is not the case, we actually remap the data using an interpolation scheme so that there are 2^J columns (see intro vignette section #### Handling function of any length and unevenly space data) . So the actual  output of fsusie is a list of the fitted functions of length 2^J

So, the "real" predictions are actually made on this grid. How should we proceed? Should we make different "predict" functions, one that uses the estimate and the other one that outputs an "interpolated version" of the  prediction on the original grid?