somakd / fad

Factor Analysis for Data in R
3 stars 1 forks source link

Working with really large p, too small n data. #1

Closed kdorman closed 4 years ago

kdorman commented 4 years ago

svaseq.pdf


title: "svaseq vs fad" author: "yonghui huo" date: "September 25, 2019" output: pdf_document: default html_document: df_print: paged

Model

Suppose $\boldsymbol{x}_1,...,\boldsymbol{x}n$ are i.i.d.observed samples from a $m$-dimensional multivariate Normal distribution, then, [ \boldsymbol{X}{m\times n} = \boldsymbol{B}{m\times d}\boldsymbol{S}{d\times n} + \boldsymbol{\Gamma}{m\times r} \boldsymbol{G}{r\times n} + \boldsymbol{U}_{m\times n}, ] where $m$ is the number of genes and each symbol has the following properties: \begin{align} \boldsymbol{g}i & \sim \mathcal{N}\left(0, \boldsymbol{I}{r\times r}\right) \ \boldsymbol{u}i & \sim \mathcal{N}\left(0, \boldsymbol{D}{m\times m}\right) \ \boldsymbol{x}_i & \sim \mathcal{N}\left(\boldsymbol{\mu}, \boldsymbol{\Gamma}\boldsymbol{\Gamma}^{\top} + \boldsymbol{D}\right) \end{align}

We first ``remove the fixed effects'' by ignoring the random factors, fitting the model by least squares, and calculating the residual matrix [ \boldsymbol{X}{(0)} = \boldsymbol{X} - \hat{\boldsymbol{B}}\boldsymbol{S}, ] where $\hat{\boldsymbol{B}} = \left(\boldsymbol{S}^{\top}\boldsymbol{S}\right)^{-} \boldsymbol{S}^{\top} \boldsymbol{X}$. Then, we will perform factor analysis on $\boldsymbol{X}{(0)}$ using two methods. We realize the major problem is the sample size, but the package could warn when the user inputs ridiculous data dimensions, and confusingly results \textit{are} given for the ridiculous case of $r=4$ factors.

Setup


library(knitr)
knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
        message = FALSE, cache.lazy = FALSE)

if (!require("BiocManager", quietly = TRUE, warn.conflicts = F)) {
    install.packages("BiocManager")
    BiocManer::install("zebrafishRNASeq")
    BiocManager::install("EDASeq")
}
if (!require("devtools", quietly = TRUE, warn.conflicts = F)) {
        install.packages("devtools")
}
if (!require("fad", quietly = TRUE, warn.conflicts = F)) {
    install_github("somakd/fad")
}

library(fad, warn.conflicts = F, quietly = T)
library(zebrafishRNASeq, warn.conflicts = F, quietly = T)
library(EDASeq, warn.conflicts = F, quietly = T)
library(MASS, warn.conflicts = F, quietly = T)
# dataset we use is zebrafishRNASeq
# filter low count genes
data(zfGenes)
filter = apply(zfGenes, 1, function(x) length(x[x>5])>=2)
filtered = zfGenes[filter,]

# set up model matrix S
group = as.factor(rep(c("Ctl", "Trt"), each=3))
set = newSeqExpressionSet(as.matrix(filtered),
                           phenoData = data.frame(group, row.names=colnames(filtered)))

# S is the model matrix for known factors.
S <- model.matrix(~group)

X = counts(set)

# get the residuals X0 which will be used for factor analysis
X0 <- X %*%(diag(ncol(X)) - S %*% ginv(t(S) %*% S) %*% t(S))

fad Method

res <- fad(t(X0), 1)
# Warning message:
# In fad(t(X0), 1) :
# Algorithm may not have converged. Try another starting value.

# But counterintuitively ask for for 4 factors and we get a result:
res <- fad(t(X0), 4)
D <- res$uniquenesses
D1 <- diag(D)
D.inverse <- diag(1/D)
gamma <- res$loadings
svd.H <- svd(t(gamma) %*% D1 %*% gamma)
G <- (svd.H$u) %*% ginv(diag(svd.H$d)) %*% t(svd.H$v) %*% t(gamma) %*% D.inverse %*% X0
G

svaseq Method

uu <- eigen(t(X0) %*% X0)
vv <- uu$vectors
vv[1:4,]
somakd commented 4 years ago

Thanks for reporting the issue and painstakingly compiling the reproducible example.

I understand the case with four factors. The likelihood is unbounded:

A QR decomposition of the matrix X0 reveals that it has rank 4, despite being a 20865 x 6 matrix:

qr(X0)$rank

According to Proposition 3 of Roberts and Symons (2007), the likelihood is unbounded when the number of factors is equal to the rank. In fact, the proposition states that the the likelihood approaches infinity as the uniquenesses tend to zero. As a result all the estimates of the uniquenesses occur at the boundary set 0.005 by default (The argument lower=0.005 in fad).

res4 = fad(t(X0), 4)
all(res4$uniquenesses == 0.005)

However, I'm not sure what's wrong with number of factors less than or equal to 3. My hunch is that the defect number (as defined in Roberts and Symons, 2007) of the covariance matrix is 3. It's surely bigger than 1 because the data-matrix t(X0) is not full row-rank. In that case, Theorem 3(ii) indicates that the likelihood is also unbounded with number of factors smaller than 4.

In general, it is very difficult to compute the rank of a matrix in a matrix-free manner. In this problem, n is very small, so a QR decomposition is possible. Computing the defect is even more difficult as it requires computation of the null-space followed by its echelon reduction.

In the next release, we will compute the ranks of matrices when minimum of the two dimensions is at most 20 and warn if we detect rank defect.

Reference: Robertson, D., and Symons, J. (2007). Maximum Likelihood Factor Analysis With Rank-Deficient Sample Covariance Matrices, Journal of Multivariate Analysis, 98, 813–828.