hredestig / pcaMethods

Perform PCA on data with missing values in R
GNU General Public License v2.0
45 stars 10 forks source link

are the rows and columns correct in bpca? #1

Closed hredestig closed 8 years ago

hredestig commented 8 years ago

Rescuing and old issue that got lost in git/svn mayhem.

"It seems that the input matrix should be transposed in bpca given the matlab implementation?"

hredestig commented 8 years ago

In pcaMethods, we use the convention that the PCA model as X=mu+TP'+E where T is scores matrix and P the loadings. Observations are in rows and variables in columns. PCA is (mostly) done on the mean-centered matrix so that mu represents the mean values for each variable. For BPCA the situation is the same but the original implementation uses a different notation. The result is however the same. Let's test that.

wget http://ishiilab.jp/member/oba/tools/sources/sample999.dat

sample999 has 758 rows and 50 cols.

With pcaMethods convention columns are variables as we compute their averages and obtain those from M.mu.

x999 = load('sample999.dat');
[y, M] = BPCAfill(x999, k=2);
size(M.W) # 50 rows so this must be the loadings
loadings = M.W; # 50 rows
save('loadings.dat', 'loadings');
# we get the scores by matrix multiplication
scores = y * M.W;
save('scores.dat', 'scores');

Check similarity in R, loadings / scores can be rotated without any difference to the result

mat <- as.matrix(read.table("sample999.dat"))
mat[(mat - 999)^2 < 1e-4] <- NA
pc <- pca(mat, method="bpca", verbose=FALSE)
dim(loadings(pc))
## [1] 50  2
octload <- as.matrix(read.table("loadings.dat"))
octscores <- as.matrix(read.table("scores.dat"))
cor(octload, loadings(pc))
##         PC1       PC2
## V1 1.000000 -0.017361
## V2 0.017293 -1.000000
cor(octscores, scores(pc))
##           PC1         PC2
## V1 1.00000000 -0.00014171
## V2 0.00022622 -0.99999999

Result is the same. However, the BPCA matlab documentation clearly indicates that the input matrix is to be 'genes' (mostly interpreted as variables) x 'samples'. From that follows that BPCA in the matlab version simply switches the concept of variable and sample compared to the interpretation in pcaMethods. From the point of missing value estimation, one might get good results that way as well, as BPCA test also has clearly shown but for interpretation is it a bit unusual. For the sake of not causing confusion compared to how other pcaMethods work, we do not adopt the BPCA style but leave the option to transpose the input matrix to the user instead.

In pcaMethods 1.63.0 this is now further emphasized in ?bpca