mixOmicsTeam / mixOmics

Development repository for the Bioconductor package 'mixOmics'
http://mixomics.org/

Repeated matrix calculation in `predict()` inflates runtime #236

Open Max-Bladen opened 2 years ago

Max-Bladen commented 2 years ago

🐞 Describe the bug:

Referring to the following formula used to make predictions via a (s)PLS model:

$Y = XW(P'W)^{-1}C' + Y_r$

In the academic literature, the $W(P'W)^{-1}$ matrix is referred to as $W^*$. Looking at the following lines (593-603 from `predict()`), we can see the calculation involving $W^*$ occurring three times within a loop. This matrix is constant, as $W$, $P$ and $C$ are all outputs of the fitted model used to make predictions. Hence, this repeated calculation is extremely inefficient and bloats runtime significantly.

```r
# W(P'W)^{-1} is computed here ...
Ypred = lapply(1:ncomp[i], function(x) {
    concat.newdata[[i]] %*% Wmat[, 1:x] %*%
        solve(t(Pmat[, 1:x]) %*% Wmat[, 1:x]) %*% t(Cmat)[1:x, ]
})
Ypred = sapply(Ypred, function(x) {x * sigma.Y + means.Y}, simplify = "array")

# in case of one observation and only one Y, array() keeps a third dimension for ncomp
Y.hat[[i]] = array(Ypred, c(nrow(newdata[[i]]), ncol(Y), ncomp[i]))

# ... again here ...
t.pred[[i]] = concat.newdata[[i]] %*% Wmat %*% solve(t(Pmat) %*% Wmat)
t.pred[[i]] = matrix(
    data = sapply(1:ncol(t.pred[[i]]), function(x) {
        t.pred[[i]][, x] *
            apply(variatesX[[i]], 2, function(y) {(norm(y, type = "2"))^2})[x]
    }),
    nrow = nrow(concat.newdata[[i]]), ncol = ncol(t.pred[[i]]))

# ... and a third time here
B.hat[[i]] = sapply(1:ncomp[i], function(x) {
    Wmat[, 1:x] %*% solve(t(Pmat[, 1:x]) %*% Wmat[, 1:x]) %*% t(Cmat)[1:x, ]
}, simplify = "array")
```
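To make the repetition concrete, here is a small numpy sketch (the shapes and names are illustrative assumptions, not mixOmics code) showing that hoisting $W^* = W(P'W)^{-1}$ out of the three statements leaves all three results unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, ncomp = 50, 200, 4, 3         # assumed shapes: samples, X-vars, Y-vars, components
X = rng.normal(size=(n, p))            # stands in for concat.newdata[[i]]
W = rng.normal(size=(p, ncomp))        # Wmat
P = rng.normal(size=(p, ncomp))        # Pmat
C = rng.normal(size=(q, ncomp))        # Cmat

# Original pattern: (P'W)^{-1} is evaluated separately for Y.hat, t.pred and B.hat
Y_hat  = X @ W @ np.linalg.solve(P.T @ W, C.T)
t_pred = X @ W @ np.linalg.inv(P.T @ W)
B_hat  = W @ np.linalg.inv(P.T @ W) @ C.T

# Hoisted pattern: W* = W (P'W)^{-1} is computed once and reused everywhere
W_star = W @ np.linalg.inv(P.T @ W)
assert np.allclose(Y_hat,  X @ W_star @ C.T)
assert np.allclose(t_pred, X @ W_star)
assert np.allclose(B_hat,  W_star @ C.T)
```

The three assertions pass because each statement only differs in what multiplies the shared $W^*$ factor.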

🤔 Expected behavior: The $W^*$ calculation should be performed once, rather than three times within a loop.


💡 Possible solution: A simple, initial fix will be implemented first: performing the calculation once, prior to the loop, and referring to the stored object should reduce runtime.

A more complicated solution would be to adjust the output of our various functions (e.g. spls(), splsda(), etc.) to return these crucial matrices. This would allow users to understand the model and use it for their own purposes, as well as decrease runtime by reducing the number of matrix calculations.
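As a rough sketch of that second idea (hypothetical function names, numpy standing in for the R internals), the fitting step could store $W^*$ so that prediction never calls a solve at all:

```python
import numpy as np

def fit_pls_like(W, P, C):
    """Hypothetical: bundle the model matrices plus the precomputed W* = W (P'W)^{-1}."""
    return {"W": W, "P": P, "C": C,
            "W_star": W @ np.linalg.inv(P.T @ W)}

def predict_pls_like(model, X_new):
    """Reuse the stored W*; only two matrix products at prediction time."""
    return X_new @ model["W_star"] @ model["C"].T

rng = np.random.default_rng(2)
p, q, ncomp = 100, 3, 2
model = fit_pls_like(rng.normal(size=(p, ncomp)),   # illustrative random W
                     rng.normal(size=(p, ncomp)),   # illustrative random P
                     rng.normal(size=(q, ncomp)))   # illustrative random C
X_new = rng.normal(size=(10, p))
Y_hat = predict_pls_like(model, X_new)              # shape (10, 3)
```

This mirrors the suggestion above: the expensive constant is paid once at fit time and every subsequent `predict()` call just reads it back.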


Credit: This issue was reported by @psalguerog. I greatly appreciate you bringing this to my attention.

Max-Bladen commented 1 year ago

Here's a brief summary of the work I've done so far, specifically regarding the inflated runtime.

The old set of code can be seen above. It was suspected to be inefficient because $W^*$ was seemingly calculated three times.

I adjusted the code as can be seen below:

```r
# compute W* once, before the per-component loop
W.star <- Wmat %*% solve(t(Pmat) %*% Wmat)

B.hat[[i]] = sapply(1:ncomp[i], function(x) {
    matrix(W.star[, 1:x], ncol = x) %*% t(Cmat)[1:x, ]
}, simplify = "array")

# Prediction Y.hat, B.hat and t.pred
Ypred = lapply(1:ncomp[i], function(x) {concat.newdata[[i]] %*% B.hat[[i]][, , x]})
Ypred = sapply(Ypred, function(x) {x * sigma.Y + means.Y}, simplify = "array")

# in case of one observation and only one Y, array() keeps a third dimension for ncomp
Y.hat[[i]] = array(Ypred, c(nrow(newdata[[i]]), ncol(Y), ncomp[i]))

t.pred[[i]] = concat.newdata[[i]] %*% W.star
t.pred[[i]] = matrix(
    data = sapply(1:ncol(t.pred[[i]]), function(x) {
        t.pred[[i]][, x] *
            apply(variatesX[[i]], 2, function(y) {(norm(y, type = "2"))^2})[x]
    }),
    nrow = nrow(concat.newdata[[i]]), ncol = ncol(t.pred[[i]]))
```
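One subtlety worth noting: slicing `W.star[, 1:x]` gives the same answer as the original per-component `solve()` only because NIPALS deflation makes $P'W$ unit upper triangular. The following numpy sketch, using a minimal NIPALS PLS2 (an illustrative reimplementation, not mixOmics internals), checks that property and the resulting equality:

```python
import numpy as np

def nipals_pls2(X, Y, ncomp, tol=1e-12, max_iter=500):
    """Minimal NIPALS PLS2: returns loading weights W, X-loadings P, Y-loadings C."""
    Xk, Yk = X.copy(), Y.copy()
    W, P, C = [], [], []
    for _ in range(ncomp):
        u = Yk[:, [0]]
        for _ in range(max_iter):
            w = Xk.T @ u
            w /= np.linalg.norm(w)
            t = Xk @ w
            c = Yk.T @ t / (t.T @ t)
            u_new = Yk @ c / (c.T @ c)
            if np.linalg.norm(u_new - u) < tol:
                break
            u = u_new
        p = Xk.T @ t / (t.T @ t)
        Xk -= t @ p.T          # deflation is what makes P'W unit upper triangular
        Yk -= t @ c.T
        W.append(w); P.append(p); C.append(c)
    return np.hstack(W), np.hstack(P), np.hstack(C)

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 15))
Y = rng.normal(size=(30, 3))
W, P, C = nipals_pls2(X, Y, ncomp=3)

W_star = W @ np.linalg.inv(P.T @ W)
for h in range(1, 4):
    # truncated recomputation (old code) vs sliced W* (new code)
    old = W[:, :h] @ np.linalg.inv(P[:, :h].T @ W[:, :h]) @ C[:, :h].T
    new = W_star[:, :h] @ C[:, :h].T
    assert np.allclose(old, new)
```

For a general (non-deflated) pair of $W$ and $P$ matrices the slicing shortcut would not hold, so the refactor leans on this structural property of PLS.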

The main differences are:

- $W^*$ is computed once, as `W.star`, before the per-component loop.
- `B.hat` is computed first (from `W.star`) and then reused to form `Ypred`, instead of repeating the full matrix product.
- `t.pred` is computed directly from `W.star`.

Since this reduces the number of required matrix multiplications, I assumed it would reduce runtime. Using:

```r
# liver.toxicity data shipped with mixOmics
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
```

and randomly generating 100 samples with the same number of columns as X (each drawn from its own normal distribution) to use as testing data, I ran the default predict() function and the adjusted predict() function 5000 times each and compared their runtimes. For peace of mind, on every iteration the predictions from the two forms of predict() were validated to be equal (to 10 significant figures). Histograms of the runtimes can be seen below:
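A numpy analogue of that benchmark-plus-validation loop might look like the sketch below (the shapes, iteration counts, and the construction of P are assumptions; the actual comparison was run on the R functions). P is built so that $P'W$ is unit upper triangular, mimicking what NIPALS deflation guarantees:

```python
import numpy as np
import timeit

rng = np.random.default_rng(4)
n, p, q, ncomp = 100, 3000, 10, 2      # assumed shapes, roughly gene-expression sized
X_new = rng.normal(size=(n, p))
W = rng.normal(size=(p, ncomp))
C = rng.normal(size=(q, ncomp))
# Contrived P giving unit-upper-triangular P'W, mimicking the NIPALS property
U = np.triu(rng.normal(size=(ncomp, ncomp)), k=1) + np.eye(ncomp)
P = W @ np.linalg.inv(W.T @ W) @ U.T

def predict_old():
    # recompute the inverse for every component (mirrors the original loop)
    return [X_new @ W[:, :h] @ np.linalg.inv(P[:, :h].T @ W[:, :h]) @ C[:, :h].T
            for h in range(1, ncomp + 1)]

def predict_new():
    # hoist W* out of the loop (mirrors the adjusted code)
    W_star = W @ np.linalg.inv(P.T @ W)
    return [X_new @ W_star[:, :h] @ C[:, :h].T for h in range(1, ncomp + 1)]

# Validate equality first (the "peace of mind" check), then time both variants
for a, b in zip(predict_old(), predict_new()):
    assert np.allclose(a, b)
print("old:", timeit.timeit(predict_old, number=200))
print("new:", timeit.timeit(predict_new, number=200))
```

No timing outcome is asserted here; as the histograms below show, the relative cost depends on how much of the total work the hoisted term actually accounts for.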

[Histograms of runtimes for the default and adjusted predict() implementations]

So you can see that the runtime was not improved at all; if anything, it was made worse. The same held when using subsets of the liver.toxicity$gene data for training and testing. This result seems counterintuitive.