haziqj / iprior

An R package for I-prior regression
GNU General Public License v3.0

Speeding up the programme #20

Open haziqj opened 8 years ago

haziqj commented 8 years ago

After some timing tests, these are the bottlenecks in the programme:

  1. The many matrix multiplications of the kernel matrix, H.mat %*% H.mat. We need some clever way to reduce these occurrences, and this is doable I think. The plan is to do all possible matrix multiplications once, store them, and use them as needed. This way they are not evaluated again and again.
  2. Evaluating the inverse of Var.Y. The default R method is solve(Var.Y). A faster, albeit less stable, way is to use the Cholesky decomposition, chol2inv(chol(Var.Y)), which is about 5 times faster than the regular method (see the sketch after this list).
  3. Figure out an alternative method to estimate the intercept, because the current Generalised Least Squares method that I employ heavily relies on the inverse of Var.Y. If I can find another method which does not depend on this inverse, then I can speed up the programme. For example, sometimes it is as simple as intercept = mean(Y).
  4. Calculation of the log-likelihood via the built-in function dmvnorm() also takes a substantial amount of time, but I am not sure how much can be reduced by writing our own function. For reference, dmvnorm() takes about 1/8th of the time of a solve(). Will look into this.
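
For reference, a quick benchmark of the two inversion routines in point 2 might look something like this (illustrative only; Var.Y here is just a random positive-definite stand-in, not the package's actual object):

library(microbenchmark)

set.seed(123)
N <- 500
Var.Y <- crossprod(matrix(rnorm(N * N), N)) + diag(N)  # symmetric positive-definite stand-in

microbenchmark(
  "solve"    = solve(Var.Y),
  "chol2inv" = chol2inv(chol(Var.Y)),
  times = 20
)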

@Wicher2 can you comment on the above points, specifically 2-4? Thanks.

Wicher2 commented 8 years ago

Hi Haziq,

  1. Yes, this is a major bottleneck, especially for larger data sets, so these should be stored. This requires extra storage space, but that doesn't tend to be a problem in my experience. (For simple low-dimensional linear models this is not so efficient, but I wouldn't worry about that initially.)
  2. After 1. has been done, this becomes the main bottleneck, I think. There is also a possible problem with numerical instability (whether using solve or Cholesky). In fact, the inverse does not need to be computed, since it is always multiplied by a vector anyway; that amounts to solving a linear equation, which is faster. There are two issues: computing the determinant of Var.Y, and computing solve(Var.Y) %*% a for a vector a.

For solving linear equations stably and fast (though I did initially make an error here as we found last week):

  (* solve (A + s*I) x = b; stable for small s and singular A *)
  Clear[linsolve]
  linsolve[a_, s_, b_] := Module[{eval, evec},
    {eval, evec} = Eigensystem[a];
    evec\[Transpose].((evec.b)/(eval + s))
    ];

The following generates a linear-solve function, so the eigensystem only needs to be computed once if, e.g., both solve(A,b) and solve(A,c) need to be computed. It is best to implement this function eventually, but maybe try out the above first.

  (* create function u such that u[c] solves (A + s*I) x = c *)
  linsolve[a_, s_] := Module[{eval, evec},
    {eval, evec} = Eigensystem[a];
    evec\[Transpose].((evec.#)/(eval + s)) &
    ];

Here & after an expression denotes a "pure function" with argument #.

Computing a determinant directly using built-in functions is potentially especially unstable (and I think both R and Mathematica use Intel libraries for this, so there are no fundamental differences). For the log-determinant, the following is a standard, relatively stable and fast method:

 logdet[m_]:=2Total[Log[Diagonal[CholeskyDecomposition[(m+m\[Transpose])/2]]]]

However, it would be more efficient to write a procedure that combines the above linsolve with computation of the log-determinant; in the above procedure, a line like this would need to be added:

logdet=Total[Log[eval+s]]

and give this as output as well. Still need to implement this myself.

Actually, I have now done this, and the following seems most convenient for the present purposes: output both the eigenvalues of H \Psi H (these can be used, e.g., for computing determinants but also for Tr[Inverse[Var.Y]]) and the linear-solve function:

EvalAndLinsolve[a_, s_] := Module[{eval, evec},
   {eval, evec} = Eigensystem[a];
   {eval, evec\[Transpose].((evec.#)/(eval + s)) &}
   ];
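
For reference, a rough R translation of this idea (illustrative only, not code from the package) would compute the eigendecomposition once and return the eigenvalues together with a solver function; the log-determinant is then available as sum(log(eval + s)):

# R analogue of EvalAndLinsolve: eigendecompose the symmetric matrix a once,
# then reuse it to solve (a + s*I) x = b for any b and to get log det(a + s*I).
eval.and.linsolve <- function(a, s) {
  es   <- eigen(a, symmetric = TRUE)
  eval <- es$values
  evec <- es$vectors
  list(
    eval   = eval,
    solve  = function(b) evec %*% (crossprod(evec, b) / (eval + s)),
    logdet = sum(log(eval + s))
  )
}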
Wicher2 commented 8 years ago

Hmm, some formatting went wrong; I think it's not a problem, but let me know if anything is unclear. To update: the added line should read logdet = Total[Log[eval + s]].

haziqj commented 8 years ago

No problem, I have formatted your comments so they can be read more easily (weird that I can do that, actually).

OK, let me digest this. The implementation of your point 2 above seems worth doing.

As for intercept = mean(Y): for some reason I didn't think this was always the case. When I was initially playing around with the iprior code, I got the impression it was not, especially when mixing the kernels, e.g. Canonical + Pearson, or when using the FBM kernel. I need to convince myself otherwise.

haziqj commented 8 years ago
  1. Apparently R already has a built-in linear solver: to solve Ax = b, use solve(A, b); when b is not specified, the inverse is calculated. If our intention is to calculate A^{-1} %*% b, then it is much faster to do solve(A, b) than solve(A) %*% b (see the sketch after this list). I do see that the advantage of coding it ourselves the way you explain above is that the eigensystem of A can be stored for later use.
  2. The way my EM works is by a sequential update of the parameters: update alpha, then lambda_1, ..., then lambda_p, then psi. At every stage of the update the eigensystem of A essentially changes, because A = psi * H_lambda %*% H_lambda, so there is no point in storing it.
  3. Furthermore, my exponential family EM also requires calculating W.hat = solve(Var.Y) + tcrossprod(w.hat), which is the tilde W in your I-prior manuscript. So really, I do need to calculate this inverse.
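
As an aside, the difference in point 1 is easy to see on a random example (illustrative only):

library(microbenchmark)

set.seed(456)
N <- 500
A <- crossprod(matrix(rnorm(N * N), N)) + diag(N)  # random s.p.d. matrix
b <- rnorm(N)

microbenchmark(
  "solve(A) %*% b" = solve(A) %*% b,  # forms the full inverse first
  "solve(A, b)"    = solve(A, b),     # solves the linear system directly
  times = 20
)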

However, with all these new matrix tricks I've learnt, there may be potential for speeding up the programme by going for the regular EM and abandoning the exponential family EM. This seems like a big undertaking, which I will leave for a bit later. I am currently working on the predicted log-likelihood feature, and after that the FBM kernel (with a fixed Hurst coefficient first).

haziqj commented 8 years ago

As for the log-determinant, and subsequently the calculation of the log-likelihood, I can't seem to beat the built-in function dmvnorm(). I even tried the Cholesky trick. It seems the authors of dmvnorm() have already optimised their function (in R anyway).

Now interestingly, there is another R package, mvnfast (https://github.com/mfasiolo/mvnfast), which uses C++ code to evaluate multivariate normal densities and claims to be faster than dmvnorm().

Of course, if I eventually implement the linear solver the way you describe, it would be nice to get the log-determinant for free by summing the logs of the (shifted) eigenvalues, as you point out above. A mission for another day.

Wicher2 commented 8 years ago

Yes, I think in terms of speed for a single evaluation you won't beat the built-in solvers, which should be close to the best possible. But as you say, something can be saved by doing the eigendecomposition just once per iteration; the saving can be substantial. Perhaps even more importantly, built-in functions can sometimes run into numerical problems; the functions I sent exploit the special structure of Var.Y to get better stability.

Regarding 2. above, yes, in that case there may not be a point in storing.

Regarding 3., indeed W.hat = solve(Var.Y) + tcrossprod(w.hat), but you only need W.hat in order to compute tr[A W.hat] for certain matrices A (NB: I am using a mix of Mathematica and R notation), so W.hat need not be computed explicitly. However, computing W.hat might not be a major bottleneck, and doing so may make the programming a bit easier.
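
One way of doing this in R (a sketch only, assuming the eigendecomposition Var.Y = V diag(d) t(V) is already available; none of these names are in the package) is to expand tr[A W.hat] = tr[A solve(Var.Y)] + t(w.hat) A w.hat and evaluate each term without ever forming W.hat:

# Sketch: tr(A %*% W.hat) where W.hat = solve(Var.Y) + tcrossprod(w.hat),
# given the eigendecomposition Var.Y = V %*% diag(d) %*% t(V).
tr.A.What <- function(A, V, d, w.hat) {
  AV <- A %*% V
  tr.A.VarY.inv <- sum(colSums(V * AV) / d)              # tr(A %*% solve(Var.Y))
  quad.form     <- drop(crossprod(w.hat, A %*% w.hat))   # t(w.hat) %*% A %*% w.hat
  tr.A.VarY.inv + quad.form
}

# Quick check on a small example:
set.seed(1)
N <- 5
A     <- crossprod(matrix(rnorm(N * N), N))
Var.Y <- crossprod(matrix(rnorm(N * N), N)) + diag(N)
w.hat <- rnorm(N)
es    <- eigen(Var.Y, symmetric = TRUE)
W.hat <- solve(Var.Y) + tcrossprod(w.hat)
all.equal(sum(diag(A %*% W.hat)), tr.A.What(A, es$vectors, es$values, w.hat))
# should return TRUE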

Overall, it's a bit of a puzzle to figure out how best to optimize, but quite a bit of saving and improved stability can be attained.

haziqj commented 8 years ago

All of the original issues will be addressed in the following update. Namely:

  1. The kernel matrix H.mat, its square H.matsq, and the like are needed at each iteration. Computing squared matrices is expensive, so these are now computed and stored at the beginning and looked up when required.
  2. Var.Y.inv is now calculated using the eigendecomposition trick above. The eigendecomposition is also used to calculate the log-determinant in the log-likelihood, which is more stable. Note: sometimes we do not need the explicit value of Var.Y.inv, but rather Var.Y.inv %*% b for some vector b. In such cases, the linear solver described above is less expensive, and this is also included in the upcoming update.
  3. The MLE for the intercept is indeed mean(Y). Knowing this, the intercept is no longer a part of the EM update.
  4. The lambda parameters are updated sequentially. Previously, Var.Y.inv was updated every time lambda[k] was updated, but this proved to be quite expensive. Convergence can still be achieved if these quantities are updated only once per EM iteration, rather than at every [k].

I have also identified the parts of the code which take the longest to run in the iprior package. These are:

  1. Calculation of S.mat which involves manipulation of H.mat matrices and lambda.
  2. Eigendecomposition procedure.
  3. Calculation of H.mat.lamsq.
  4. Calculation of Var.Y.inv.

Each is addressed below.

Calculation of S.mat

The bottleneck involving S.mat is really the many matrix multiplications that need to be done. By doing these at the beginning and storing them, this issue goes away.
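
A minimal illustration of this precompute-and-store idea (hypothetical object names, not the package internals):

# Compute all pairwise products of the kernel matrices once, before the EM loop.
set.seed(1)
N <- 100
H.list <- lapply(1:2, function(i) {
  H <- matrix(rnorm(N * N), N)
  (H + t(H)) / 2            # symmetrise, just for illustration
})
H.prod <- lapply(H.list, function(Hk) lapply(H.list, function(Hl) Hk %*% Hl))
# Inside the EM loop, S.mat-type quantities are then assembled from H.prod
# instead of re-multiplying the kernel matrices at every iteration.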

Eigendecomposition

I have ported this part of the code to C++ using the Eigen library. The default R function is eigen(); by passing the option symmetric=T, the function runs a little quicker, but porting the eigendecomposition to C++ is fastest (a sketch of the port follows the benchmark below). The results are for a 1000 x 1000 matrix.

> microbenchmark(
        "eigen.R"=eigen(Z), 
        "eigen.R.symm"=eigen(Z, symmetric=T), 
        "Rcpp"=EigenCpp(Z)
    )
## Unit: milliseconds
##          expr       min        lq      mean    median        uq      max neval cld
##       eigen.R 1442.4800 1496.9692 1531.5655 1522.1981 1567.7182 1640.340   100   c
##  eigen.R.symm 1394.8606 1463.5625 1483.7722 1479.3523 1497.9794 1657.661   100  b 
##          Rcpp  883.0575  923.9286  953.3621  945.0064  973.2185 1129.158   100 a  
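
For reference, the C++ port might look roughly like this via RcppEigen (a sketch only; the actual EigenCpp() in the package may differ):

library(Rcpp)  # requires the RcppEigen package to be installed
sourceCpp(code = '
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

// [[Rcpp::export]]
Rcpp::List EigenCpp(const Eigen::Map<Eigen::MatrixXd> X) {
  // SelfAdjointEigenSolver exploits symmetry, like eigen(Z, symmetric = TRUE)
  Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(X);
  return Rcpp::List::create(Rcpp::Named("values")  = es.eigenvalues(),
                            Rcpp::Named("vectors") = es.eigenvectors());
}
')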

Calculation of H.mat.lamsq

Typically, H.mat.lamsq = H.mat.lam %*% H.mat.lam is an O(n^3) operation, but the Eigen library in C++ does something clever: when squaring a symmetric matrix, only the upper (or lower) triangle needs to be computed and then duplicated. This results in a fast squaring operation (a sketch of such a function follows the benchmark below). For a 1000 x 1000 matrix, the timings are:

> microbenchmark(
    "%*%"=Z%*%Z,
    "Rcpp"=FastSquare(Z)    
)
## Unit: milliseconds
##  expr       min        lq      mean    median       uq      max neval cld
##   %*% 573.78687 607.06477 641.69264 636.73526 674.1643 739.1447   100   b
##  Rcpp  87.65426  93.16196  97.42494  96.18368 101.1119 115.4750   100  a 
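
One possible implementation of such a FastSquare() via RcppEigen (again just a sketch; the package's version may differ) uses rankUpdate() to fill one triangle and then mirrors it:

library(Rcpp)  # requires the RcppEigen package to be installed
sourceCpp(code = '
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

// [[Rcpp::export]]
Eigen::MatrixXd FastSquare(const Eigen::Map<Eigen::MatrixXd> X) {
  const int n = X.rows();
  Eigen::MatrixXd tmp = Eigen::MatrixXd::Zero(n, n);
  // For symmetric X, X * t(X) equals X %*% X; rankUpdate() computes this
  // while writing only the lower triangle.
  tmp.selfadjointView<Eigen::Lower>().rankUpdate(X);
  // Mirror the lower triangle into a full symmetric matrix.
  Eigen::MatrixXd res = tmp.selfadjointView<Eigen::Lower>();
  return res;
}
')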

Calculation of Var.Y.inv

As mentioned, we use the fact that Var.Y = A + s * I (for a scalar s), and if (u, V) are the eigenvalues and eigenvectors of A, then the inverse can be obtained as Var.Y.inv = V %*% diag(1/(u+s)) %*% t(V). A better way is to expand the vector 1/(u+s) row-wise into an N x N matrix and take the Hadamard product, as follows: Var.Y.inv = (V * matrix(1/(u+s), nr=N, nc=N, byrow=T)) %*% t(V). An even faster way is t(t(V) * (1/(u+s))) %*% t(V). However, using C++ we can gain even more speed.

> microbenchmark(
    "naive"= tmp1 <- Z %*% diag(z) %*% t(Z), 
    "Hadamard"= tmp2 <- (Z * matrix(z, nr=1000, nc=1000, byrow=T))%*%t(Z), 
    "smart_transpose"= tmp3 <- t(t(Z) * z) %*% t(Z), 
    "Rcpp"= tmp4 <- FastVdiag2(Z, z)
)
## Unit: milliseconds
##             expr       min        lq      mean    median        uq       max neval cld
##            naive 1181.0930 1324.0387 1380.5610 1362.8738 1446.2571 1688.9346   100   c
##         Hadamard  592.4630  677.1544  743.0098  708.9243  794.9454 1009.0804   100  b 
##  smart_transpose  607.3514  669.5603  724.9866  702.4397  760.3104  946.3763   100  b 
##             Rcpp  160.4689  173.4585  199.6922  181.9600  200.6700  432.0324   100 a  
> all.equal(tmp1, tmp2, tmp3, tmp4)
## [1] TRUE

Conclusion

There are probably a few more tweaks that can be made. For instance, we don't really need H.mat.lamsq itself, only tr(H.mat.lamsq %*% W.hat), where W.hat = Var.Y.inv + tcrossprod(w.hat) - so can the matrix multiplication be avoided?
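
One candidate (just a sketch, not yet in the package) is the identity tr(A %*% B) = sum(A * t(B)), which for symmetric B reduces to sum(A * B), so the full matrix product never has to be formed:

# Trace of a matrix product without the O(n^3) multiplication:
# tr(A %*% B) = sum over i, j of A[i, j] * B[j, i] = sum(A * t(B)),
# and for symmetric B this is simply sum(A * B), an O(n^2) operation.
tr.prod <- function(A, B) sum(A * t(B))

# Quick check on small random symmetric matrices:
set.seed(789)
A <- crossprod(matrix(rnorm(25), 5))
B <- crossprod(matrix(rnorm(25), 5))
all.equal(sum(diag(A %*% B)), tr.prod(A, B))
# should return TRUE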

In any case, the changes that have been implemented do bring some improvement in speed. In practice, this is more than 4 times faster. For the cats dataset:

> system.time(mod.new <- iprior(Hwt~., data=cats, lambda=c(10,10), psi=10, progress="none"))
##    user  system elapsed 
##   6.573   1.096   7.631 
> system.time(mod.old <- iprior.old(Hwt~., data=cats, lambda=c(10,10), psi=10, progress="none"))
##    user  system elapsed 
##  33.979   2.754  36.565 

> coef(mod.new)
##  (Intercept)      lambda1      lambda2          psi
## 10.630555556 -0.002466022  1.002298802  0.477464200
> mod.new$log.lik
## [1] -260.3357
> coef(mod.old)
##  (Intercept)      lambda1      lambda2          psi 
## 10.630555556 -0.002465982  1.002274360  0.477481411 
> mod.old$log.lik
## [1] -260.3357

And for the stackloss dataset:

> system.time(mod.new <- iprior(stack.loss ~ ., data=stackloss, lambda=1:3, psi=10, progress="none"))
##    user  system elapsed 
##   0.226   0.016   0.244 
> system.time(mod.old <- iprior.old(stack.loss ~ ., data=stackloss, lambda=1:3, psi=10, progress="none"))
##    user  system elapsed 
##   1.588   0.051   1.629 

> coef(mod.new)
## (Intercept)     lambda1     lambda2     lambda3         psi 
## 17.52380952  0.04082678  0.22269432 -0.01226573  0.10574022   
> mod.new$log.lik
## [1] -56.34791
> coef(mod.old)
## (Intercept)     lambda1     lambda2     lambda3         psi 
## 17.52380952  0.04082316  0.22258619 -0.01226840  0.10576286 
> mod.old$log.lik
## [1] -56.34791
haziqj commented 8 years ago

The action item after this next update will be: Are there any algorithmic improvements that can be made, e.g. nice matrix tricks to avoid matrix multiplications?

haziqj commented 8 years ago

After yet another round of tweaking, I have managed to speed up the programme, which now also works for parsimonious interactions. Consider the hsb.small dataset (see here). The timings show a greater than four-fold improvement:

> profvis(mod.new <- iprior(mathach ~ ses + schoolid + ses:schoolid, data=hsb.small,
+ lambda=c(0.5,0.5), psi=0.05, report.int=10))
## time taken: 26130 ms for 68 iterations
## 384 ms per iteration
> profvis(mod.old <- iprior.old(mathach ~ ses + schoolid + ses:schoolid, data=hsb.small,
+ lambda=c(0.5,0.5), psi=0.05, report.int=10))
## time taken: 108720 ms for 48 iterations
## 2265 ms per iteration

> coef(mod.new)
## (Intercept)     lambda1     lambda2         psi 
## 13.68325416  0.41779781  0.13230415  0.02804779 
> mod.new$log.lik
## [1] -2137.799
> coef(mod.old)
## (Intercept)     lambda1     lambda2         psi 
##  13.7379830   0.4235630   0.1324561   0.0280220 
> mod.old$log.lik
## [1] -2137.773

As a point to note, the old version actually iterated the intercept in the EM, instead of just using the mean of y as the MLE. Also, for parsimonious interactions I used to use the built-in R optimiser optim() to maximise the expected log-likelihood from the E-step (before I found closed-form expressions for lambda). Noticeably, the two versions converge to slightly different log-likelihoods, but I'm not sure how to account for this because so many things were changed, including the calculation of the log-likelihood using the eigendecomposition instead of built-in functions. This is not too worrying, because the values are quite similar. Finally, the old version took fewer iterations to converge, probably because the sequential EM updated Var.Y and w.hat as each parameter was updated, whereas the new one does not. Overall, the new version is still quicker even with slightly more iterations.

Just to compare how long it takes to fit this model with non-parsimonious interactions:

> profvis(mod <- iprior(mathach ~ ses + schoolid + ses:schoolid, data=hsb.small, parsm=F, 
+ lambda=c(0.5,0.5,0.5^2), psi=0.05, report=10))
## time taken: 36270 ms for 96 iterations
## 378 ms per iteration

This is to illustrate that parsimonious does not necessarily mean 'fast': each EM iteration of the parsimonious method is a bit heavier, so it may take more time than the non-parsimonious method.

haziqj commented 8 years ago

To fix: https://github.com/haziqjamil/iprior/commit/87f72554a0a35f6ad5a07ab108ff367f1599d251#commitcomment-18016782

EDIT: Fixed in next commit.

> microbenchmark(fnH1(cattle$id), fnH1(cattle$group))
Unit: milliseconds
               expr       min       lq     mean   median       uq       max neval cld
    fnH1(cattle$id) 10.277347 11.54698 12.41087 12.00681 12.56541  18.70607   100   a
 fnH1(cattle$group)  8.202576 10.26499 12.62646 11.61246 12.71803 103.49686   100   a