haziqj opened this issue 8 years ago
Hi Haziq,
Regarding inverting Var.Y, and computing solve(Var.Y) %*% a for a vector a. For solving linear equations stably and fast (though I did initially make an error here, as we found last week):
(* solve (A+s*I)x=b; stable for small s and singular A *)
Clear[linsolve]
linsolve[a_, s_, b_] := Module[{eval, evec},
  {eval, evec} = Eigensystem[a];
  evec\[Transpose].((evec.b)/(eval + s))
];
The following generates a linear-solve function, so the eigensystem only needs to be computed once if, e.g., both solve(A, b) and solve(A, c) need to be computed. It is best to implement this function, but maybe try out the above first.
(* create function u such that u[c] solves (A+s*I)x=c *)
linsolve[a_, s_] := Module[{eval, evec},
  {eval, evec} = Eigensystem[a];
  evec\[Transpose].((evec.#)/(eval + s)) &
];
Here & after an expression denotes a "pure function" with argument #.
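As a minimal R sketch of the same idea (illustrative only, assuming A is symmetric; not the package's actual code), a function factory that does the eigendecomposition once and returns a solver for (A + s*I) x = b:

linsolve <- function(A, s) {
  es <- eigen(A, symmetric = TRUE)
  eval <- es$values
  evec <- es$vectors  # columns are eigenvectors, so A = evec %*% diag(eval) %*% t(evec)
  function(b) drop(evec %*% (crossprod(evec, b) / (eval + s)))  # (A + s*I)^{-1} b
}

For example, solver <- linsolve(A, s) followed by solver(b) and solver(c) reuses the same eigendecomposition for both solves.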
Computing a determinant directly using built-in functions is potentially very unstable (and I think both R and Mathematica use Intel libraries for this, so there are no fundamental differences). For the log-determinant, the following is a standard, relatively stable and fast method:
logdet[m_] := 2 Total[Log[Diagonal[CholeskyDecomposition[(m + m\[Transpose])/2]]]]
However, it would be more efficient to write a procedure that combines the above linsolve with the computation of the log-determinant; in the above procedure, a line like

logdet = Total[Log[eval + s]]

would need to be added, and this given as output as well. I still need to implement this myself.
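For reference, a sketch of the corresponding R one-liners (assuming m is symmetric positive definite, and eval, s as in linsolve above):

logdet_chol  <- function(m) 2 * sum(log(diag(chol((m + t(m)) / 2))))  # Cholesky route
logdet_eigen <- function(eval, s) sum(log(eval + s))                  # log det(A + s*I) from the eigenvalues of A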
Actually, I have now done this, and the following seems most convenient for the present purposes: output the eigenvalues of H Ψ H (these can be used e.g. for computing determinants, but also for Tr[Inverse[Var.Y]]) together with the linear-solve function:
EvalAndLinsolve[a_, s_] := Module[{eval, evec},
  {eval, evec} = Eigensystem[a];
  {eval, evec\[Transpose].((evec.#)/(eval + s)) &}
];
Also, regarding intercept = mean(Y): why might this not work?

Hmm, some formatting goes wrong; I think it's not a problem, but let me know if anything is unclear.
Update: line logdet = Total[Log[eval + s]]
No problem, I formatted your comments so I can see them better (weird that I can actually do that).
OK, let me digest this. The implementation of your point 2 above seems worth doing.
As for intercept = mean(Y): for some reason I didn't think this was always the case. When I was initially playing around with the iprior code, I got the impression it was not, especially when mixing kernels, e.g. Canonical + Pearson, or when using the FBM kernel. I need to convince myself otherwise.
1. In R, to solve Ax = b, use solve(A, b). When b is not specified, the inverse is calculated. If our intention is to calculate A^{-1} %*% b, then it is way faster to do solve(A, b) than solve(A) %*% b. I get that the advantage of coding it ourselves, the way you explain above, is that we can store the eigensystem of A for later use (see the short sketch below).
2. In the EM, the parameters are updated in the order alpha, then lambda_1, then ..., then lambda_p, then psi. At every stage of the update, the eigensystem of A essentially changes, because A = psi * H_lambda %*% H_lambda. So there is no point in storing it.
3. W.hat = solve(Var.Y) + tcrossprod(w.hat), which is the tilde W in your I-prior manuscript. So really, I do need to calculate this inverse.

However, with all these new matrix tricks I've learnt, there may be potential to speed up the programme by going for the regular EM and abandoning the exponential-family EM. This seems to be a big undertaking, which I will leave for a bit later. I am doing the predicted log-likelihood feature, and after that the FBM kernel (with a fixed Hurst coefficient first).
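A tiny illustration of the point in 1., with a stand-in matrix (only the equivalence is shown here, not the timing difference):

set.seed(1)
n <- 500
A <- crossprod(matrix(rnorm(n * n), n)) + diag(n)  # symmetric positive definite stand-in
b <- rnorm(n)
x1 <- solve(A) %*% b  # forms the full inverse first
x2 <- solve(A, b)     # solves the linear system directly
all.equal(drop(x1), x2)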
As for the log-determinant, and subsequently the calculation of the log-likelihood, I can't seem to beat the built-in function dmvnorm(). I even tried the Cholesky trick. It seems the authors of dmvnorm() have already optimised their function (in R, anyway).
Interestingly, there is another R package which uses C++ code to evaluate multivariate normal densities, and which claims to be faster than dmvnorm(). The link is https://github.com/mfasiolo/mvnfast and the R package is called mvnfast.
Of course, if I eventually get the linear solver working the way you do it, then it would be nice to get the log-determinant for free by taking the sum of the logs of the eigenvalues, as you point out above. A mission for another day.
Yes, I think in terms of speed for a single evaluation you won't beat the built-in solvers, which should be close to the best possible. But, as you say, something can be saved by doing an eigendecomposition just once per iteration; the saving can be substantial. Maybe even more important, sometimes you can get numerical problems by using built-in functions; the functions I sent exploit the special structure of Var.Y to get better stability.
Regarding 2. above, yes, in that case there may not be a point in storing.
Regarding 3., indeed W.hat = solve(Var.Y) + tcrossprod(w.hat), but you only need W.hat in order to compute tr[A W.hat] for certain matrices A (NB I am using a mix of Mathematica and R notation), so W.hat need not be computed explicitly. However, computing W.hat might not be a major bottleneck, and it may make programming a bit easier.
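A possible R sketch of this (hypothetical function name; assumes the eigendecomposition Var.Y = V diag(u + s) t(V) and the vector w.hat are already available):

tr_A_W_hat <- function(A, V, u, s, w.hat) {
  d <- u + s
  # tr(A %*% Var.Y.inv) = sum_i t(v_i) %*% A %*% v_i / d_i, over the columns v_i of V
  tr_A_VarYinv <- sum(colSums(V * (A %*% V)) / d)
  # tr(A %*% tcrossprod(w.hat)) = t(w.hat) %*% A %*% w.hat
  tr_A_ww <- drop(crossprod(w.hat, A %*% w.hat))
  tr_A_VarYinv + tr_A_ww
}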
Overall, it's a bit of a puzzle to figure out how best to optimize, but quite a bit of saving and improved stability can be attained.
All of the original issues will be addressed in the following update. Namely:

1. H.mat, its square H.matsq, and the like are called at each iteration. Computing squared matrices is expensive, so this is now done and stored at the beginning and called when required.
2. Var.Y.inv is now calculated using the eigendecomposition trick above. The eigendecomposition is also used to calculate the log-determinant in the log-likelihood value, which is more stable. Note: sometimes we don't need the explicit value of Var.Y.inv, but rather Var.Y.inv %*% b, where b might be a vector. In such cases, the linear solver described is less expensive, and this is also included in the upcoming update.
3. The MLE of the intercept is simply mean(Y). Knowing this, the intercept is no longer a part of the EM update.
4. The lambda parameters are updated sequentially. Last time, we used to update Var.Y.inv every time lambda[k] was updated, but this proved to be quite expensive. Convergence can still be achieved if these are not updated at every [k], but at every EM iteration instead.

I have also identified the code which takes the longest time to run in the iprior package:

1. Calculation of S.mat, which involves manipulation of the H.mat matrices and lambda.
2. H.mat.lamsq.
3. Var.Y.inv.

Each is addressed below:
S.mat
The bottleneck involving S.mat really is the many matrix multiplications that need to be done. By doing these at the beginning and storing them, this issue goes away.
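A sketch of the store-and-reuse idea with illustrative stand-ins (not the package's actual code):

set.seed(1)
n <- 100; p <- 2
H <- replicate(p, crossprod(matrix(rnorm(n * n), n)), simplify = FALSE)  # stand-ins for the H.mat matrices
H.cross <- lapply(seq_len(p), function(i) lapply(seq_len(p), function(j) H[[i]] %*% H[[j]]))
# Inside the EM loop, H.cross[[i]][[j]] is looked up instead of recomputing H[[i]] %*% H[[j]].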
I have ported this part of the code to C++ using the Eigen library. The default R function is eigen()
. By passing the option symmetric=T
, the function can run a little bit quicker. But porting the eigendecomposition to C++ is the fastest. The results are for a 1000 x 1000 matrix.
> microbenchmark(
"eigen.R"=eigen(Z),
"eigen.R.symm"=eigen(Z, symmetric=T),
"Rcpp"=EigenCpp(Z)
)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## eigen.R 1442.4800 1496.9692 1531.5655 1522.1981 1567.7182 1640.340 100 c
## eigen.R.symm 1394.8606 1463.5625 1483.7722 1479.3523 1497.9794 1657.661 100 b
## Rcpp 883.0575 923.9286 953.3621 945.0064 973.2185 1129.158 100 a
H.mat.lamsq
Typically, computing H.mat.lamsq = H.mat.lam %*% H.mat.lam is O(n^3), but the Eigen library in C++ does something clever: when squaring a symmetric matrix, only the upper/lower triangle is computed and then mirrored. This results in a fast squaring operation. For a 1000 x 1000 matrix, the timings are:
> microbenchmark(
"%*%"=Z%*%Z,
"Rcpp"=FastSquare(Z)
)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## %*% 573.78687 607.06477 641.69264 636.73526 674.1643 739.1447 100 b
## Rcpp 87.65426 93.16196 97.42494 96.18368 101.1119 115.4750 100 a
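As an aside, a pure-R option that might be worth comparing (an assumption on my part, not benchmarked here): for symmetric Z, Z %*% Z equals crossprod(Z), and crossprod() uses a symmetric BLAS routine (dsyrk) that is typically faster than a general %*%.

Z <- crossprod(matrix(rnorm(1000^2), 1000))  # symmetric stand-in, as in the benchmarks
all.equal(crossprod(Z), Z %*% Z)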
Var.Y.inv
As mentioned, we use the fact that Var.Y = A + diag(s), and if (u, V) are the eigenvalues and eigenvectors of A, then the inverse can be obtained as Var.Y.inv = V %*% diag(1/(u+s)) %*% t(V). A better way to do this is by expanding 1/(u+s) row-wise into an N x N matrix and taking the Hadamard product, as follows: Var.Y.inv = (V * matrix(1/(u+s), nr=N, nc=N, byrow=T)) %*% t(V). An even faster way is t(t(V) * 1/(u+s)) %*% t(V). However, using C++ we can gain even more speed.
> microbenchmark(
"naive"= tmp1 <- Z %*% diag(z) %*% t(Z),
"Hadamard"= tmp2 <- (Z * matrix(z, nr=1000, nc=1000, byrow=T))%*%t(Z),
"smart_transpose"= tmp3 <- t(t(Z) * z) %*% t(Z),
"Rcpp"= tmp4 <- FastVdiag2(Z, z)
)
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## naive 1181.0930 1324.0387 1380.5610 1362.8738 1446.2571 1688.9346 100 c
## Hadamard 592.4630 677.1544 743.0098 708.9243 794.9454 1009.0804 100 b
## smart_transpose 607.3514 669.5603 724.9866 702.4397 760.3104 946.3763 100 b
## Rcpp 160.4689 173.4585 199.6922 181.9600 200.6700 432.0324 100 a
> all.equal(tmp1, tmp2, tmp3, tmp4)
## [1] TRUE
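To tie this back to Var.Y.inv, a small check with stand-ins for A and s (assuming, as here, that the diagonal offset is constant):

set.seed(1)
n <- 100
A <- crossprod(matrix(rnorm(n * n), n))  # stand-in for psi * H.mat.lam %*% H.mat.lam
s <- rep(2, n)                           # stand-in for the constant diagonal offset
es <- eigen(A, symmetric = TRUE)
u <- es$values; V <- es$vectors
Var.Y.inv <- t(t(V) * (1 / (u + s))) %*% t(V)
all.equal(Var.Y.inv, solve(A + diag(s)))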
There are probably a few more tweaks that can be made. For instance, we don't really need H.mat.lamsq itself, but tr(H.mat.lamsq %*% W.hat), where W.hat = Var.Y.inv + tcrossprod(w.hat), so can the matrix multiplication be avoided?
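One answer, at least for the trace itself: since both matrices are symmetric, tr(A %*% B) = sum(A * B), so the full product never needs to be formed. A small check with stand-ins:

set.seed(1)
n <- 5
A <- crossprod(matrix(rnorm(n * n), n))  # symmetric stand-in for H.mat.lamsq
W <- crossprod(matrix(rnorm(n * n), n))  # symmetric stand-in for W.hat
all.equal(sum(diag(A %*% W)), sum(A * W))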
In any case, the changes that have been implemented do bring some improvement in speed; in practice, more than 4 times faster. For the cats dataset:
> system.time(mod.new <- iprior(Hwt~., data=cats, lambda=c(10,10), psi=10, progress="none"))
## user system elapsed
## 6.573 1.096 7.631
> system.time(mod.old <- iprior.old(Hwt~., data=cats, lambda=c(10,10), psi=10, progress="none"))
## user system elapsed
## 33.979 2.754 36.565
> coef(mod.new)
## (Intercept) lambda1 lambda2 psi
## 10.630555556 -0.002466022 1.002298802 0.477464200
> mod.new$log.lik
## [1] -260.3357
> coef(mod.old)
## (Intercept) lambda1 lambda2 psi
## 10.630555556 -0.002465982 1.002274360 0.477481411
> mod.old$log.lik
## [1] -260.3357
And for the stackloss dataset:
> system.time(mod.new <- iprior(stack.loss ~ ., data=stackloss, lambda=1:3, psi=10, progress="none"))
## user system elapsed
## 0.226 0.016 0.244
> system.time(mod.old <- iprior.old(stack.loss ~ ., data=stackloss, lambda=1:3, psi=10, progress="none"))
## user system elapsed
## 1.588 0.051 1.629
> coef(mod.new)
## (Intercept) lambda1 lambda2 lambda3 psi
## 17.52380952 0.04082678 0.22269432 -0.01226573 0.10574022
> mod.new$log.lik
## [1] -56.34791
> coef(mod.old)
## (Intercept) lambda1 lambda2 lambda3 psi
## 17.52380952 0.04082316 0.22258619 -0.01226840 0.10576286
> mod.old$log.lik
## [1] -56.34791
The action item after this next update will be: Are there any algorithmic improvements that can be made, e.g. nice matrix tricks to avoid matrix multiplications?
After yet another round of tweaking, I managed to speed up the programme, which now works for parsimonious interactions. Consider the hsb.small dataset (see here). The timings show a greater-than-four-fold improvement:
> profvis(mod.new <- iprior(mathach ~ ses + schoolid + ses:schoolid, data=hsb.small,
+ lambda=c(0.5,0.5), psi=0.05, report.int=10))
## time taken: 26130 ms for 68 iterations
## 384 ms per iteration
> profvis(mod.old <- iprior.old(mathach ~ ses + schoolid + ses:schoolid, data=hsb.small,
+ lambda=c(0.5,0.5), psi=0.05, report.int=10))
## time taken: 108720 ms for 48 iterations
## 2265 ms per iteration
> coef(mod.new)
## (Intercept) lambda1 lambda2 psi
## 13.68325416 0.41779781 0.13230415 0.02804779
> mod.new$log.lik
## [1] -2137.799
> coef(mod.old)
## (Intercept) lambda1 lambda2 psi
## 13.7379830 0.4235630 0.1324561 0.0280220
> mod.old$log.lik
## [1] -2137.773
As a point to note, the old version actually iterated the intercept in the EM, instead of just using the mean of y as the MLE. Also, for parsimonious interactions I used to use the built-in R optimiser optim() to maximise the E-step objective (before I found closed-form expressions for lambda). Noticeably, the old version converged to a slightly higher log-likelihood, but I'm not sure how to account for this because so many things were changed, including calculating the log-likelihood via the eigendecomposition instead of built-in functions. This is not too worrying, because the values are quite similar. Finally, the old version took fewer iterations to converge, probably because the sequential EM updated Var.Y and w.hat as each parameter was updated, whereas the new one does not. Overall, the new version is still quicker even with slightly more iterations.
Just to compare how long it takes to fit this model with non-parsimonious interactions:
> profvis(mod <- iprior(mathach ~ ses + schoolid + ses:schoolid, data=hsb.small, parsm=F,
+ lambda=c(0.5,0.5,0.5^2), psi=0.05, report=10))
## time taken: 36270 ms for 96 iterations
## 378 ms per iteration
This is to illustrate that parsimonious does not necessarily mean 'fast': the parsimonious method may take more time per iteration than the non-parsimonious method, because its EM step is a bit heavier.
EDIT: Fixed in next commit.
> microbenchmark(fnH1(cattle$id), fnH1(cattle$group))
Unit: milliseconds
expr min lq mean median uq max neval cld
fnH1(cattle$id) 10.277347 11.54698 12.41087 12.00681 12.56541 18.70607 100 a
fnH1(cattle$group) 8.202576 10.26499 12.62646 11.61246 12.71803 103.49686 100 a
After some timing tests, these are the bottlenecks in the programme:

1. H.mat %*% H.mat: need some clever way to reduce these occurrences. This is doable I think. The plan is to do all possible matrix multiplications and store them, and use them as needed. This way they are not evaluated again and again.
2. Inverting Var.Y: the default R method is to use solve(Var.Y). A faster, albeit less stable, way is to use the Cholesky decomposition, chol2inv(chol(Var.Y)). This is 5 times faster than the regular method (see the small sketch at the end of this comment).
3. Other calculations involving the inverse of Var.Y: if I can find another method which does not depend on this inverse, then I can speed up the programme. For example, sometimes it is as simple as intercept = mean(Y).
4. dmvnorm() also takes a substantial amount of time, but I am not sure how much can be reduced by writing my own function. For reference, dmvnorm() takes 1/8th of the time of a solve(). Will look into this.

@Wicher2 can you comment on the above points, specifically 2-4? Thanks.
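For reference, a minimal sketch of the two inversion routes mentioned in point 2, using a stand-in matrix (no timings claimed here):

set.seed(1)
n <- 500
Var.Y <- crossprod(matrix(rnorm(n * n), n)) + diag(n)  # SPD stand-in for Var.Y
all.equal(solve(Var.Y), chol2inv(chol(Var.Y)))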