Closed: iamgroot42 closed this issue 4 months ago.
From what I have concluded, there isn't anything wrong with the implementation itself, just with the way the Hessian's eigenvalues are distributed. In most cases I tried, the condition number of the squared Hessian is very large, which means the eigenvalues of (I - H^2/V) stay extremely close to 1 (and, with limited floating-point precision, can end up being exactly 1). This isn't fixed by choosing an arbitrarily large V, or a "tight" V (obtained by computing the exact Hessian), since wherever the corresponding eigenvalues of H^2 are exceedingly small, the 1s on the identity matrix's diagonal still dominate.
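To make the effect concrete, here is a toy illustration (the spectrum below is made up, not taken from my MLP): with an ill-conditioned H, the eigenvalues of M = I - H^2/V cluster at 1, so the series terms barely shrink along those directions no matter how many are summed.

```python
import torch

# Hypothetical ill-conditioned Hessian spectrum (diagonal H for simplicity).
eigs = torch.tensor([1e-6, 1e-3, 1.0, 10.0], dtype=torch.float64)
V = 1.1 * (eigs ** 2).max()    # V must upper-bound the largest eigenvalue of H^2

m_eigs = 1 - eigs ** 2 / V     # eigenvalues of M = I - H^2/V
print(m_eigs)                  # ~[1.0000, 1.0000, 0.9909, 0.0909]
print(m_eigs ** 1000)          # ~[1.0000, 1.0000, 0.0001, 0.0000]
```

Even after 1000 terms, the components paired with the two smallest Hessian eigenvalues have not decayed at all.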
Hi - thanks for getting in touch, and sorry that we haven't got back to you sooner!
Thanks for flagging the typo in Algorithm 2 in our paper - we'll get that fixed.
We also noticed a substantial gap between the power-series approach and the true inverse-Hessian product, which matches our empirical results in Figure 10, where convergence is quite slow even in a lower-dimensional problem. The ill-conditioning you describe seems plausible, since we're aware of results showing that very small eigenvalues become significantly more prevalent in higher dimensions. If memory serves, we briefly investigated a Levenberg-Marquardt-like damping coefficient to mitigate this problem, but couldn't immediately derive a series incorporating it.
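For what it's worth, the damping idea amounts to replacing H^2 with H^2 + lam*I, which shifts every eigenvalue of H^2 up by lam and so bounds the eigenvalues of I - (H^2 + lam*I)/V away from 1. The product itself is cheap to form with two HVPs; a hypothetical sketch (hvp_fn and lam are illustrative names, not code from the repository):

```python
import torch

def damped_squared_hvp(hvp_fn, v: torch.Tensor, lam: float) -> torch.Tensor:
    """Return (H^2 + lam * I) @ v using two Hessian-vector products.

    Shifting the spectrum of H^2 up by lam restores geometric decay of the
    series terms; the unsolved part mentioned above is deriving matching
    series coefficients for |H|^{-1}, not forming this product.
    """
    return hvp_fn(hvp_fn(v)) + lam * v
```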
Ultimately, our strategy was to design a matrix transformation which predictably changes the Hessian's eigenvalues without having to explicitly decompose the Hessian, then design a (possibly approximate) implementation of that transformation which avoids any explicit representation of the Hessian. So we hope there are better ways of doing that than we've managed here, which would then give better behaviour! But as you say, small eigenvalues are a bit of a problem in the current approach.
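For reference, one concrete instance of that recipe is the standard binomial expansion (a sketch; the exact coefficient bookkeeping in Algorithm 2 may differ): writing $M = I - H^2/V$ with $V \geq \lambda_{\max}(H^2)$,

$$|H|^{-1} = (H^2)^{-1/2} = V^{-1/2} (I - M)^{-1/2} = V^{-1/2} \sum_{s=0}^{\infty} \binom{2s}{s} \frac{M^s}{4^s},$$

where each extra term costs two HVPs. An eigenvalue of $M$ at $1 - \epsilon$ contributes terms decaying like $s^{-1/2} (1 - \epsilon)^s$, so for $\epsilon \approx 0$ (i.e. a tiny Hessian eigenvalue) the series crawls, exactly as described above.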
Thanks for the clarification and your detailed response!
Hello,
First off, thanks a lot for such well-documented and neat code. I am extending this to run in PyTorch and ran into no hiccups at any point, thanks to the well-written code and the clearly explained algorithms in the paper!
I implemented and tested this iHVP method on a simple 2-layer MLP and am comparing it with the exact iHVP to see how close the result is, but it seems to be quite off, even when run for a large number (1000) of iterations.
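For a model this small, the exact reference can be formed directly; a hypothetical version of that check (loss_fn and flat_params are illustrative names), applying |H|^{-1} via an eigendecomposition:

```python
import torch

# Materialise the full Hessian (feasible only for tiny models), then apply
# |H|^{-1} v exactly through its eigendecomposition.
H = torch.autograd.functional.hessian(loss_fn, flat_params)
evals, evecs = torch.linalg.eigh(H)
exact_ihvp = evecs @ ((evecs.T @ v) / evals.abs())
```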
For reference, here is my implementation in PyTorch:
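(The original snippet is not reproduced here; below is a minimal stand-in implementing the textbook binomial series for (H^2)^{-1/2} v with M = I - H^2/V, using the coefficient recurrence c_s = c_{s-1} (2s-1)/(2s). It is not the epsilon recursion of Algorithm 2, and the names are illustrative.)

```python
import torch

def series_abs_ihvp(hvp_fn, v: torch.Tensor, V: float,
                    num_iters: int = 1000) -> torch.Tensor:
    """Approximate |H|^{-1} v = (H^2)^{-1/2} v via the binomial series
    V^{-1/2} * sum_s binom(2s, s) / 4^s * M^s v, where M = I - H^2/V.

    hvp_fn(v) must return H @ v; each iteration therefore costs two HVPs.
    """
    coeff = 1.0        # binom(2s, s) / 4^s, updated by the recurrence below
    term = v.clone()   # holds M^s v
    total = v.clone()  # running sum of coeff * M^s v, starting at s = 0
    for s in range(1, num_iters):
        term = term - hvp_fn(hvp_fn(term)) / V
        coeff *= (2 * s - 1) / (2 * s)
        total = total + coeff * term
    return total / V ** 0.5
```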
Here, hvp_module.hvp is a utility function that returns the HVP (and is verifiably correct; I tested it independently against the exact Hessian as well; a sketch of such a utility is at the end of this post). I also tried replacing the actual Hessian with an artificially constructed positive semi-definite matrix, but even then the computed answer is off by a lot.

On that note, the implementation of Algorithm 2 uses epsilon[m+1, s-2] correctly (https://github.com/rmclarke/SeriesOfHessianVectorProducts/blob/21db435dd7d6f439b5c5174fe72d9916db597e75/util.py#L131), but the paper references this term incorrectly.
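For completeness, here is a sketch of what such an HVP utility can look like via double backward (a generic stand-in, not the actual hvp_module code; loss_fn is an illustrative name):

```python
import torch

def hvp(loss_fn, params: torch.Tensor, vec: torch.Tensor) -> torch.Tensor:
    """Hessian-vector product H @ vec without materialising H.

    Differentiates the loss once with a retained graph, then differentiates
    the inner product grad(loss) . vec a second time.
    """
    params = params.detach().requires_grad_(True)
    grad = torch.autograd.grad(loss_fn(params), params, create_graph=True)[0]
    return torch.autograd.grad(grad, params, grad_outputs=vec)[0]
```

PyTorch also ships torch.autograd.functional.hvp, which performs essentially this computation.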