Have you tried changing the batch size? I added a utility to allow for batch sizes larger than what fits on the GPU at once.
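For context, the idea is essentially to average Hessian-vector products over several mini-batches, gradient-accumulation style. A minimal sketch, assuming a standard PyTorch model and loss (the function and argument names here are illustrative, not the repo's actual API):

```python
import torch

def accumulated_hvp(model, criterion, batches, vec):
    # Illustrative sketch: average Hessian-vector products over several
    # mini-batches so the effective batch size can exceed GPU memory.
    # `batches` is a list of (inputs, targets); `vec` is a list of tensors
    # matching the shapes of model.parameters(). Names are hypothetical.
    params = [p for p in model.parameters() if p.requires_grad]
    hvp = [torch.zeros_like(p) for p in params]
    for inputs, targets in batches:
        loss = criterion(model(inputs), targets)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Hessian-vector product via the gradient of <grad, vec>
        dot = sum((g * v).sum() for g, v in zip(grads, vec))
        chunk = torch.autograd.grad(dot, params)
        for h, c in zip(hvp, chunk):
            h.add_(c, alpha=1.0 / len(batches))
    return hvp
```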
Adding a momentum term to the power iteration has also helped me. Have you tried that? A sketch of what I mean is below.
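By "momentum" I mean the update v_{t+1} = A v_t - beta * v_{t-1} followed by rescaling. A small deterministic sketch of the idea (beta is a tunable coefficient; this is not the repo's exact implementation):

```python
import numpy as np

def power_iteration_momentum(A, beta, steps=100, seed=0):
    # Sketch of power iteration with a momentum term:
    # v_{t+1} = A v_t - beta * v_{t-1}, with both iterates rescaled by
    # the same factor to keep the recurrence bounded.
    rng = np.random.default_rng(seed)
    v_prev = np.zeros(A.shape[0])
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(steps):
        v_next = A @ v - beta * v_prev
        norm = np.linalg.norm(v_next)
        v_prev, v = v / norm, v_next / norm
    return v @ A @ v  # Rayleigh quotient as the eigenvalue estimate
```

If I recall the accelerated power iteration literature correctly, the theoretically optimal choice in the deterministic case is beta = lambda_2^2 / 4; in the stochastic setting it mostly acts as damping, so treat it as a knob to tune.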
I will add some tests and plots for (a) small networks where I can compute the Hessian exactly with NumPy, and (b) convergence as a function of batch size vs. number of steps.
Intuitively, power iteration converges once you reach a fixed point of the operator T(v) = Av / ||Av||. Convergence is measured by the residual ||T(v_t) - v_t|| for the iterate v_t, which should form a Cauchy sequence. If the variance of v_t is too high, convergence may not be guaranteed, much like with SGD (I haven't done the math, though, so I could be wrong).
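Concretely, the stopping criterion I have in mind looks like the following deterministic sketch; with a stochastic estimate of A the residual is noisy and need not shrink monotonically:

```python
import numpy as np

def power_iteration_residual(A, tol=1e-6, max_steps=1000, seed=0):
    # Iterate T(v) = A v / ||A v|| and stop once the residual
    # ||T(v_t) - v_t|| falls below tol. The min over +/- v_next makes
    # the check robust to the sign flips that occur when the dominant
    # eigenvalue is negative.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    residual = np.inf
    for _ in range(max_steps):
        Av = A @ v
        v_next = Av / np.linalg.norm(Av)
        residual = min(np.linalg.norm(v_next - v),
                       np.linalg.norm(v_next + v))
        v = v_next
        if residual < tol:
            break
    return v @ A @ v, v, residual
```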
I haven't done more extensive testing and will likely not get to it soon. I am curious, though; maybe I'll do some more comprehensive studies at some point.
It seems like this is a broader problem with using "reasonable" batch sizes for stochastic power iteration. I will add some new tests and techniques to address this (#23).
In case you are curious, I recently ran tests on small networks in #23 to validate the technique against np.linalg.eig. I have yet to run ablation studies on accuracy as a function of number of steps, mini-batch size, and model size.
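The check is essentially the following, with a small random symmetric matrix standing in for the exact Hessian of a tiny network (I use np.linalg.eigvalsh in the sketch since the matrix is symmetric, but np.linalg.eig gives the same answer here):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((20, 20))
A = (M + M.T) / 2  # small symmetric stand-in for an exact Hessian

# Plain (non-stochastic) power iteration for the dominant eigenvalue.
v = rng.standard_normal(A.shape[0])
v /= np.linalg.norm(v)
for _ in range(200):
    v = A @ v
    v /= np.linalg.norm(v)

estimate = abs(v @ A @ v)
exact = np.max(np.abs(np.linalg.eigvalsh(A)))
print(f"power iteration: {estimate:.6f}   exact: {exact:.6f}")
```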
This does not appear to be an implementation issue. The variance can be reduced by increasing the batch size (or using the full dataset), increasing the number of steps, or switching to Lanczos. (Just started using this again, so finally closing the issue.)
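Regarding the Lanczos option: SciPy's eigsh accepts any matrix-free operator, so in practice the matvec would call into an autograd Hessian-vector product; here a dense matrix stands in for it (a sketch, not the repo's implementation):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = (M + M.T) / 2  # dense stand-in for a Hessian

# In practice matvec would compute a Hessian-vector product via autograd;
# here it is just a dense matrix multiply.
op = LinearOperator(A.shape, matvec=lambda v: A @ v, dtype=A.dtype)
largest = eigsh(op, k=1, which="LM", return_eigenvectors=False)
print("largest-magnitude eigenvalue:", largest[0])
```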
I'm playing around with this tool, and one thing I observed is that the estimate for the largest eigenvalue doesn't seem to converge to a fixed value as I take more steps. In fact, the traces for 5, 10, 15, 20, 30, 40, ..., 90 steps are rather noisy and differ from one another. Since the random sampling from the dataset is different each time, some differences are to be expected, but can we hope for convergence in 20 steps (the default) for small batch sizes on standard datasets? I could make some more detailed plots, but I'm curious what your observations are.