noahgolmant / pytorch-hessian-eigenthings

Efficient PyTorch Hessian eigendecomposition tools!
MIT License

tests power iteration methods by comparing against np.linalg.eig results #22

Closed Tron-x closed 5 years ago

Tron-x commented 5 years ago

[screenshot of test output]

I ran the test file that checks the accuracy of the power iteration methods by comparing against np.linalg.eig results for various random matrix configurations. The results seem unstable, and the error seems too big on some matrices. I'm curious what your observations are. To run the test I just did `python power_iter_tests.py` without any other configuration.

noahgolmant commented 5 years ago

If you change the hyperparameters (e.g. number of iterations, error threshold), you can reduce this error significantly. The default configuration is just faster. When I change these parameters I can get below 1% error.

Tron-x commented 5 years ago

I changed the number of iterations from 20 to 100 and the error threshold from 1e-04 to 1e-05.

[screenshot of test output]

The error on some matrices is still a little high and seems unstable. I'm curious what the reason is?

noahgolmant commented 5 years ago

It would be more accurate to average those errors over all the matrices to get the correct test performance. Each printout is a particular random wishart matrix. It's more informative to take the mean and variance of those error values. You should probably expect the variance to increase as the matrix dimensionality increases.

The performance should steadily improve as you increase the number of iterations / tighten the error threshold, and momentum accelerates it significantly.
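
Something like the sketch below is what I mean by averaging. It's plain numpy, not the repo's actual test harness, and the matrix sizes, iteration counts, and tolerance are illustrative assumptions: run power iteration on a batch of random Wishart matrices, compare the top eigenvalue against np.linalg.eigvalsh, and report the mean and variance of the relative error.

```python
import numpy as np

def power_iteration(A, num_iters=100, tol=1e-5):
    """Plain power iteration for the top eigenvalue of a symmetric matrix."""
    v = np.random.randn(A.shape[0])
    v /= np.linalg.norm(v)
    eigval = 0.0
    for _ in range(num_iters):
        w = A @ v
        new_eigval = v @ w               # Rayleigh quotient estimate
        v = w / np.linalg.norm(w)
        if abs(new_eigval - eigval) < tol * abs(new_eigval):
            break
        eigval = new_eigval
    return new_eigval

errors = []
for _ in range(30):                      # average over ~30 random matrices
    dim = 100
    X = np.random.randn(dim, 2 * dim)
    A = X @ X.T                          # random Wishart matrix (symmetric PSD)
    true_top = np.linalg.eigvalsh(A)[-1] # reference from a dense eigendecomposition
    est_top = power_iteration(A)
    errors.append(abs(est_top - true_top) / true_top)

print("mean relative error:    ", np.mean(errors))
print("variance of rel. error: ", np.var(errors))
```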

Tron-x commented 5 years ago

You are right. When I set the number of iterations to 10000 and the error threshold to 1e-05:

[screenshot of test output]

But when I tighten the error threshold further to 1e-06, with the iteration count still 10000, the result does not look good:

[screenshot of test output]

It seems that the threshold has no effect.

noahgolmant commented 5 years ago

What do you mean by not good? It looks like there are only two examples that exceed 5% error. The performance should be taken as the average error over these specific tests.

Tron-x commented 5 years ago

Sorry, I misread the result. I ran several more tests and the results mostly look good. What do you think causes the occasional bad errors? Does it mean that individual eigenvalues are not calculated correctly?

noahgolmant commented 5 years ago

No, it's my fault for not outputting the mean/variance of the tests to begin with. There will be some outliers because the procedure is stochastic. The error should be averaged over something like 30 tests. In those outliers, the eigenvalues are off by a factor of ~2, which can be good or bad depending on the magnitude/variance of the actual matrix spectrum.

Tron-x commented 5 years ago

When you call deflated_power_iteration() to compute the Hessian eigenvalues, you use HVPOperator, and it seems that on every step apply() calls prepare_grad(), which pulls new inputs from next(self.dataloader_iter). If I want to compute the Hessian eigenvalues at a particular moment in training, say the 100th iteration, then with your method the eigenvalue at that point is computed using other batches of data (what you call stochastic power iteration). Is this reasonable? My purpose is to reproduce Figure 1 of "Hessian-based Analysis of Large Batch Training and Robustness to Adversaries".

noahgolmant commented 5 years ago

Figure 1 in that paper computes the spectrum for the final models obtained at the end of training for various SGD batch sizes. To replicate the figure, just train several models using different SGD mini-batch sizes, then call compute_hessian_eigenthings on each model.

If you set the batch size parameter for stochastic power iteration to be the size of the training set, you should be able to exactly reproduce that figure since it would no longer be a stochastic estimate. You can see how they do this for a single eigenvalue in their HessianFlow repo that I link to in the acknowledgments section.
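
Roughly, the sketch below is what I have in mind. Here train_model and train_dataset are placeholders for your own training loop and data, num_eigenthings=20 is an illustrative choice, and the keyword arguments may differ between versions of this repo, so check the README for the exact call signature:

```python
import torch
from torch.utils.data import DataLoader
from hessian_eigenthings import compute_hessian_eigenthings

criterion = torch.nn.CrossEntropyLoss()
spectra = {}

for sgd_batch_size in [64, 256, 1024]:
    # train_model is a placeholder for your own training loop at this SGD batch size
    model = train_model(sgd_batch_size)

    # Use a loader over the full training set for the power iteration itself,
    # so the Hessian-vector products are no longer stochastic estimates.
    hessian_loader = DataLoader(train_dataset, batch_size=len(train_dataset))

    eigenvals, eigenvecs = compute_hessian_eigenthings(
        model, hessian_loader, criterion, num_eigenthings=20)
    spectra[sgd_batch_size] = eigenvals
```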

A figure that uses a small power iteration batch size might not look the exact same due to the variance in the eigenvalue estimates.

It might be the case that running power iteration on a single mini-batch produces a good estimate of the spectrum across the whole dataset, but this is different from stochastic power iteration, which uses a different batch at each step. I would need to test this out.

Tron-x commented 5 years ago

Thank you. But when I run main.py, the result is different with and without the GPU.

Without GPU:

[screenshot of output]

With GPU:

[screenshot of output]

I'm curious what your observations are?

noahgolmant commented 5 years ago

The random seed may differ between subsequent runs if you don't set the random seed yourself, and even if you set it, the behavior may differ between the GPU and CPU in torch (see this discussion: https://discuss.pytorch.org/t/are-gpu-and-cpu-random-seeds-independent/142).
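
If you want runs to be as comparable as possible, seed every generator explicitly, for example with something like the minimal sketch below. Note that even with identical seeds, the CPU and GPU draw from independent random streams and use different floating-point kernels, so exact agreement isn't guaranteed:

```python
import random
import numpy as np
import torch

def set_seed(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU generator
    torch.cuda.manual_seed_all(seed)  # seeds every GPU generator (a separate stream from the CPU one)
```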

Tron-x commented 5 years ago

It doesn't seem to be that. When I use the CPU only, the first random vector is:

[screenshot of vector values]

When I use the GPU only, the first random vector is:

[screenshot of vector values]

For the following iterations the vectors on CPU and GPU are also the same, but the results are different (the gap between GPU and CPU is huge):

[screenshot of CPU result]

[screenshot of GPU result]

The only reason I can think of is that the gradient matrix differs (stochastic power iteration), since it keeps changing. In principle, the results should not depend so strongly on randomness (CPU vs GPU). The algorithm (stochastic power iteration) seems to estimate the eigenvalues of the full dataset from the eigenvalues of subsamples (as many batches as iterations), and as a result it is not robust. Which result (GPU or CPU) is the right one?

noahgolmant commented 5 years ago

Have you tried changing the batch size / number of steps / momentum term? I am planning to add some tests for this in #18 based on issue #17. This is a procedure proposed in "Accelerated Stochastic Power Iteration" and the convergence depends on the variance of the Hessian estimate. So larger networks may require larger batch sizes. In Theorem 3 of this paper, the convergence condition (for our purposes) is given in terms of the spectral norm of the covariance of the Hessian estimate, so this is to be expected.

I would only be concerned if this fails when the batch size is set to the size of the dataloader, since that is vanilla power iteration which should be stable.
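
To illustrate the distinction, here is a simplified matrix-based sketch of vanilla vs. stochastic power iteration with a heavy-ball momentum term, in the spirit of that paper. It is not the repo's deflated_power_iteration (which uses mini-batch Hessian-vector products instead of explicit matrices, and whose update details may differ), and the noise scale and momentum value are illustrative:

```python
import numpy as np

def vanilla_power_iteration(A, num_iters=100):
    """Deterministic power iteration: the same matrix is applied at every step."""
    v = np.random.randn(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

def stochastic_power_iteration(sample_A, dim, num_iters=100, momentum=0.9):
    """Each step applies a fresh noisy estimate of the matrix (sample_A()) with a
    heavy-ball momentum term; convergence depends on the variance of the estimates."""
    v = np.random.randn(dim)
    v /= np.linalg.norm(v)
    v_prev = np.zeros(dim)
    for _ in range(num_iters):
        w = sample_A() @ v - momentum * v_prev  # analogous to a fresh mini-batch HVP
        norm = np.linalg.norm(w)
        v_prev = v / norm                       # rescale so the three-term recursion stays consistent
        v = w / norm
    return v

# Toy comparison: A_true plays the role of the full-dataset Hessian, and sample_A
# adds symmetric noise the way a mini-batch estimate would.
dim = 50
X = np.random.randn(dim, 2 * dim)
A_true = X @ X.T

def sample_A():
    noise = np.random.randn(dim, dim)
    return A_true + 0.5 * (noise + noise.T)

top_true = np.linalg.eigvalsh(A_true)[-1]
v_det = vanilla_power_iteration(A_true)
v_sto = stochastic_power_iteration(sample_A, dim)
print("true top eigenvalue:        ", top_true)
print("deterministic estimate:     ", v_det @ A_true @ v_det)
print("stochastic (noisy) estimate:", v_sto @ A_true @ v_sto)
```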

Tron-x commented 5 years ago

When I run this code https://github.com/amirgholami/HessianFlow (which can only compute the top eigenvalue), I compute the top eigenvalue and eigenvector after each epoch finishes and also call your compute_hessian_eigenthings(). The top eigenvalue and eigenvector from your method are very different from the HessianFlow result; the two methods have a big gap. Have you tested this? I'm curious what your observations are. Even when I fix the random vector, the two methods still have a huge gap.

noahgolmant commented 5 years ago

Are you using a mini-batch smaller than the dataset size? I will run some tests very soon to see if this issue persists when I use the full dataset.

I am also going to try some new techniques to reduce the variance of the eigenvalue estimates. This instability is definitely an important issue even if it makes sense theoretically so I'm going to try a few different techniques to fix it up (I've referenced these in #23)

I really appreciate you testing this repo by the way! I hope that it proves useful with your experiments, especially once these issues are addressed

Tron-x commented 5 years ago

I don't understand why we need to compute eigenvalues on the test set. I found that if we iterate over mini-batches drawn from the full test data each time, the two methods (get_eigen_full_dataset() and yours) stay within one order of magnitude of each other, although there is a gap. But if the eigenvalues are computed only on a fixed mini-batch, the difference between the two methods (get_eigen() and yours) is large (not within one order of magnitude). I also checked the accuracy of the power iteration in HessianFlow using MATLAB, and it is correct. Can you put the two pieces of code together (HessianFlow and yours) and test them? I'm not so sure about my own test. I think it's worthwhile to get this calculation right. Thank you~

noahgolmant commented 5 years ago

@Tron-x

I just wanted to update this to reference my new results in #23. I compared the results to np.linalg.eig decomposition for small models. I tested power iteration on the full dataset, power iteration on a fixed mini-batch, and stochastic power iteration using random mini-batches for each hessian-vector product evaluation.