coquelin77 opened this issue 4 years ago.
I had exactly this issue multiple times.
The bug does not seem to be in torch.eig; it occurs somewhere in the Lanczos iterations.
Since this only happens with specific configurations and parts of my dataset, I suspect numerical instability is the cause.
A good way to communicate the problem to the user would be to check for inf/NaN values after the Lanczos iterations and raise an error or warning telling the user that numerical instabilities were encountered.
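The proposed check could look roughly like the sketch below. It is a minimal NumPy illustration, not heat's code: the plain Lanczos routine (no reorthogonalization), the helper check_finite and the exception class NumericalInstabilityError are all hypothetical names chosen for this example. NumPy's own divide/invalid warnings are silenced inside the loop so that the user receives one explicit, descriptive error (or warning) instead of an opaque failure later in the eigensolver.

```python
import warnings

import numpy as np


class NumericalInstabilityError(RuntimeError):
    """Hypothetical error type signalling non-finite Lanczos output."""


def lanczos(A, m, seed=0):
    """Plain Lanczos (no reorthogonalization) on a symmetric matrix A.

    Returns the m x m tridiagonal matrix T and the n x m basis V.
    Illustrative sketch only, not heat's implementation.
    """
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    v_prev = np.zeros(n)
    beta = 0.0
    V = np.zeros((n, m))
    T = np.zeros((m, m))
    # Silence NumPy's divide/invalid warnings; non-finite values are
    # reported once by the explicit check in check_finite instead.
    with np.errstate(divide="ignore", invalid="ignore"):
        for j in range(m):
            V[:, j] = v
            w = A @ v
            alpha = w @ v
            w = w - alpha * v - beta * v_prev
            T[j, j] = alpha
            if j + 1 < m:
                beta = np.linalg.norm(w)
                T[j, j + 1] = T[j + 1, j] = beta
                # A breakdown (beta == 0) turns the next basis vector into NaN/inf.
                v_prev, v = v, w / beta
    return T, V


def check_finite(T, V, strict=True):
    """Raise (strict=True) or warn if the Lanczos output contains inf/NaN."""
    if np.isfinite(T).all() and np.isfinite(V).all():
        return
    msg = (
        "Lanczos iterations produced inf/NaN values. This usually points to "
        "numerical instability, e.g. a badly scaled similarity matrix; "
        "consider adjusting the gamma of the RBF kernel."
    )
    if strict:
        raise NumericalInstabilityError(msg)
    warnings.warn(msg, RuntimeWarning)
```

Running the check right after the iterations, before the small tridiagonal eigenproblem is solved, keeps the error message close to its cause. A production version would probably also detect breakdown (beta close to 0) explicitly and stop early rather than only reporting it afterwards.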
Possible fix: changing the gamma of the RBF kernel helped in my case.
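One plausible mechanism for why gamma matters is sketched below in NumPy. It assumes (this is not confirmed for heat's spectral clustering) an RBF affinity W = exp(-gamma * ||x_i - x_j||^2) with the diagonal set to zero and a symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}. With a very large gamma every off-diagonal similarity underflows to 0, the degrees become 0, D^{-1/2} becomes inf, and NaNs end up in the matrix handed to the eigensolver. The function names are hypothetical.

```python
import numpy as np


def rbf_affinity(X, gamma):
    """RBF similarity matrix; the zeroed diagonal is an assumption of this sketch."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-gamma * sq_dist)  # entries underflow to 0.0 for large gamma
    np.fill_diagonal(W, 0.0)
    return W


def normalized_laplacian(W):
    """Symmetric normalized Laplacian I - D^-1/2 W D^-1/2."""
    d = W.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        d_inv_sqrt = 1.0 / np.sqrt(d)  # inf wherever a degree is 0
        L = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return L  # inf * 0 * inf gives NaN entries


X = np.array([[0.0], [1.0], [2.0]])
finite_ok = np.isfinite(normalized_laplacian(rbf_affinity(X, 1.0))).all()
finite_bad = np.isfinite(normalized_laplacian(rbf_affinity(X, 1e4))).all()
```

Here finite_ok is True, while finite_bad is False because exp(-1e4 * 1.0) is already 0.0 in double precision. Lowering gamma (or rescaling the data) keeps the similarities representable, which is consistent with the observation that changing gamma helped.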
Is this still a problem, @coquelin77?
I cannot reproduce the error with
mpirun -np 7 python -m unittest -vf heat/cluster/tests/test_spectral.py
after removing the restriction to MPI.COMM_WORLD.size < 7.
Since I can no longer reproduce the error, I opened a PR that removes the restriction limiting the tests to fewer than 7 processes.
Regardless of whether this works, it will be reviewed in #1109.
To Reproduce
Steps to reproduce the behavior:
What is the exact error message / erroneous behavior?
E RuntimeError: invalid argument 1: A should not contain infs or NaNs at /pytorch/aten/src/TH/generic/THTensorLapack.cpp:208