helmholtz-analytics / heat

Distributed tensors and Machine Learning framework with GPU and MPI acceleration in Python
https://heat.readthedocs.io/
MIT License

Lanczos init introducing NaNs into DNDarray before torch.eig call #655

Open coquelin77 opened 4 years ago

coquelin77 commented 4 years ago

**To Reproduce**

Steps to reproduce the behavior:

  1. Which module/class/function is affected?
    • spectral
  2. What are the circumstances under which the bug appears?
    • Fitting the iris dataset. It only happens sometimes, and only in the tests run on 7 processes. It fails on Travis but not on a local machine.
  3. What is the exact error message / erroneous behavior?

    
        V, T = ht.lanczos(L, self.n_lanczos, v0)
    
        # 4. Calculate and Sort Eigenvalues and Eigenvectors of tridiagonal matrix T
    >       eval, evec = torch.eig(T._DNDarray__array, eigenvectors=True)

E RuntimeError: invalid argument 1: A should not contain infs or NaNs at /pytorch/aten/src/TH/generic/THTensorLapack.cpp:208



**Version Info**
Possibly occurring due to the torch 1.6.0 release.

sebimarkgraf commented 3 years ago

I have run into exactly this issue multiple times. The bug does not seem to be in torch.eig itself; it occurs somewhere in the lanczos iterations.

Since this only happens with specific configurations and parts of my dataset, I suspect numerical instabilities to be the cause. A good way to communicate the problem to the user would be to check for inf/NaN after the lanczos iterations and raise an error or warning saying that numerical instabilities were encountered.
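A minimal sketch of such a check, in plain Python on a nested list so it runs without torch or heat. The names `first_non_finite` and `check_lanczos_output` are hypothetical and not part of heat; the wording of the message is only a suggestion.

```python
import math
import warnings


def first_non_finite(matrix):
    """Return (row, col) of the first inf/NaN entry in a nested list, or None."""
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if not math.isfinite(value):
                return (i, j)
    return None


def check_lanczos_output(T, strict=True):
    """Hypothetical guard to run on the tridiagonal matrix T before the eigensolver.

    Returns True if every entry is finite. Otherwise raises ArithmeticError
    (strict=True) or emits a RuntimeWarning and returns False (strict=False).
    """
    position = first_non_finite(T)
    if position is None:
        return True
    message = (
        f"Lanczos output contains a non-finite entry at index {position}; "
        "numerical instabilities were possibly encountered during the iterations."
    )
    if strict:
        raise ArithmeticError(message)
    warnings.warn(message, RuntimeWarning)
    return False
```

On an actual torch tensor the same test can be written as `torch.isfinite(t).all()`; where exactly such a guard would belong in heat's spectral code is left open here.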

Possible fix: changing the gamma of the RBF kernel helped in my case.
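To illustrate why gamma can matter numerically, here is a small sketch assuming the common RBF definition exp(-gamma * ||x - y||^2); heat's own kernel may be parameterized differently. The points and gamma values are made up for illustration. Whether this underflow is what actually produces the NaNs in this issue has not been established in this thread.

```python
import math


def rbf_similarity(x, y, gamma):
    """RBF kernel under the assumed definition exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two illustrative points (not taken from the iris dataset).
p, q = [0.0, 0.0], [3.0, 4.0]  # squared distance is 25

moderate = rbf_similarity(p, q, gamma=0.1)   # exp(-2.5), a usable similarity
extreme = rbf_similarity(p, q, gamma=100.0)  # exp(-2500) underflows to exactly 0.0

print(moderate, extreme)
```

With a very large gamma, similarities between distinct points collapse to exactly zero, which can leave the affinity matrix close to degenerate before it ever reaches the lanczos iterations.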

ClaudiaComito commented 2 years ago

Is this still a problem @coquelin77 ?

github-actions[bot] commented 1 year ago

Branch 655-Lanzcos_init_introducing_NaNs_into_DNDarray_before_torch_eig_call created!

mrfh92 commented 1 year ago

I cannot reproduce the error in

mpirun -np 7 python -m unittest -vf heat/cluster/tests/test_spectral.py

after removing the restriction to MPI.COMM_WORLD.size < 7...

mrfh92 commented 1 year ago

Since I cannot reproduce the error anymore, I opened a PR to remove the restriction of the tests to <7 processes.

Independent of whether this works, reviewed within #1109