Tests fail on Apple M1 - Githubissues

jjgoings commented 3 years ago

Compiling as usual, no apparent errors, but the tests fail miserably on Apple M1 with Python 3.9.

SCF does not converge or converges to very wrong answers, e.g. the sample-input.py case of STO-3G water:

E(SCF)    =  -92.762209945728 in 36 iterations
  Convergence:
    FPS-SPF  =  5.6769413580857e-09
    RMS(P)   =  5.74e-09
    dE(SCF)  =  1.82e-10
  Dipole X =  0.00000000
  Dipole Y =  -10.13611489
  Dipole Z =  -0.00000000
E(MP2) =  -92.81887498561838

jjgoings commented 3 years ago

I think it might be my NumPy failing during diagonalizations. Doing pytest --verbose --pyargs numpy

=============================================== short test summary info ================================================
FAILED core/tests/test_ufunc.py::TestUfuncGenericLoops::test_unary_PyUFunc_O_O_method_full[reciprocal] - AssertionErr...
FAILED linalg/tests/test_linalg.py::TestSVDHermitian::test_herm_cases - AssertionError: In test case: <LinalgCase: hc...
FAILED linalg/tests/test_linalg.py::TestSVDHermitian::test_generalized_herm_cases - AssertionError: In test case: <Li...
FAILED linalg/tests/test_linalg.py::TestPinvHermitian::test_herm_cases - AssertionError: In test case: <LinalgCase: h...
FAILED linalg/tests/test_linalg.py::TestPinvHermitian::test_generalized_herm_cases - AssertionError: In test case: <L...
FAILED linalg/tests/test_linalg.py::TestEighCases::test_herm_cases - AssertionError: In test case: <LinalgCase: hcsin...
FAILED linalg/tests/test_linalg.py::TestEighCases::test_generalized_herm_cases - AssertionError: In test case: <Linal...
============ 7 failed, 15420 passed, 218 skipped, 20 xfailed, 3 xpassed, 251 warnings in 164.95s (0:02:44) =============

So if my NumPy linalg.eigh is failing, it makes sense all the tests with SCF (and postSCF) would fail, but the general integral tests (such as testing Boys function and S, V, T integrals in tests/test0012.py and tests/test0013.py) pass.

shiozaki commented 3 years ago

Hey Josh - I don't know why I ended up here (I guess I clicked a few times from the FQE repo) but just two cents: numpy.eigh is a wrapper for dsyevd (divide-and-conquer), which is faster but inherently unstable. For example, the reference implementation states

The divide and conquer algorithm makes very mild assumptions about floating point arithmetic. It will work on machines with a guard digit in add/subtract, or on those binary machines without guard digits which subtract like the Cray X-MP, Cray Y-MP, Cray C-90, or Cray-2. It could conceivably fail on hexadecimal or decimal machines without guard digits, but we know of none.

To test what you are suspecting, you could try directly calling dsyev from scipy

eig, vec, _ = scipy.linalg.lapack.dsyev(data)

I believe the dsyev function uses the Housholder transformation and it is deterministic (and will not fail). I have encountered in some cases where the condition number is really bad and numpy.eigh fails.

jjgoings commented 3 years ago

@shiozaki Hey Toru - not sure how you got here either, but very glad you did! You are 100% correct. And since scipy's eigh doesn't default to divide-and-conquer, swapping numpy.linalg.eigh for scipy.linalg.eigh is an easy fix for now. Thanks for the insight!

ThomasMerkh commented 2 years ago

I ran into this same issue (M1, Python 3.9) using np.linalg.eigh, where switching to scipy.linalg.eigh fixed it.

Looking deeper into why, I found that on the M1, np.linalg.eigh is returning non-unit eigenvectors and this throws off subsequent computations. As such, I was able to fix my code and still use np.lingalg.eigh by explicitly normalizing the eigenvectors before using them.

Just figured I'd comment in case you are partial toward using np.linalg.eigh and your problem too depends on unit length eigenvectors. I may look deeper into why the heevd function (what np.linalg.eigh uses) gives different results on the M1 compared to any non-M1.

jjgoings commented 2 years ago

Interesting, thanks for your comment @ThomasMerkh!

I was chatting with someone about this, and it may be related to what Toru mentioned about not error handling correctly. If it somehow fails, then it won't normalize properly. Not sure what's up with M1 -- I'm pretty ignorant about what's different. But you seem to be getting correct eigenvalues, so not sure what accounts for this.

Not sure if it is related to https://github.com/numpy/numpy/issues/19472

jjgoings / McMurchie-Davidson

Tests fail on Apple M1 #20