marekandreas / elpa

A scalable eigensolver for dense, symmetric (hermitian) matrices (fork of https://gitlab.mpcdf.mpg.de/elpa/elpa.git)

Eigenvector check: what does the check on errmax = 0 do? #50

Closed yizeyi18 closed 4 months ago

yizeyi18 commented 4 months ago

During make check of elpa-2023.11.001 on my PC, all complex EVP tests failed. The failure seems to come from an if statement in test/shared/test_check_correctness_template.F90, line 501:

500        if (nev .ge. 2) then
501          if (errmax .gt. tol_res .or. errmax .eq. 0.0_rk) then
502            status = 1
503          endif
504        else
505          if (errmax .gt. tol_res) then
506            status = 1
507          endif
508        endif

The check errmax .eq. 0.0_rk confuses me. What does this check do? Would a maximum error of exactly zero cause harm in some calculation? A similar check also appears in other files, such as line 450 of test/shared/test_analytic_template.F90, which suggests it was added on purpose.

EDIT: These checks seem to come from much older commits such as https://github.com/marekandreas/elpa/commit/b9bbba2f1672cb01aec8581aead2345be000b540, but there is no further explanation there. Would deleting them cause a bug?

marekandreas commented 4 months ago

We consider an errmax of exactly 0.0 (to machine precision) a bug, since this is normally never achieved for "normal" size matrices. For very small matrices lda<1000 one can get an errmax of 0.0, but normally it is at least 10e-15. Hence this check.
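To illustrate the reasoning, here is a minimal sketch in pure Python (not ELPA's Fortran test code). It assumes one hypothetical failure mode: a broken backend returns all-zero "eigenvectors", so the residual A v - lambda v is exactly 0 and a tolerance-only check would pass. The names matvec, residual_max and check, and the tolerance 1e-12, are made up for this example.

```python
# Illustrative sketch only: a pure-Python stand-in for the residual check,
# not ELPA's actual test code. All names and values here are hypothetical.

def matvec(a, v):
    """Multiply a dense matrix (given as a list of rows) by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def residual_max(a, eigenvalue, eigenvector):
    """Return max_i |(A v - lambda v)_i| for one claimed eigenpair."""
    av = matvec(a, eigenvector)
    return max(abs(x - eigenvalue * y) for x, y in zip(av, eigenvector))


def check(errmax, tol_res, nev):
    """Mirror the branch quoted in the issue: 1 means failure, 0 means pass."""
    if nev >= 2:
        # An error of exactly 0.0 is treated as suspicious, not as success.
        return 1 if (errmax > tol_res or errmax == 0.0) else 0
    return 1 if errmax > tol_res else 0


if __name__ == "__main__":
    a = [[2.0, 1.0], [1.0, 3.0]]
    tol_res = 1e-12  # assumed tolerance, chosen only for this example

    # A "solver" that silently returned zeros: the residual is exactly 0.0.
    bogus = residual_max(a, 1.5, [0.0, 0.0])
    print("residual for zero vector:", bogus)
    print("tolerance-only check passes:", not (bogus > tol_res))
    print("status with the extra == 0 test:", check(bogus, tol_res, nev=2))
```

With a tolerance-only comparison the zero vector would slip through as a perfect result; the extra errmax == 0.0 test flags it. Whether this is what happened in any particular failing run is not something the sketch can tell you.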

yizeyi18 commented 4 months ago

> We consider an errmax of exactly 0.0 (to machine precision) a bug, since this is normally never achieved for "normal" size matrices. For very small matrices lda<1000 one can get an errmax of 0.0, but normally it is at least 10e-15. Hence this check.

So an errmax of exactly 0 is normally caused by a computation error, rather than by the computation genuinely reaching zero? That sounds reasonable. I will recheck the test computation workflow.

EDIT: I changed the linked BLAS from an OpenBLAS compiled with aocc-4.0 to an OpenBLAS compiled with gcc-13.1.0, and all the error = 0 results disappeared, together with an OpenBLAS segmentation fault. It seems the problem comes from the AOCC-built BLAS and is not related to ELPA. Thank you!