qsimulate-open / bagel

Brilliantly Advanced General Electronic-structure Library
GNU General Public License v3.0

Testcase he3_svp_asd-dmrg fails with MPICH 4.0.1 #242

Open mbanck opened 2 years ago

mbanck commented 2 years ago

First reported here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1006788

If I revert MPICH to 3.4.1, the testcase runs fine. If I use MPICH 4.0.1, it fails with

 ===== Starting sweeps =====

  o convergence threshold: 1.0000e-08
  iter state         sweep average     sweep range      dE average
  ERROR: EXCEPTION RAISED:  dsyev/pdsyevd failed in Matrix

mbanck commented 2 years ago

Hrm, I've now found #240, which is related. There you mentioned that this test case is bound to fail, but it sounded like that was due to numerical noise, not dsyev/pdsyevd. Is this a separate issue?

The Ubuntu testsuite results in that other issue are not very helpful; they just say FAILED. I've now changed the testsuite script to dump the last 50 lines of output for failed test cases.
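
For illustration, the "dump the last 50 lines on failure" idea could look roughly like the sketch below. This is not the actual Debian testsuite script; the helper names `tail_lines` and `run_and_report` are hypothetical, and it assumes a failed test case shows up as a non-zero exit status.

```python
import subprocess
import sys


def tail_lines(text, n=50):
    """Return the last n lines of text as a single string."""
    return "\n".join(text.splitlines()[-n:])


def run_and_report(cmd, n=50):
    """Run cmd; if it exits non-zero, print the last n lines of its output.

    Assumption: failure is signalled by the exit status. A real testsuite
    may instead compare the output against reference values.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so the error message is captured too
        text=True,
    )
    if proc.returncode != 0:
        print(f"FAILED: {' '.join(cmd)}", file=sys.stderr)
        print(tail_lines(proc.stdout, n), file=sys.stderr)
    return proc.returncode == 0
```

With something like this in place, a failing run would show the tail of the log, including the exception line, instead of a bare FAILED.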

shiozaki commented 2 years ago

Hi Michael, sorry for the very late response. This test is supposed to converge to incorrect results and is not supposed to throw errors. I am not exactly sure what this is without reproducing it myself. Thanks for reporting.

mbanck commented 2 years ago

It seems to be flaky: the test ran fine again on the next upload.

Not sure whether this can be tracked down definitively. I downgraded the corresponding Debian bug, but that's not really an option for GitHub.

I'll run another test build overnight and see what the current status is on my personal box.

AdrianBunk commented 2 years ago

New information from the Debian bug:

it seems related to the host that runs the test. I.e. the test fails on our beefy amd64 host (ci-worker13) with 64 cores and 256GB RAM, but seems to pass on the others.

The error on s390x is the same by the way (that has 10 cores and 32GB RAM).

mbanck commented 1 year ago

So two things seem to work around this:

  1. downgrading mpich from 4.0.x to 3.x
  2. setting BAGEL_NUM_THREAD=4 (it fails with 8 or 16)
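
To check the thread-count dependence more systematically, something like the following sketch could be used. Only the `BAGEL_NUM_THREAD` variable and the values 4, 8 and 16 come from this thread; the function name `run_with_threads` is hypothetical, and the command to run is left to the caller because the exact invocation of the test case is not given here. It again assumes a failure shows up as a non-zero exit status.

```python
import os
import subprocess


def run_with_threads(cmd, thread_counts, var="BAGEL_NUM_THREAD"):
    """Run cmd once per thread count and return {count: passed}.

    cmd is whatever command line runs the test case (an assumption left to
    the caller); var is the environment variable set for each run.
    """
    results = {}
    for n in thread_counts:
        # Copy the current environment and override only the thread variable.
        env = dict(os.environ, **{var: str(n)})
        proc = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        results[n] = proc.returncode == 0
    return results


# Hypothetical usage; the command and input file name are placeholders:
# print(run_with_threads(["BAGEL", "he3_svp_asd-dmrg.json"], [4, 8, 16]))
```

Since the failure seems flaky, a single pass or fail per thread count is weak evidence; repeating each setting several times would give a clearer picture.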