qsimulate-open / bagel

Brilliantly Advanced General Electronic-structure Library
GNU General Public License v3.0

Test suite hangs/segfaults on exceptions on Debian #85

Closed mbanck closed 7 years ago

mbanck commented 7 years ago

Once a test case throws an exception, the src/TestSuite executable hangs for me. If I run strace on the process I see

futex(0x7ff6b3a12b00, FUTEX_WAIT_PRIVATE, 2, NULL

and that's it. The first such issue happens in cuh2_ecp_hf, and I am not quite sure what the issue is; I get at most this much output:

../BAGEL cuh2_ecp_hf.json.orig 
    barriere                        

  * using 4 threads per process

  ===============================================================
    BAGEL - Freshly leavened quantum chemistry                   
  ===============================================================

  ERROR ON RANK 0: EXCEPTION RAISED:bad_function_call
terminate called after throwing an instance of 'std::bad_function_call'
  what():  bad_function_call
Aborted

Other test cases that hang/segfault typically print these lines:

    - linear dependency detected:    1 /    1    min eigenvalue:     0.0000e+00    max eigenvalue:     0.0000e+00
    * Using canonical orthogonalization due to linear dependency

  ERROR ON RANK 0: EXCEPTION RAISED:Too much linear dependency in guess vectors provided to DavidsonDiag; cannot obtain the requested number of states.
terminate called after throwing an instance of 'std::runtime_error'
  what():  Too much linear dependency in guess vectors provided to DavidsonDiag; cannot obtain the requested number of states.
Aborted

If I run BAGEL directly on the inputs in test/, I get a "normal" exit (although the exception is still thrown); if I run TestSuite, I get either a hang or a segfault.

Is this a known problem with the Boost test setup, and is there a way around this? I guess those linear dependencies are another problem, but having those tests fail cleanly would make things easier.
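As an illustration of what "fail cleanly" could mean here: the "terminate called after throwing an instance of ..." line is what libstdc++ prints when std::terminate runs with an active exception, typically because an exception escaped uncaught or was rethrown. A minimal sketch of the alternative, where the exception is turned into a failure status so that a caller can move on to the next job, is shown below. run_guarded is a hypothetical helper written for this note; it is not part of BAGEL or Boost.Test.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not BAGEL or Boost.Test code): run one job and
// convert any std::exception into a non-zero status instead of rethrowing.
int run_guarded(const std::function<void()>& job, std::string& message) {
  try {
    job();
    return 0;  // job finished without throwing
  } catch (const std::exception& e) {
    message = e.what();
    std::cerr << "  ERROR: EXCEPTION RAISED: " << message << std::endl;
    return 1;  // report the failure; the caller may continue with other jobs
  }
}
```

Calling run_guarded with a job that throws std::runtime_error("There is no basis specification") returns 1 and stores that text in message, rather than aborting the process.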

shiozaki commented 7 years ago

  ERROR ON RANK 0: EXCEPTION RAISED:bad_function_call
terminate called after throwing an instance of 'std::bad_function_call'
  what():  bad_function_call
Aborted

This does not originate from BAGEL (there is no place in BAGEL that this could happen). I suspect that something is wrong with linking or BOOST build itself (because boost test requires a library at runtime that is not required for the standard runs).
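For context, std::bad_function_call is the exception the C++ standard library throws when an empty std::function is invoked. The sketch below only demonstrates that mechanism; it makes no claim about where in the BAGEL, Boost, or BLAS stack the empty call occurred in this report. call_or_report is a hypothetical name used for illustration.

```cpp
#include <functional>
#include <string>

// Returns the what() text if invoking f throws std::bad_function_call,
// or an empty string if the call succeeds.
std::string call_or_report(const std::function<int()>& f) {
  try {
    f();
    return "";
  } catch (const std::bad_function_call& e) {
    return e.what();  // exact text is implementation-defined
  }
}
```

Passing a default-constructed std::function<int()> yields a non-empty message, while passing a callable such as a lambda yields an empty string.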

In principle we could limit the version of BOOST etc more strongly, but I want to keep it in this way because most people have been able to use it. Note I have no desire to make BAGEL workable with any platform/BOOST version/compiler.

Unless you have further evidence that this comes from BAGEL, I will close this with won't-fix label in a few days.

mbanck commented 7 years ago

Boost is 1.62, this is default Debian unstable, and it has been an issue for at least a year, I just hadn't reported it (I initially commented out all the test cases that made TestSuite hang).

Note that it fails on the development release of Ubuntu as well: https://launchpadlibrarian.net/300926882/buildlog_ubuntu-zesty-amd64.bagel_0.0~git20161215-1_BUILDING.txt.gz

What about

ERROR ON RANK 0: EXCEPTION RAISED:Too much linear dependency in guess vectors provided to DavidsonDiag; cannot obtain the requested number of states.
terminate called after throwing an instance of 'std::runtime_error'

That one clearly seems to come from BAGEL (the ERROR ON RANK from main.cc:238, the exception from util/math/davidson.h).

shiozaki commented 7 years ago

BAGEL works with Boost 1.62 at least on RedHat and CentOS as well as Mac OS. Probably something is different on Debian (and I don't have one at hand). BAGEL is written in a strictly standard compliant way, though.

Are you doing this for scientific purposes? If not, you should not be compiling BAGEL. This is still not a program for general use, which is to be released officially later this year.

Or please specify the line of code that is causing this error.

Otherwise I'll close it in a few days.

That one clearly seems to come from BAGEL (the ERROR ON RANK from main.cc:238, the exception from util/math/davidson.h).

This typically happens when there is a compatibility issue with the BLAS (or some other) libraries.

mbanck commented 7 years ago

So are you saying that TestSuite works on Red Hat even if an exception is thrown by BAGEL or that no exceptions are thrown on those platforms? Again, it also fails on Ubuntu, but maybe that is not an interesting target platform either.

I'm happy to look into the exceptions (not sure about the former, maybe it is related to ECPs?), but I'm using the reference implementations of BLAS (version 3.7.0, from http://www.netlib.org/blas/#_reference_blas_version_3_7_0).

shiozaki commented 7 years ago

No exceptions are thrown on other OS. The ECP test is rock solid (tested for >3 years without any failure on these environments).

shiozaki commented 7 years ago

PS: depending on the BLAS implementation, you may have to define

-DZDOT_RETURN

Please see src/util/f77.h. Most of the efficient BLAS libraries (including MKL) should be used without this, but I have no idea about the reference BLAS.
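Some background the thread leaves implicit: Fortran compilers differ in how a complex-valued function such as zdotc hands its result back to a C or C++ caller. Some return it by value; others write it through a hidden first argument. Calling with the wrong convention can produce garbage values or crashes. The following is a minimal sketch, assuming that ZDOT_RETURN selects the return-by-value convention; the actual declarations live in src/util/f77.h and are not reproduced here. zdotc_by_value, zdotc_by_pointer and dot_conj are hypothetical stand-ins, implemented in plain C++ so the example does not need a BLAS library.

```cpp
#include <complex>

using cplx = std::complex<double>;

// Stand-in for a Fortran zdotc that returns its result by value.
// Only positive increments are handled in this sketch.
cplx zdotc_by_value(const int* n, const cplx* x, const int* incx,
                    const cplx* y, const int* incy) {
  const int ix = *incx;
  const int iy = *incy;
  cplx sum(0.0, 0.0);
  for (int i = 0; i < *n; ++i) sum += std::conj(x[i * ix]) * y[i * iy];
  return sum;
}

// Stand-in for a Fortran zdotc that writes its result through a
// hidden first argument.
void zdotc_by_pointer(cplx* result, const int* n, const cplx* x,
                      const int* incx, const cplx* y, const int* incy) {
  *result = zdotc_by_value(n, x, incx, y, incy);
}

// Wrapper that hides the convention from callers. Which branch is
// correct depends on the BLAS build (assumption: the macro means
// "zdotc returns its value").
inline cplx dot_conj(int n, const cplx* x, const cplx* y) {
  const int one = 1;
#ifdef ZDOT_RETURN
  return zdotc_by_value(&n, x, &one, y, &one);
#else
  cplx out;
  zdotc_by_pointer(&out, &n, x, &one, y, &one);
  return out;
#endif
}
```

With real BLAS symbols, only one of the two declarations matches the library that is linked, which is why the choice has to be made at compile time.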

mbanck commented 7 years ago

Thanks, defining ZDOT_RETURN has helped considerably; this makes the cuh2_ecp_hf test case pass and the ZHARRISON test cases at least do not throw exceptions anymore (they still fail, though). I am still getting a segfault/exception in hf_svp_second_coulomb, however:

  === Dirac CASSCF iteration (svp) ===

   * Using the second-order algorithm

         0   0     -39238.10046430     2.37e+03      0.19

         res : 6.32e+04   lamb: 1.13e+00   eps : -4.85e+03   step: 9.98e-01    0.30
         res : 5.43e+04   lamb: 1.25e+00   eps : -5.21e+04   step: 1.00e+00    0.31
         res : 1.78e+04   lamb: 1.20e+00   eps : -8.67e+04   step: 1.00e+00    0.30
         res : 6.50e+03   lamb: 1.21e+00   eps : -8.90e+04   step: 1.00e+00    0.30
         res : 2.15e+03   lamb: 1.26e+00   eps : -8.83e+04   step: 1.01e+00    0.31
         res : 1.07e+03   lamb: 1.32e+00   eps : -8.70e+04   step: 9.99e-01    0.30
         res : 1.04e+03   lamb: 1.50e+00   eps : -8.41e+04   step: 9.98e-01    0.31
         res : 3.44e+02   lamb: 1.52e+00   eps : -8.38e+04   step: 9.99e-01    0.31
         res : 9.70e+01   lamb: 1.53e+00   eps : -8.37e+04   step: 9.99e-01    0.30
         res : 5.74e+01   lamb: 1.57e+00   eps : -8.31e+04   step: 9.99e-01    0.30
         res : 2.03e+01   lamb: 1.59e+00   eps : -8.29e+04   step: 9.94e-01    0.30
         res : 3.22e+01   lamb: 1.66e+00   eps : -8.21e+04   step: 9.64e-01    0.29
         res : 2.71e+01   lamb: 1.75e+00   eps : -8.12e+04   step: 9.89e-01    0.30
         res : 8.11e+00   lamb: 1.76e+00   eps : -8.11e+04   step: 9.89e-01    0.30
         res : 1.84e+00   lamb: 1.76e+00   eps : -8.11e+04   step: 9.90e-01    0.30
         res : 3.92e-01   lamb: 1.75e+00   eps : -8.12e+04   step: 9.86e-01    0.30
         res : 2.15e-01   lamb: 1.75e+00   eps : -8.12e+04   step: 9.88e-01    0.29
         res : 1.02e+00   lamb: 1.81e+00   eps : -8.07e+04   step: 9.45e-01    0.30
         res : 5.81e-01   lamb: 1.85e+00   eps : -8.03e+04   step: 9.21e-01    0.30
  ERROR ON RANK 0: EXCEPTION RAISED:max size reached in AugHess
terminate called after throwing an instance of 'std::runtime_error'
  what():  max size reached in AugHess
Aborted

The above is from running the input file manually with BAGEL.

In general, it seems the exception handling of Boost and/or TestSuite is problematic and it hangs/segfaults when it encounters one. I think this is an issue, as it makes testing BAGEL on new platforms somewhat unreliable.

shiozaki commented 7 years ago

Can you link MKL and try again? I want to isolate this problem to BLAS (my guess is that it all works). Just do at configure time

--enable-mkl

mbanck commented 7 years ago

I'm running this on a Debian project porting machine, so there's no MKL available there. I've now tried linking to openblas and ATLAS (or rather pointing libblas.so.3 at those two), and I get slightly different exceptions and/or pathological convergence behaviour, but the same test cases seem to throw exceptions.

However, back to "exception makes testsuite hang" part: If I remove the "basis" : "svp", line from test/hf_svp_hf.json, I can get a trivial exception if I run it with BAGEL:

  ERROR ON RANK 0: EXCEPTION RAISED:There is no basis specification
terminate called after throwing an instance of 'std::runtime_error'
  what():  There is no basis specification
Aborted

Now, if I run ../TestSuite --log_level=all --run_test=TEST_SCF, I get a failure immediately and a command prompt is returned:

    barriere                        

Running 1 test case...
Entering test module "Suites"
../src/testimpl/test_scf.cc(84): Entering test suite "TEST_SCF"
../src/testimpl/test_scf.cc(86): Entering test case "DF_HF"

*** 1 failure is detected in the test module "Suites"
$  echo $?
201

If I add a second test suite (e.g. --run_test=TEST_SCF,TEST_PROP), the latter isn't run anymore, whereas I think it would be better if it continued.

If I just run the TestSuite as-is, I get either a segfault or a hang:

../TestSuite --log_level=all 
    barriere                        

Running 27 test cases...
Entering test module "Suites"
../src/testimpl/test_scf.cc(84): Entering test suite "TEST_SCF"
../src/testimpl/test_scf.cc(86): Entering test case "DF_HF"
Segmentation fault
$ echo $?
139

So there seems to be some behaviour change between the two.
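The two exit statuses above can be read as follows. 201 is the value of boost::exit_test_failure in boost/cstdlib.hpp, so in the first run Boost.Test exited in a controlled way. 139 is 128 + 11, the shell's way of reporting that the process was terminated by signal 11, which is SIGSEGV on Linux. A small sketch of that decoding, assuming a POSIX shell and Linux signal numbers (1 to 64); describe_status is a hypothetical helper:

```cpp
#include <string>

// Interpret an exit status as reported by a POSIX shell's $?.
// Assumptions: Linux signal numbers (1..64) and the Boost exit codes
// defined in boost/cstdlib.hpp.
std::string describe_status(int status) {
  if (status == 0) return "success";
  if (status == 200) return "Boost exit_exception_failure";
  if (status == 201) return "Boost exit_test_failure";
  if (status > 128 && status <= 128 + 64)
    return "terminated by signal " + std::to_string(status - 128);
  return "exit code " + std::to_string(status);
}
```

The Boost codes are checked before the signal range because 200 and 201 are larger than 128 but do not correspond to a signal.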

shiozaki commented 7 years ago

Let us postpone this till the end of summer (I am teaching Gen Chem and have no bandwidth to work on this). We might state that MKL is required: first, it is free for academic users; second, it is faster than ATLAS; third, there are extensions (transpose etc.) that BAGEL uses extensively if linked.

PS: It's already stated in the Wiki - I don't think my users would try ATLAS.

For BLAS and Lapack, Intel MKL is highly recommended. Note that MKL on Linux comes with an optimized scalapack library (and it is now free for academic users!).

mbanck commented 7 years ago

Just as a further data point, if I compile with OpenBLAS and use Jeff's patch to define HAVE_ZGEMM3M instead of HAVE_MKL_H (not sure he hit every instance of it, though), then at least the hf_sto3g_relfci_{coulomb,breit,gaunt} test cases pass.

All the second-order CASSCF test cases still look totally off though, e.g. for hf_svp_second_coulomb.json:

     12        -99.79841319          0.00000014           0.10

    * SCF iteration converged.

    * Permanent dipole moment: Dirac Hartree-Fock
           (   -0.000000,    -0.000000,     1.363962) a.u.
  ---------------------------
      CASSCF calculation     
  ---------------------------

  *** Geometry (Relativistic) ***
       - 3-index ints post                         0.00
       - 3-index ints prep                         0.00
       - 3-index ints                              0.02
       - 3-index ints post                         0.00

       - Geometry relativistic (total)             0.02

    * nclosed  :      4
    * nact     :      2
    * nvirt    :     32
    * gaunt    : false
    * breit    : false
    * active space: 2 electrons in 2 orbitals
    * time-reversal symmetry will be assumed.
       - Coulomb: half trans                       0.01
       - Coulomb: metric multiply                  0.04
       - Coulomb: J operator                       0.00
       - Coulomb: K operator                       0.02
    * nstate   :      1

  === Dirac CASSCF iteration (svp) ===

   * Using the second-order algorithm

         0   0     -39171.27006450     2.37e+03      0.07

Assuming the last line is supposed to be the current variational CASSCF energy, it's completely off compared to the HF energy (-99.79841319) or the test case value (-99.90025083). From there it just slowly diverges and finally hits the max iteration exception.

jeffhammond commented 7 years ago

I hope to obviate ZDOT_RETURN in the near future. See https://github.com/jeffhammond/bagel/issues/2 for details. (I didn't want to create the issue in your repo because it's not a bug.)

mbanck commented 7 years ago

Ok, great. In the meantime, I am interested if anybody managed to get BAGEL built and tested without MKL.

shiozaki commented 7 years ago

On my laptop, it works with Apple's default BLAS/LAPACK. Many people use BAGEL, and I have never heard of your problem.

jeffhammond commented 7 years ago

I started testing on Linux with GCC6 + some free BLAS (can't remember) but had issues. I will try to figure them out and submit fixes as appropriate.

shiozaki commented 7 years ago

Closing for now (as we wrote in the Wiki that MKL is very strongly recommended). If this problem persists with MKL, let me know.