sanshar / Dice

Tests failing #18

Open lorisercole opened 10 months ago

lorisercole commented 10 months ago

Default tests

I ran the tests on 4 MPI tasks (as hardcoded in the test scripts). They all pass except DQMC/multislater_ghf_gi, which aborts with:

...running DQMC/multislater_ghf_gi
DQMC: ./eigen/Eigen/src/Core/Block.h:146: Eigen::Block<XprType, BlockRows, BlockCols, InnerPanel>::Block(XprType&, Eigen::Index, Eigen::Index, Eigen::Index, Eigen::Index) [with XprType = Eigen::Map<Eigen::Matrix<double, -1, -1>, 0, Eigen::Stride<0, 0> >; int BlockRows = -1; int BlockCols = -1; bool InnerPanel = false; Eigen::Index = long int]: Assertion `startRow >= 0 && blockRows >= 0 && startRow <= xpr.rows() - blockRows && startCol >= 0 && blockCols >= 0 && startCol <= xpr.cols() - blockCols' failed.
[std-hb2-pg0-9:432066] *** Process received signal ***
[std-hb2-pg0-9:432066] Signal: Aborted (6)
[std-hb2-pg0-9:432066] Signal code:  (-6)
[std-hb2-pg0-9:432066] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7ff8d06d6520]
[std-hb2-pg0-9:432066] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7ff8d072a9fc]
[std-hb2-pg0-9:432066] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7ff8d06d6476]
[std-hb2-pg0-9:432066] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7ff8d06bc7f3]
[std-hb2-pg0-9:432066] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7ff8d06bc71b]
[std-hb2-pg0-9:432066] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7ff8d06cde96]
[std-hb2-pg0-9:432066] [ 6] DQMC(+0x15170e)[0x55dc226a270e]
[std-hb2-pg0-9:432066] [ 7] DQMC(+0x1e5e17)[0x55dc22736e17]
[std-hb2-pg0-9:432066] [ 8] DQMC(+0x2e548)[0x55dc2257f548]
[std-hb2-pg0-9:432066] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7ff8d06bdd90]
[std-hb2-pg0-9:432066] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7ff8d06bde40]
[std-hb2-pg0-9:432066] [11] DQMC(+0x2eb85)[0x55dc2257fb85]
[std-hb2-pg0-9:432066] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node DESKTOP-HSCRDM6 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Different number of MPI tasks

When I run the same tests on a different number of MPI tasks (for example 2, 6, 8, or 16), most or all of them fail. The energy difference with respect to the reference value is often of the order of 0.1 or 0.01, and at other times of the order of 0.001, in every case well above the set tolerances (1e-6 or 1e-7).
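To make the pass/fail criterion concrete, here is a minimal sketch of a tolerance check that also prints the energy difference. It is not the actual `testEnergy.py`; the function name `compare_energy` and the default tolerance are assumptions for illustration, and the two energies are copied from the hubbard_1x10 entry in the 8-task output.

```python
def compare_energy(e_test, e_ref, tol=1e-6):
    """Hypothetical helper: return (passed, diff) with diff = e_test - e_ref."""
    diff = e_test - e_ref
    return abs(diff) < tol, diff


# eTest and eRef copied from the hubbard_1x10 entry of the 8-task run.
passed, diff = compare_energy(-4.15624069, -4.17498229)

print("test passed" if passed else "test failed")
print(f"  eTest - eRef = {diff:.4e}")
```

With a tolerance of 1e-6, any deviation at the 1e-4 level or above fails, which covers every failing entry in the output.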

Example test output
## for practical purposes, I modified the `testEnergy.py` script to print the energy differences.
[lercole@std-hb2-pg0-9 tests]$ sed -i 's/mpirun -np 4/mpirun -np 8/g' run*sh
[lercole@std-hb2-pg0-9 tests]$ ./runTests.sh

Running Tests for VMC/GFMC/NEVPT2/FCIQMC/DQMC
======================================================

Running Tests for VMC
======================================================
...running hubbard_1x10
test failed
  eTest - eRef = 1.8742e-02
  eRef =  -4.17498229
  eTest =  -4.15624069
...running hubbard_1x10 ghf
test failed
  eTest - eRef = -3.5035e-02
  eRef =  -5.42482593
  eTest =  -5.45986127
...running hubbard_1x10 agp
test failed
  eTest - eRef = -2.5273e-02
  eRef =  -3.83213158
  eTest =  -3.85740436
...running hubbard_1x14
test failed
  eTest - eRef = 1.7223e-02
  eRef =  -10.78302325
  eTest =  -10.76580073
...running hubbard_1x22
test failed
  eTest - eRef = -1.2980e-02
  eRef =  -7.95585667
  eTest =  -7.96883698
...running hubbard_1x50
test failed
  eTest - eRef = 6.0798e-02
  eRef =  -38.80608576
  eTest =  -38.74528821
...running hubbard_1x6
test passed
  eTest - eRef = 0.0000e+00
...running hubbard_18_tilt uhf
test failed
  eTest - eRef = -9.6450e-02
  eRef =  -16.29056087
  eTest =  -16.38701062
...running h4 ghf complex
test failed
  eTest - eRef = -1.3367e-03
  eRef =  -2.14652278
  eTest =  -2.14785948
...running h4 pfaffian complex
test failed
  eTest - eRef = -2.5898e-03
  eRef =  -2.14267989
  eTest =  -2.1452697
...running h10 pfaffian
test failed
  eTest - eRef = -1.7162e-03
  eRef =  -5.19471654
  eTest =  -5.19643273
...running h20
test failed
  eTest - eRef = -9.8264e-03
  eRef =  -7.06168529
  eTest =  -7.07151173
...running h20 ghf
test failed
  eTest - eRef = 3.6540e-03
  eRef =  -10.28553672
  eTest =  -10.28188272
...running c2
test failed
  eTest - eRef = 2.7509e-03
  eRef =  -74.55055844
  eTest =  -74.5478075

Running Tests for GFMC
======================================================
...running hubbard_18_tilt uhf
test failed
  eTest - eRef = -9.6450e-02
  eRef =  -16.29056087
  eTest =  -16.38701062
...running hubbard_18_tilt gfmc
test failed
  eTest - eRef = -3.1810e-02
  eRef =  -16.88259405
  eTest =  -16.9144044

Running Tests for NEVPT2
======================================================
...running NEVPT2/n2_vdz/stoch
test failed
  eTest - eRef = 4.0540e-04
  eRef =  -109.1846287
  eTest =  -109.1842233
...running NEVPT2/n2_vdz/continue_norms PRINT
test failed
  eTest - eRef = 7.6540e-04
  eRef =  -109.183194
  eTest =  -109.1824286
...running NEVPT2/n2_vdz/continue_norms READ
test failed
  eTest - eRef = 7.6750e-04
  eRef =  -109.1825321
  eTest =  -109.1817646
...running NEVPT2/n2_vdz/exact_energies PRINT
test passed
  eTest - eRef = 0.0000e+00
...running NEVPT2/n2_vdz/exact_energies READ
test passed
  eTest - eRef = 0.0000e+00
...running NEVPT2/h4_631g/determ
test passed
  eTest - eRef = 0.0000e+00
...running NEVPT2/polyacetylene/stoch
test failed
  eTest - eRef = -1.6040e-04
  eRef =  -155.1823833
  eTest =  -155.1825437
...running NEVPT2/n2_vdz/single_perturber
determ test passed
stoch test passed

Running Tests for FCIQMC
======================================================
...running FCIQMC/He2
test failed
  eTest - eRef = 1.1937e-03
  eRef =  -5.762943084279232
  eTest =  -5.761749337716283
...running FCIQMC/He2_hb_uniform
test failed
  eTest - eRef = 1.2809e-04
  eRef =  -5.762337845140905
  eTest =  -5.762209752789398
...running FCIQMC/Ne_plateau
test failed
  eTest - eRef = 2.9082e-04
  eRef =  -128.70958249279593
  eTest =  -128.70929167168606
...running FCIQMC/Ne_initiator
test failed
  eTest - eRef = 1.2952e-02
  eRef =  -128.72525652060892
  eTest =  -128.71230417093656
...running FCIQMC/Ne_initiator_replica
test failed
eRef1 =  -128.70570937888726
eTest1 =  -128.71161315222963
eRef2 =  -128.70849149597655
eTest2 =  -128.70353835294057
eRefVar =  -128.54030751935989
eTestVar =  -128.37441541385843
eRefEN2 =  0.0
eTestEN2 =  0.0
...running FCIQMC/Ne_initiator_en2
test failed
eRef1 =  -128.70745111511025
eTest1 =  -128.704079503759
eRef2 =  -128.70728765356745
eTest2 =  -128.70076936736214
eRefVar =  -128.88505380643883
eTestVar =  -128.95491510237173
eRefEN2 =  -0.020670887982356057
eTestEN2 =  0.00579960323366854
...running FCIQMC/Ne_initiator_en2_ss
test failed
eRef1 =  -128.70928156824655
eTest1 =  -128.70948917426213
eRef2 =  -128.70965980981856
eTest2 =  -128.71001418429785
eRefVar =  -128.70575284782976
eTestVar =  -128.70536975709993
eRefEN2 =  0.0011977728994427303
eTestEN2 =  -0.004109782185213983
...running FCIQMC/water_vdz_hb
test failed
  eTest - eRef = -1.0011e-03
  eRef =  -76.24055896137513
  eTest =  -76.24156003663879

Running Tests for AFQMC
======================================================
...running DQMC/rhf_rhf
test failed
  eTest - eRef = 4.2476e-03
  eRef =  -76.121061333
  eTest =  -76.11681368910571
  wTest - wRef = 7.9938e+01
  wRef =  80.019399
  wTest =  159.9578402339277
...running DQMC/rhf_uhf
test failed
  eTest - eRef = -1.1286e-02
  eRef =  -5.3551361163
  eTest =  -5.366422420038034
  wTest - wRef = 7.9821e+01
  wRef =  79.869323
  wTest =  159.6899502970639
...running DQMC/uhf_rhf
test failed
  eTest - eRef = -5.5420e-03
  eRef =  -5.3694725281
  eTest =  -5.375014518745735
  wTest - wRef = 7.9929e+01
  wRef =  79.914801
  wTest =  159.8441082913229
...running DQMC/uhf_uhf
test failed
  eTest - eRef = 1.7012e-03
  eRef =  -75.687851842
  eTest =  -75.68615062549716
  wTest - wRef = 7.9981e+01
  wRef =  79.938639
  wTest =  159.919893886195
...running DQMC/multislater_rhf
test failed
  eTest - eRef = -1.5162e-03
  eRef =  -109.09511182
  eTest =  -109.0966280240538
  wTest - wRef = 1.9942e+01
  wRef =  19.984035
  wTest =  39.92619147592583
...running DQMC/multislater_uhf
test failed
  eTest - eRef = -3.4923e-03
  eRef =  -37.753007223
  eTest =  -37.75649954823818
  wTest - wRef = 7.9963e+01
  wRef =  79.787989
  wTest =  159.750495666826
...running DQMC/ghf_ghf_soc
test failed
  eTest - eRef = 4.7815e-02
  eRef =  -153.33796367
  eTest =  -153.2901486912503
  wTest - wRef = 1.9919e+01
  wRef =  19.870107
  wTest =  39.78864245446707
...running DQMC/uhf_uhf_ui
test failed
  eTest - eRef = 1.5279e-03
  eRef =  -3.0949887079
  eTest =  -3.093460783634393
  wTest - wRef = 8.0021e+01
  wRef =  80.039541
  wTest =  160.0602107374257
...running DQMC/multislater_uhf_ui
test failed
  eTest - eRef = 1.0254e-03
  eRef =  -75.683616692
  eTest =  -75.68259127061897
  wTest - wRef = 7.9847e+01
  wRef =  79.855562
  wTest =  159.7022658597608
...running DQMC/ghf_ghf_gi
test failed
  eTest - eRef = 2.5796e-03
  eRef =  -1.430762509
  eTest =  -1.428182921908405
  wTest - wRef = 7.9826e+01
  wRef =  79.864586
  wTest =  159.6908960563758
...running DQMC/multislater_ghf_gi
DQMC: /anfhome/spack/opt/spack/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placeholder__/__spack_path_placehold/linux-almalinux8-zen2/gcc-13.2.0/eigen-3.4.0-vhwiejcim3wl4uwfktlwdoxazb3ejmyl/include/eigen3/Eigen/src/Core/Block.h:146: Eigen::Block<XprType, BlockRows, BlockCols, InnerPanel>::Block(XprType&, Eigen::Index, Eigen::Index, Eigen::Index, Eigen::Index) [with XprType = Eigen::Map<Eigen::Matrix<double, -1, -1>, 0, Eigen::Stride<0, 0> >; int BlockRows = -1; int BlockCols = -1; bool InnerPanel = false; Eigen::Index = long int]: Assertion `startRow >= 0 && blockRows >= 0 && startRow <= xpr.rows() - blockRows && startCol >= 0 && blockCols >= 0 && startCol <= xpr.cols() - blockCols' failed.
[std-hb2-pg0-9:76043] *** Process received signal ***
[std-hb2-pg0-9:76043] Signal: Aborted (6)
[std-hb2-pg0-9:76043] Signal code:  (-6)
[std-hb2-pg0-9:76043] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x15245fd0bcf0]
[std-hb2-pg0-9:76043] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x15245f982acf]
[std-hb2-pg0-9:76043] [ 2] /lib64/libc.so.6(abort+0x127)[0x15245f955ea5]
[std-hb2-pg0-9:76043] [ 3] /lib64/libc.so.6(+0x21d79)[0x15245f955d79]
[std-hb2-pg0-9:76043] [ 4] /lib64/libc.so.6(+0x47426)[0x15245f97b426]
[std-hb2-pg0-9:76043] [ 5] DQMC[0x5361ea]
[std-hb2-pg0-9:76043] [ 6] DQMC[0x5caa01]
[std-hb2-pg0-9:76043] [ 7] DQMC[0x41b7f0]
[std-hb2-pg0-9:76043] [ 8] /lib64/libc.so.6(__libc_start_main+0xe5)[0x15245f96ed85]
[std-hb2-pg0-9:76043] [ 9] DQMC[0x41bbfe]
[std-hb2-pg0-9:76043] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node std-hb2-pg0-9 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/anfhome/a-lercole/src/Dice/tests/DQMC/multislater_ghf_gi/../../testEnergy.py", line 92, in <module>
    fh = open('samples.dat', 'r')
FileNotFoundError: [Errno 2] No such file or directory: 'samples.dat'

I have tried building Dice with GCC 13.2, GCC 11.4, and ICC 2021.10, and I always get these inconsistencies. I am linking it with boost@1.82, hdf5@1.14.1, MKL 2023.2, and OpenMPI.

Have you seen this behavior before, and do you understand where these differences may come from? Or are they within the expected statistical fluctuations due to the stochastic nature of the method? Thanks

ankit76 commented 10 months ago

@xubo-wang should know about the ghf test; I think it has been failing for a while.

About the number of tasks: the convention in our code is that the sampling input options are per task, so the total number of samples increases with the number of tasks. As a result, the tests only reproduce the reference values when run with four tasks.
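The per-task convention can be checked against the DQMC output in the report, as a rough sketch. The helper `total_samples` and the per-task count of 100 are hypothetical; the weight pairs are copied from the 8-task run, and treating the reported weight as a total over all tasks is an assumption.

```python
REF_TASKS = 4   # task count hardcoded in the test scripts
RUN_TASKS = 8   # task count used in the reported run


def total_samples(samples_per_task, n_tasks):
    """Hypothetical helper: total samples when the input option is per task."""
    return samples_per_task * n_tasks


# (wRef, wTest) pairs copied from the 8-task DQMC output.
weights = {
    "DQMC/rhf_rhf": (80.019399, 159.9578402339277),
    "DQMC/multislater_rhf": (19.984035, 39.92619147592583),
    "DQMC/ghf_ghf_soc": (19.870107, 39.78864245446707),
}

expected_ratio = RUN_TASKS / REF_TASKS
ratios = {name: w_test / w_ref for name, (w_ref, w_test) in weights.items()}

for name, ratio in ratios.items():
    print(f"{name}: wTest / wRef = {ratio:.4f} (expected about {expected_ratio:.1f})")

# Illustrative per-task sample count, not taken from any input file.
print(total_samples(100, RUN_TASKS) / total_samples(100, REF_TASKS))
```

The weight ratios come out close to 8/4 = 2, which is consistent with the total sample count doubling when the task count doubles. The energies themselves would still differ from the four-task references, since a different set of samples is drawn.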