ghb24 / NECI_STABLE

Standalone NECI codebase designed for FCIQMC and other stochastic quantum chemistry methods.
GNU General Public License v3.0
44 stars 18 forks source link

back_spawn tests failing with SEGFAULT #18

Open boegel opened 1 week ago

boegel commented 1 week ago

When building the latest version of NECI (commit 558e88c5ae6c30d0505a9badbc69111be0866ba1) with GCC 12.3.0, we're consistently seeing segfaults in the tests, even when a lot of memory is available on the system (~185GB, with 36 cores).

94% tests passed, 6 tests failed out of 96

Total Test time (real) = 372.92 sec

The following tests FAILED:
         22 - test_neci_back_spawn_excit_gen (SEGFAULT)
         23 - test_kneci_back_spawn_excit_gen (SEGFAULT)
         24 - test_dneci_back_spawn_excit_gen (SEGFAULT)
         25 - test_mneci_back_spawn_excit_gen (SEGFAULT)
         26 - test_kdneci_back_spawn_excit_gen (SEGFAULT)
         27 - test_kmneci_back_spawn_excit_gen (SEGFAULT)

More details on test_neci_back_spawn_excit_gen below incl. backtrace, the actual full error is "Program received signal SIGSEGV: Segmentation fault - invalid memory reference." (other tests fail in a very similar way, and always in back_spawn.F90:593):

``` Start 22: test_neci_back_spawn_excit_gen 22/96 Test #22: test_neci_back_spawn_excit_gen .............***Exception: SegFault 0.21 sec Test module initialized . : successful assert, F : failed assert testing: calc_pgen_back_spawn_ueg with necessary global data: nel: 2 nOccBeta: 1 nOccAlpha: 1 tUEG: T tNoFailAb: F projedet: 1 2 dSFMT_init() niftot: 2 n_int: 8 set_flag() get_initiator_flag() flag: 1 nBasis: 4 ......... testing: calc_pgen_back_spawn_hubbard testing: calc_pgen_back_spawn_hubbard with necessary global data: nel: 2 nOccBeta: 1 nOccAlpha: 1 tHub: T projedet: 1 2 dSFMT_init() niftot: 2 n_int: 8 set_flag() get_initiator_flag() IlutBits%ind_flag: 1 nBasis: 4 ....... testing: calc_pgen_back_spawn_ueg with necessary global data: nel: 2 nBasis: 4 niftot: 1 encodebitdet() dSFMT_init ...... Program received signal SIGSEGV: Segmentation fault - invalid memory reference. Backtrace for this error: #0 0x147aa063e6ef in ??? #1 0x420ebe in __back_spawn_MOD_pick_occupied_orbital_ueg at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/src/back_spawn.F90:593 #2 0x425a68 in __back_spawn_excit_gen_MOD_calc_pgen_back_spawn_ueg_new at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/src/back_spawn_excit_gen.F90:284 #3 0x407c4e in calc_pgen_back_spawn_ueg_new_test at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:487 #4 0x785d3d in __fruit_MOD_run_test_case_named_ at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/fruit_src/fruit.f90:909 #5 0x407108 in back_spawn_excit_gen_test_driver at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:29 #6 0x407108 in test_back_spawn_excit_gen at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:16 #7 0x407108 in main at /tmp/easybuild_build/NECI/20230620/foss-2023a/NECI_STABLE/unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90:5 ```

Are there any known problems when NECI is built with a recent compiler? Any suggestions on how to get to the bottom of this, could it be a bug in back_spawn.F90?

Although we've seen very similar problems before with an older version of NECI and GCC 11.3.0 (see https://github.com/easybuilders/easybuild-easyconfigs/issues/17164), we didn't observe these problems when using GCC 12.2.0 to build NECI commit 558e88c5ae6c30d0505a9badbc69111be0866ba1, which seems strange to me...

demetriovilardi commented 1 week ago

Thank you very much for taking the time for writing this. I tested NECI, commit https://github.com/ghb24/NECI_STABLE/commit/558e88c5ae6c30d0505a9badbc69111be0866ba1, by compiling it with GCC 12.3.0, OpenMPI 4.2.5, on both Intel and AMD hardwares. I could not reproduce the issue.

I compile the program by using: cmake -DCMAKE_Fortran_COMPILER=mpifort -DCMAKE_CXX_COMPILER=mpic++ -DCMAKE_C_COMPILER=mpicc /path/to/source

Could please tell me the following info?

boegel commented 6 days ago

@demetriovilardi Thank you for looking into this.

We saw the segfaults happen on both RHEL 8.8 and 9.4 systems, across a range of CPU types (different generations of Intel & AMD). OpenMPI version we're using for this is v4.1.5. gfortran version is 12.3.0 (we install our own GCC)

We build NECI via EasyBuild, which uses the following cmake command:

cmake  -DCMAKE_INSTALL_PREFIX=/software/NECI/20230620-foss-2023a -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DPYTHON_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DPython_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DPython3_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DENABLE_HDF5=ON

In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)

mcocdawc commented 6 days ago

In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)

Just a quick comment on this one.¹ NECI anyway uses -O3 -march=native for Release builds and -ftree-vectorize is implied with O3; is there a specific reason you want to avoid O3?

@demetriovilardi it seems like a good idea to also add -fno-math-errno to Release builds, isn't it?

¹ I changed groups and am not anymore one of the maintainers.

boegel commented 5 days ago

In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)

Just a quick comment on this one.¹ NECI anyway uses -O3 -march=native for Release builds and -ftree-vectorize is implied with O3; is there a specific reason you want to avoid O3?

This has grown historically I guess: -O2 used to be a lot more stable in terms of generating correct machine instructions than -O3, and at some point we noticed a significant benefit of enabling -ftree-vectorize. There may still be some compiler optimizations enabled with -O3 (and not in -O2) that could have a negative side effect, like on floating-point accuracy for example.

demetriovilardi commented 1 day ago

After analysing the test with valgrind, we noticed an issue about a missing initialisation of a variable inside the test back_spawn_excit_gen.F90, that caused a memory problem at the line you specified. This issue doesn't affect the main program and we will fix it soon on our internal repository.

boegel commented 1 day ago

@demetriovilardi Can you point us to the details of the fix, so we can apply a patch for this and have a test suite that passes in full?

demetriovilardi commented 1 day ago

Essentially, inside unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90, the array KPointToBasisFn is not initialised after allocation, in test_back_spawn_excit_gen.F90:66 and test_back_spawn_excit_gen.F90:405. One can add

KPointToBasisFn = -1

one line after allocation in both cases. After this change I get no further valgrind warnings or errors.

boegel commented 1 day ago

@PetrKralCZ Can you look into making a patch file for NECI based on this info, and see if that fixes the segfaults we were seeing in the test suite?