Open boegel opened 1 week ago
Thank you very much for taking the time for writing this. I tested NECI, commit https://github.com/ghb24/NECI_STABLE/commit/558e88c5ae6c30d0505a9badbc69111be0866ba1, by compiling it with GCC 12.3.0, OpenMPI 4.2.5, on both Intel and AMD hardwares. I could not reproduce the issue.
I compile the program by using:
cmake -DCMAKE_Fortran_COMPILER=mpifort -DCMAKE_CXX_COMPILER=mpic++ -DCMAKE_C_COMPILER=mpicc /path/to/source
Could please tell me the following info?
@demetriovilardi Thank you for looking into this.
We saw the segfaults happen on both RHEL 8.8 and 9.4 systems, across a range of CPU types (different generations of Intel & AMD). OpenMPI version we're using for this is v4.1.5. gfortran version is 12.3.0 (we install our own GCC)
We build NECI via EasyBuild, which uses the following cmake
command:
cmake -DCMAKE_INSTALL_PREFIX=/software/NECI/20230620-foss-2023a -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DPYTHON_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DPython_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DPython3_EXECUTABLE=/software/Python/3.11.3-GCCcore-12.3.0/bin/python -DENABLE_HDF5=ON
In addition, we set $CFLAGS
, $CXXFLAGS
and $F90FLAGS
to "-O2 -ftree-vectorize -march=native -fno-math-errno
" (and cmake
should pick up on this)
In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)
Just a quick comment on this one.¹ NECI anyway uses -O3 -march=native
for Release builds and -ftree-vectorize
is implied with O3
; is there a specific reason you want to avoid O3?
@demetriovilardi it seems like a good idea to also add -fno-math-errno
to Release builds, isn't it?
¹ I changed groups and am not anymore one of the maintainers.
In addition, we set $CFLAGS, $CXXFLAGS and $F90FLAGS to "-O2 -ftree-vectorize -march=native -fno-math-errno" (and cmake should pick up on this)
Just a quick comment on this one.¹ NECI anyway uses
-O3 -march=native
for Release builds and-ftree-vectorize
is implied withO3
; is there a specific reason you want to avoid O3?
This has grown historically I guess: -O2
used to be a lot more stable in terms of generating correct machine instructions than -O3
, and at some point we noticed a significant benefit of enabling -ftree-vectorize
.
There may still be some compiler optimizations enabled with -O3
(and not in -O2
) that could have a negative side effect, like on floating-point accuracy for example.
After analysing the test with valgrind, we noticed an issue about a missing initialisation of a variable inside the test back_spawn_excit_gen.F90
, that caused a memory problem at the line you specified. This issue doesn't affect the main program and we will fix it soon on our internal repository.
@demetriovilardi Can you point us to the details of the fix, so we can apply a patch for this and have a test suite that passes in full?
Essentially, inside unit_tests/back_spawn_excit_gen/test_back_spawn_excit_gen.F90
, the array KPointToBasisFn
is not initialised after allocation, in test_back_spawn_excit_gen.F90:66
and test_back_spawn_excit_gen.F90:405
. One can add
KPointToBasisFn = -1
one line after allocation in both cases. After this change I get no further valgrind warnings or errors.
@PetrKralCZ Can you look into making a patch file for NECI based on this info, and see if that fixes the segfaults we were seeing in the test suite?
When building the latest version of NECI (commit 558e88c5ae6c30d0505a9badbc69111be0866ba1) with GCC 12.3.0, we're consistently seeing segfaults in the tests, even when a lot of memory is available on the system (~185GB, with 36 cores).
More details on
test_neci_back_spawn_excit_gen
below incl. backtrace, the actual full error is "Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
" (other tests fail in a very similar way, and always inback_spawn.F90:593
):Are there any known problems when NECI is built with a recent compiler? Any suggestions on how to get to the bottom of this, could it be a bug in
back_spawn.F90
?Although we've seen very similar problems before with an older version of NECI and GCC 11.3.0 (see https://github.com/easybuilders/easybuild-easyconfigs/issues/17164), we didn't observe these problems when using GCC 12.2.0 to build NECI commit 558e88c5ae6c30d0505a9badbc69111be0866ba1, which seems strange to me...