svenreiche / Genesis-1.3-Version4

Time-dependent, 3D Code to simulate the amplification process of a Free-electron Laser.
GNU General Public License v3.0
55 stars 27 forks source link

[BUG] #106

Closed dcesar13 closed 1 year ago

dcesar13 commented 1 year ago

Describe the bug

When compiling with modern compiler I find an assertion error /src/main/Undulator.cpp #54 because vector z is smaller than vector marker. The vectors are defined in /src/Lattice/Lattice.cpp (~line #84) as und->z.resize(ndata);, and und->marker.resize(ndata+1);.

I can solve this by changing the loop in Undulator.cpp to run to size.marker()-1, which I think is probably correct since lastmark is set in Lattice.cpp. It is also possible to make z size ndata+1, since nothing calls on z.size().

To Reproduce Steps to reproduce the behavior:

  1. Compile with modern compiler
  2. Run genesis with external lattice file (or perhaps without, I haven't tried to check)
  3. Observe runtime error

Expected behavior I expected the code to run without runtime errors after sucssfully compiling.

Additional context I am using:

[dcesar@sdfiana006 test_sxr_lclsii]$ h5pcc --version gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-15) Copyright (C) 2018 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Weirdly, I don't get the assertion error if I don't use h5pcc (edit the CMakeLists.txt to not find h5pcc, though it still finds mpi and hdf5) then I don't get the error. So the assertion error must be the result of some setting.

h5pcc --showme gcc -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -L/usr/lib64/openmpi/lib /usr/lib64/openmpi/lib/libhdf5_hl.a /usr/lib64/openmpi/lib/libhdf5.a -Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -fPIC -Wl,-z,now -Wl,--as-needed -lsz -lz -ldl -lm -Wl,-rpath -Wl,/usr/lib64/openmpi/lib -I/usr/include/openmpi-x86_64 -pthread -Wl,-rpath -Wl,/usr/lib64/openmpi/lib -Wl,--enable-new-dtags -L/usr/lib64/openmpi/lib -lmpi

dcesar13 commented 1 year ago

Also, I encountered this problem by helping my compiler find h5pcc in CMakeLists.txt

`execute_process( COMMAND which h5pcc OUTPUT_VARIABLE H5PCC_PATH OUTPUT_STRIP_TRAILING_WHITESPACE )

if(EXISTS "${H5PCC_PATH}") set(CMAKE_CXX_COMPILER ${H5PCC_PATH}) endif() `

In benchmarks using the h5pcc (and its options) to compile, vs the default c++ the CMakeLists.txt grabs leads a factor of 2 in speed (28 seconds using 180 cores for a SXR example at LCLS, vs 62 seconds).

ZeugAusHH commented 1 year ago

Many thanks for reporting this issue.

What arguments did you use when running cmake?

Which version of the source code did you use? If available please provide the git commit id (it is reported at the beginning of the simulation run).

dcesar13 commented 1 year ago

Genesis reports the hash as this: Compile info: Compiled by dcesar at 2023-09-21 22:13:30 [UTC] from Git Commit ID: 5978fb9261fec9ce7d53267853d16beb7649be3c Which I look up: [dcesar@sdfiana007 build2]$ git show 5978fb9261fec9ce7d53267853d16beb7649be3c commit 5978fb9261fec9ce7d53267853d16beb7649be3c Merge: 5ee051b 589a12a

The relevant part is 589a12a, which I think is the current commit on this repo. I did merge my branch about 24 hours ago.

Besides my comment above about pointing to h5pcc, I used no extra arguments for cmake. The stdout from cmake is below (when it finds h5pcc ... if I comment out the h5pcc lines, then it gives different output, and the code will run fine, even without fixing the bug in length of the vectors, but the code will also be 2-3x slower on our cluster): -- The CXX compiler identification is GNU 8.5.0 -- The C compiler identification is GNU 8.5.0 -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /usr/bin/c++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Check for working C compiler: /usr/bin/cc - skipped -- Detecting C compile features -- Detecting C compile features - done -- h5pcc compiler wrapper found - No additional libraries needed -- Found PkgConfig: /usr/bin/pkg-config (found version "1.4.2") -- FFTW found -- Build Info: -- C++ Compiler: /usr/lib64/openmpi/bin/h5pcc -- C Compiler: /usr/bin/cc -- CMAKE_C_FLAGS: -- CMAKE_CXX_FLAGS: '-lstdc++' -- Used libraries: PkgConfig::FFTW -- Defining the macro FFTW in the source code -- Configuring done -- Generating done -- Build files have been written to: /sdf/group/ad/beamphysics/software/genesis4/milano_openmpi/genesis4/build2

ZeugAusHH commented 1 year ago

One remark concerning the performance differences you report for the cases with/without h5pcc.

Here we have CentOS 7.9.2009 and I typically use gcc 9.3 as compiler. To see the calls to the compiler while building, you have to enable the line set(CMAKE_VERBOSE_MAKEFILE ON) in CMakeLists.txt (and also disable the line 7 setting this flag to OFF).

When I use

cmake -DCMAKE_BUILD_TYPE=Release ..

the calls to GNU c++ include the -O3 option for optimization. When I use just cmake .. the c++ calls are without any -O-type argument.

Your h5pcc wrapper invokes gcc with -O2, so also optimized.

svenreiche commented 1 year ago

The difference in the execution is defined if CMAKE builds for a release or debug.

However I tested compiling with gcc, h5pcc, cc and clang and I cannot reproduce the assertion error (or runtime error). Can you point to the line in Undulator.cpp where this occurs?

ZeugAusHH commented 1 year ago

Dear Sven,

I was able to reproduce the issue (or at least a related issue?) using the source code from git commit ID 589a12ac9db3c4e8aa6d237b2dcc850f282cd995.

To locate it, I replaced in line 59 of src/Core/Undulator.cpp the z[i] by the corresponding z.at(i). This results in a core dump (for the test I did a time-independent simulation) since there is an access beyond the reserved array length.

I used the following CentOS modules to build my binary: gcc/9.3, openmpi/3.1.6, and hdf5/1.10.6. I ran cmake with the switch -DCMAKE_BUILD_TYPE=Debug.

Best Christoph

Remark: Edited first sentence for clarity.

svenreiche commented 1 year ago

The bug has been fixed with the new release (as well as latest commit of dev branch)