svenreiche / Genesis-1.3-Version4

Time-dependent, 3D Code to simulate the amplification process of a Free-electron Laser.
GNU General Public License v3.0

[BUG] memory corruption caused by sporadic array writes with index=-1 in Collective.cpp #144

Open ZeugAusHH opened 7 months ago

ZeugAusHH commented 7 months ago

Some context

To put this into context, this is my simulation setup:

The issue

Investigating with the source code at commit id 1dfe13b showed that the function Collective::update sporadically accesses memory outside the allocated region: it writes into the arrays count and wakeint with idx=-1 (!), i.e. to the element just before the start of each array. The cause is that the computation of the variable idx in line 206 of Collective.cpp sometimes yields idx=-1. In these cases the argument to the floor function is negative, with a magnitude on the order of only 1E-11, so floor rounds it down to -1. This is likely the result of rounding errors, but I have not investigated it further.
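The effect can be reproduced in isolation. The following is only a minimal sketch: bin_index and its parameters are hypothetical stand-ins, since the actual expression on line 206 of Collective.cpp is not reproduced in this report. It shows that floor maps any negative argument, however small in magnitude, to -1.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the index computation; the real expression in
// Collective.cpp is not shown in this issue and may differ.
int bin_index(double offset, double binwidth) {
    assert(binwidth > 0.0);  // precondition of this sketch
    // std::floor rounds towards negative infinity, so an argument such as
    // -9.10904e-12 becomes -1.0, and the resulting index is -1.
    return static_cast<int>(std::floor(offset / binwidth));
}
```

Calling bin_index(-9.10904e-12, 1.0) with the floorarg value seen in the log returns -1, whereas bin_index(0.0, 1.0) returns 0. Using such an index on a raw array writes one element before its start.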

Workaround

To verify that this is actually the cause of the crashes, I added code to Collective.cpp that forces idx to zero whenever the argument to floor is negative by only a small amount. The following should of course not be merged; it is for demonstration only: commit https://github.com/ZeugAusHH/Genesis-1.3-Version4/commit/28dd1fbf95c5f0759bca37608a8377fd96d5fccd in branch https://github.com/ZeugAusHH/Genesis-1.3-Version4/tree/cl_20240112__crashinvestigation . With this workaround, the simulation runs complete successfully.
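For readers who do not want to open the commit, the idea of the workaround can be sketched as follows. This is not the code from the commit: the function name guarded_index, the tolerance parameter, and the counter are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>

// Illustrative sketch only (names and tolerance are hypothetical).
// If floorarg is negative but within 'tol' of zero, treat it as rounding
// noise, return index 0, and count the intervention.
int guarded_index(double floorarg, double tol, long& interventions) {
    assert(tol > 0.0);  // precondition of this sketch
    if (floorarg < 0.0 && floorarg > -tol) {
        ++interventions;
        return 0;
    }
    // All other arguments take the ordinary floor path unchanged.
    return static_cast<int>(std::floor(floorarg));
}
```

A counter of this kind is what a statistic such as "Total workaround interventions" would report. Arguments that are clearly negative (for example -1.5) are deliberately not clamped, so a genuine indexing error would still be visible.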

Solution

If possible, the code block should be reworked to use integer arithmetic instead of floating-point arithmetic wherever it can, to avoid these rounding issues. Moreover, one should consider migrating this code block to std::vector with element access via at(), which is bounds-checked (the performance penalty should be small).
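As a rough sketch of the std::vector suggestion, the snippet below shows how at() turns an out-of-range index into a catchable std::out_of_range exception instead of a silent write outside the array. The function accumulate_checked and its signature are hypothetical. Only the array names count and wakeint come from this report, and their element types are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: adds 'value' to wakeint[idx] and increments count[idx].
// Returns false instead of corrupting memory when idx is out of range.
bool accumulate_checked(std::vector<double>& wakeint, std::vector<int>& count,
                        int idx, double value) {
    assert(wakeint.size() == count.size());  // precondition of this sketch
    try {
        // idx = -1 converts to the largest std::size_t value, which at() rejects.
        const std::size_t i = static_cast<std::size_t>(idx);
        double& w = wakeint.at(i);  // both lookups happen before any write
        int& c = count.at(i);
        w += value;
        ++c;
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```

In production code one would more likely let the exception propagate, or log and abort, so that a bad index is reported at the place where it occurs rather than causing a crash later.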

ZeugAusHH commented 7 months ago

Here is a minimal working example demonstrating the issue.

All runs used the source code at git commit id 8df7998 in my branch https://github.com/ZeugAusHH/Genesis-1.3-Version4/tree/cl_20240115__crashinvestigation .

Here are a few examples of runs with different mpisize values (always running 32 processes per E5-2698 node):

mpisize=96 (3x32): Stats for Collective::update: Total workaround interventions: 300, total calls: 576

mpisize=128 (4x32): Stats for Collective::update: Total workaround interventions: 0, total calls: 768

mpisize=480 (15x32): Stats for Collective::update: Total workaround interventions: 1290, total calls: 2880

mpisize=512 (16x32): Stats for Collective::update: Total workaround interventions: 0, total calls: 3072

In the cases with a non-zero "Total workaround interventions" count, the unpatched code would perform array writes with idx=-1. These writes corrupt memory and likely lead to a crash eventually.

More output for mpisize=96:

=================================
G4 Input File:      mwe.in
MPI processes/node: 32
mpirun -N 32 --mca pml ucx ./genesis4 mwe.in
---------------------------------------------
GENESIS - Version 4.6.6 (beta) has started...
Compile info: Compiled by lechnerc at 2024-01-15 13:06:38 [UTC] from Git Commit ID: 8df7998fd9fdc70b6a5831de249a31263996f2c4
Starting Time: Tue Jan 16 11:42:35 2024

MPI-Comm Size: 96 nodes

Opened input file mwe.in
Parsing lattice file mwe.lat ...
Setting up time window of 13.8009 microns with 74208 sample points...
Generating input radiation field for HARM = 1 ...
Generating input radiation field for HARM = 3 ...
Adding profile with label: beamimported.gamma
Adding profile with label: beamimported.delgam
Adding profile with label: beamimported.current
Adding profile with label: beamimported.ex
Adding profile with label: beamimported.ey
Adding profile with label: beamimported.betax
Adding profile with label: beamimported.betay
Adding profile with label: beamimported.alphax
Adding profile with label: beamimported.alphay
Adding profile with label: beamimported.xcenter
Adding profile with label: beamimported.ycenter
Adding profile with label: beamimported.pxcenter
Adding profile with label: beamimported.pycenter
Adding profile with label: beamimported.bunch
Adding profile with label: beamimported.bunchphase
Adding profile with label: beamimported.emod
Adding profile with label: beamimported.emodphase
Generating input particle distribution...
Adding profile with label: prof_w
Generating wakefield potentials...

Running Core Simulation...
Time-dependent run with 74208 slices for a time window of 13.8009 microns
Initial analysis of electron beam and radiation field...
  Calculation: 0% done
workaround/hack in Collective.cpp: set idx=0, floorarg=-9.10904e-12
workaround/hack in Collective.cpp: set idx=0, floorarg=-9.10904e-12
workaround/hack in Collective.cpp: set idx=0, floorarg=-9.10904e-12
workaround/hack in Collective.cpp: set idx=0, floorarg=-9.10904e-12
workaround/hack in Collective.cpp: set idx=0, floorarg=-9.10904e-12

[..]

workaround/hack in Collective.cpp: set idx=0, floorarg=-4.55452e-12
workaround/hack in Collective.cpp: set idx=0, floorarg=-4.55452e-12
  Calculation: 80% done
Calculation terminated due to requested stop.
Sorting...
Global Sorting: Slicelength: 18.8496 - Send backwards for theta < 0 - Send forward for theta > 14570.7
Sorting: Transferring 934 particles to other nodes at iteration 1 (largest single transfer contains 21 particles)
Info: globalSort complete, largest transfer was 126 doubles
Writing output file...
   INFO: debug option to suppress writing of .out.h5 file is set

Core Simulation done.
End of Track
Stats for Collective::update: Total workaround interventions: 300, total calls: 576

generating semaphore file simout//sase-scu-h2.sema

Program is terminating...
Ending Time: Tue Jan 16 11:51:41 2024
Total Wall Clock Time: 545.49 seconds
-------------------------------------

20240116__mwe.tar.gz