paboyle / Grid

Data parallel C++ mathematical object library
GNU General Public License v2.0

Using ILDG checkpointer causes a crash during write #423

Open vmos1 opened 1 year ago

vmos1 commented 1 year ago

I'm testing an HMC workflow with the ILDG checkpointer. The sample code can be accessed here. The code runs well with the Nersc checkpointer, used as: TheHMC.Resources.LoadNerscCheckpointer(CPparams);

It fails when using the ILDG checkpointer as: TheHMC.Resources.LoadILDGCheckpointer(CPparams); The code runs until the first checkpoint, then I get a 'core' file and the following errors:


hmc_SDM: /grid_prefix/include/Grid/parallelIO/IldgIO.h:616: void Grid::IldgWriter::writeLimeIldgLFN(std::string &): Assertion `err>=0' failed.
srun: error: tioga13: tasks 1-7: Aborted (core dumped) 
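
The assertion above only says that some call inside writeLimeIldgLFN returned a negative status; it does not say which call, which rank, or why. As a rough sketch of how that could be made more informative while debugging, the helper below reports the status together with a label and a rank number instead of aborting. The names describe_status and check_status are hypothetical and are not part of Grid or of the lime library; the rank is assumed to be passed in by the caller.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper, not part of Grid or lime: builds a readable message
// from a status code, a label for the call that produced it, and the rank.
std::string describe_status(int err, const std::string &what, int rank) {
  std::ostringstream msg;
  msg << "rank " << rank << ": " << what << " returned " << err;
  if (err < 0) msg << " (failure)";
  return msg.str();
}

// Returns true when the status is non-negative. On failure it prints the
// message to stderr and returns false, so the caller decides how to stop.
bool check_status(int err, const std::string &what, int rank) {
  if (err >= 0) return true;
  std::cerr << describe_status(err, what, rank) << std::endl;
  return false;
}
```

Temporarily replacing the bare assert with a call such as check_status(err, "some label", rank) in a local debug build would at least show which ranks see the negative status before the job aborts.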

The last few lines of the output are :

Grid : Message : 267.892527 s : IOobject:  write 3328 bytes in 0.104757 s 0.0302971 MB/s 
Grid : Message : 267.892535 s : IOobject: endian and checksum overhead 0.000015 s
Grid : Message : 267.892537 s : RNG file checksum 4dc54934
Grid : Message : 267.892538 s : RNG file checksuma 8a569ac0
Grid : Message : 267.892539 s : RNG file checksumb 447f5161
Grid : Message : 267.892540 s : RNG state overhead 0.002102 s

I have replicated the error on the Crusher (ORNL) and Tioga (LLNL) AMD machines.

Building Grid: To build Grid, I use the standard procedure with lime, documented here.

paboyle commented 1 year ago

Can you please:
i) recompile with configure flags including --enable-debug,
ii) rerun the same volume on a single MPI rank, using a cold start if necessary,
iii) rerun it under gdb interactively.

The core dump should then be trapped, and you can type "backtrace" to find the line of code and, hopefully, the problem. You can print variables in the local file with "print" if necessary.

vmos1 commented 1 year ago

Recompiled with --enable-debug. Ran on a single MPI rank -> the code works fine. Repeating with 2 ranks causes the same failure as above. The rng file is written, but the issue occurs while writing the lat file, which is much bigger.

Using gdb on the core dump doesn't yield anything: "backtrace" gives "No stack".

Any idea why this would happen only for ILDG (not the NERSC format), and only on multiple ranks?
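
Since the failure appears only with two or more ranks and the core file gives "No stack", one commonly used alternative is to attach gdb to a live rank before the crash. The sketch below is a generic wait-for-debugger helper, not something Grid provides; the function name wait_for_debugger and the environment variable WAIT_FOR_GDB are made up for illustration, and it assumes a POSIX system.

```cpp
#include <cstdlib>
#include <iostream>
#include <unistd.h>

// Hypothetical helper, not part of Grid. Intended to be called once near the
// top of main(). It does nothing unless the (illustrative) environment
// variable WAIT_FOR_GDB is set, so normal runs are unaffected.
bool wait_for_debugger() {
  if (std::getenv("WAIT_FOR_GDB") == nullptr) return false;  // opt-in only

  char host[256];
  if (gethostname(host, sizeof(host)) != 0) host[0] = '\0';
  host[sizeof(host) - 1] = '\0';  // guarantee termination if truncated
  std::cerr << "PID " << getpid() << " on " << host
            << " waiting for debugger" << std::endl;

  // Spin until a debugger clears the flag. After attaching, move up to this
  // frame with "up", then run "set var holding = 0" followed by "continue".
  volatile int holding = 1;
  while (holding) sleep(1);
  return true;
}
```

With the variable set, each rank prints its host and PID and then waits. Attaching with gdb -p followed by the PID on the relevant node, releasing the loop, and letting the run reach the failed assertion should leave gdb stopped on the abort, where "backtrace" has a live stack to show.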