Open vmos1 opened 1 year ago
Can you please
i) recompile with configure flags including --enable-debug
ii) rerun the same volume on a single MPI rank, using a cold start if necessary.
iii) rerun it under gdb interactively. The crash should then be trapped by the debugger, and you can type "backtrace"
to find the line of code and, hopefully, the problem. You can print local variables with "print".
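For reference, steps ii) and iii) can also be run non-interactively. The following is only a sketch: the binary name ./your_hmc_binary and the file name debug.gdb are placeholders that do not come from this thread, and the script prints the gdb command line rather than running it.

```shell
#!/bin/sh
# Sketch only: ./your_hmc_binary is a placeholder, not a name from this thread.
BINARY="./your_hmc_binary"

# gdb commands to execute in order: start the program, then, once it stops
# on the fatal signal, print the call stack and the local variables.
cat > debug.gdb <<'EOF'
run
backtrace
info locals
EOF

# Print the command line instead of executing it, so this script is safe to
# run anywhere. Append the usual program arguments after the binary name.
echo "gdb -batch -x debug.gdb --args $BINARY"
```

Running the printed command on a build configured with --enable-debug should give a backtrace with file names and line numbers, provided the crash reproduces on one rank.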
Recompiled with --enable-debug
Ran on a single MPI rank -> the code works fine. Repeating with 2 ranks causes the same failure as above.
The rng file is written, but the issue occurs while writing the lat file, which is much bigger.
Using gdb on the core dump doesn't yield anything: "backtrace" gives "No stack".
Any idea why this would happen only for the ILDG format (not NERSC), and only on multiple ranks?
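Since the failure only shows up with 2 ranks, one option is to run every rank under its own gdb and collect a backtrace per rank. This is a sketch under assumptions: mpirun is used as the launcher (Slurm or Flux machines would use their own launcher instead), and gdb_rank.sh and ./your_hmc_binary are placeholder names. Note also that gdb typically prints "No stack." when no process or core file is actually loaded, so it may be worth checking that the core is passed as gdb <executable> <core>.

```shell
#!/bin/sh
# Sketch only: gdb_rank.sh and ./your_hmc_binary are placeholder names.

# Per-rank wrapper: runs its arguments under gdb in batch mode and writes
# the output, including the backtrace after a crash, to one log per rank.
cat > gdb_rank.sh <<'EOF'
#!/bin/sh
# Rank id from the launcher if it sets one (Slurm or Open MPI), else the PID.
RANK="${SLURM_PROCID:-${OMPI_COMM_WORLD_RANK:-$$}}"
exec gdb -batch -ex run -ex backtrace --args "$@" > "gdb.rank${RANK}.log" 2>&1
EOF
chmod +x gdb_rank.sh

# Print the launch line instead of executing it; the launcher is an assumption.
echo "mpirun -np 2 ./gdb_rank.sh ./your_hmc_binary"
```

Each gdb.rank*.log file would then hold the backtrace of the rank that crashed, which helps show whether the fault is in the parallel write of the lat file.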
I'm testing an HMC workflow with the ILDG checkpointer. The sample code can be accessed here. The code runs well with the NERSC checkpointer, loaded as:
TheHMC.Resources.LoadNerscCheckpointer(CPparams);
It fails when using the ILDG checkpointer, loaded as:
TheHMC.Resources.LoadILDGCheckpointer(CPparams);
The code runs until the first checkpoint, then I get a 'core' file and the following errors. The last few lines of the output are:
I have replicated the error on the Crusher (ORNL) and Tioga (LLNL) AMD machines.
Building Grid: To build Grid, I use the standard procedure with lime, documented here.