Graceful termination of MOOSE with files or signals

andrewritzmann commented 5 years ago

Rationale

Gracefully terminating MOOSE calculations using a SIGNAL or flag file would be extremely useful for avoiding corrupted output files when a user is forced to use a hard kill to terminate a calculation. This would ensure proper checkpoint files and avoid corrupting any data files used for visualization. While sizing jobs to fit within a cluster allocation is desirable, the fact remains that nonlinear convergence rates can change as the calculation progresses, and a best estimate for the end time may turn out to be inappropriate. This also helps users avoid timeouts, which could erase data from local scratch directories.

I strongly encourage the use of a SIGNAL because both PBS (through the qsig command) and SLURM (through, e.g., the scancel command or #SBATCH --signal=... in the submit script) can deliver a signal to a running job. If MOOSE terminates cleanly, then the user's submit script can move files as needed before the job times out.
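To illustrate the signal-based idea, here is a minimal sketch in Python rather than MOOSE's C++. It is not MOOSE code: the class `TerminationRequest` and the function `run_until_stopped` are hypothetical names, and the choice of SIGTERM and SIGUSR1 is an assumption. The handler only records the request, and the time-step loop checks it at a step boundary so that finalization always runs.

```python
import signal


class TerminationRequest:
    """Records that a termination signal arrived; the handler does nothing else."""

    def __init__(self):
        self.requested = False
        self.signum = None

    def handle(self, signum, frame):
        # Only set a flag here; the time-step loop decides when it is safe to stop.
        self.requested = True
        self.signum = signum

    def install(self, signums=None):
        # Assumed signal choice (POSIX only). Python requires this call to be
        # made from the main thread.
        if signums is None:
            signums = (signal.SIGTERM, signal.SIGUSR1)
        for signum in signums:
            signal.signal(signum, self.handle)


def run_until_stopped(num_steps, request, do_step, finalize):
    """Advance up to num_steps steps, stopping early at a step boundary if asked."""
    completed = 0
    for step in range(num_steps):
        if request.requested:
            break
        do_step(step)
        completed += 1
    # Finalization (outputs, checkpoint) runs whether or not a stop was requested.
    finalize(completed)
    return completed
```

With a handler like this installed through `install()`, a scheduler-delivered signal (for example from scancel with its --signal option) would end the run after the step in progress instead of killing the process mid-write.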

Description

This is an enhancement request. For comparison, other codes check for files or file modifications at each time step. Examples:

VASP (Vienna ab Initio Simulation Package) allows users to place a file called STOPCAR in the execution directory that acts as a flag to stop the calculation. It can be used to terminate in different ways depending on the time remaining and the calculation at hand. (https://cms.mpi.univie.ac.at/wiki/index.php/STOPCAR)
OpenFOAM lets the user alter the stopAt flag to read "writeNow" (https://cfd.direct/openfoam/user-guide/v6-controldict/), and the calculation will respond at the next iteration.
NWChem has the option to integrate libslurm, allowing it to query the queueing system and make an educated guess as to whether it can finish the next step in the calculation. It terminates with a message if it determines that it cannot complete the requested calculation.
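The flag-file approach used by codes like these can be sketched as follows. This is an illustrative Python sketch only: the file name `STOP` and the functions `stop_requested` and `run_with_stop_file` are hypothetical, and nothing here reflects how VASP, OpenFOAM, or MOOSE actually implement the check.

```python
from pathlib import Path


def stop_requested(run_dir, flag_name="STOP"):
    """Return True if the (hypothetical) flag file exists in run_dir."""
    return (Path(run_dir) / flag_name).exists()


def run_with_stop_file(num_steps, run_dir, do_step):
    """Advance up to num_steps steps, polling for the flag file before each one."""
    completed = 0
    for step in range(num_steps):
        if stop_requested(run_dir):
            break
        do_step(step)
        completed += 1
    return completed
```

Polling once per time step keeps the cost to a single file-system lookup per step, at the price of not reacting until the current step has finished.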

Impact

This should impact most strongly the internal API of MOOSE. The main execution loop would need to check for the termination mechanism and initiate the appropriate process. I am not sure how this would influence the multiapp infrastructure. Documentation of this new feature would be required on the wiki to indicate how the user can make use of it.

Attachment

I have attached the relevant conversation from the moose-users list between myself, Jacob Bair, and Cody Permann. moose-users_thread_with_Cody_Permann.txt

friedmud commented 5 years ago

Exception handling is already in MOOSE... it does gracefully handle that. This could just piggy-back on that system to handle signals during the solve. It would basically immediately "fail" the current solve and return back to the previous timestep where we could execute FINAL stuff (to get the PerfGraph, etc.) and then end.

If it happens outside of a solve (like during output) then we would just run into it at the start of the next timestep and still do the above.
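A rough Python sketch of that control flow, to make the idea concrete. It does not use MOOSE's actual exception system: `TerminationRequested`, `solve_step`, and `run_with_rollback` are hypothetical names, and the "solve" is a stand-in that works on a trial copy of the state. A stop request fails the solve, the last accepted state is kept, and the final hook runs once before exit.

```python
class TerminationRequested(Exception):
    """Hypothetical exception raised inside a solve when a stop was requested."""


def solve_step(state, stop_flag):
    # Stand-in for a nonlinear solve: work on a trial copy so that a failed
    # solve leaves the previously accepted state untouched.
    trial = dict(state)
    trial["time"] += 1
    if stop_flag():
        raise TerminationRequested()
    trial["value"] = trial["value"] * 2
    return trial


def run_with_rollback(state, num_steps, stop_flag, on_final):
    for _ in range(num_steps):
        try:
            # The trial state is accepted only if the solve completes.
            state = solve_step(state, stop_flag)
        except TerminationRequested:
            # "Fail" the current solve; state still holds the previous timestep.
            break
    # Stand-in for executing FINAL objects before exiting.
    on_final(state)
    return state
```

Because the flag is only consulted inside the solve in this sketch, a request that arrives between solves (for example during output) is picked up by the next call to `solve_step`, which matches the behaviour described above.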

Sounds good to me.

friedmud commented 1 year ago

As an addition here - once we can handle a signal and throw one of our parallel_exceptions... we should also implement the ability to write out a checkpoint file at that time.
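Since the original motivation was avoiding corrupted files, a checkpoint written at termination time should itself survive being interrupted. One common pattern is to write to a temporary file and rename it into place, sketched here in Python. The JSON format and the names `write_checkpoint` and `read_checkpoint` are assumptions for illustration and are unrelated to MOOSE's real checkpoint format.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write state as JSON to path atomically (temporary file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename stays on
    # one file system.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Readers see either the old checkpoint or the new one, never a partial file.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def read_checkpoint(path):
    with open(path) as f:
        return json.load(f)
```

If the process is killed during the write, the previous checkpoint at `path` is left intact and only the temporary file is affected.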

GiudGiud commented 1 week ago

@loganharbour thoughts on closing this now that:

we return exit codes from apps
we can control termination externally using a Terminator UO relying on data that is modified by the WebServerControl

loganharbour commented 1 week ago

I don't think either one of those would be required for this
