Open andrewritzmann opened 5 years ago
Exception handling is already in MOOSE... it does gracefully handle that. This could just piggy-back on that system to handle signals during the solve. It would basically immediately "fail" the current solve and return to the previous timestep, where we could execute FINAL stuff (to get the PerfGraph, etc.) and then end.
If it happens outside of a solve (like during output) then we would just run into it at the start of the next timestep and still do the above.
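A minimal sketch of that control flow, written in Python rather than MOOSE's C++ and using none of MOOSE's actual API. The handler only sets a flag, the solve checks the flag and raises, and the time loop treats that like any other failed solve: it keeps the last completed step, runs the FINAL stage, and ends. All names here (`SignalFlag`, `SolveInterrupted`, `install`, `solve_step`, `run`) are hypothetical.

```python
import signal


class SolveInterrupted(Exception):
    """Hypothetical stand-in for the exception that fails the current solve."""


class SignalFlag:
    """Signal handler that only records that a signal arrived."""

    def __init__(self):
        self.raised = False

    def __call__(self, signum, frame):
        self.raised = True


def install(flag, signums=(signal.SIGTERM,)):
    """Register the flag as the handler (must be called from the main thread)."""
    for signum in signums:
        signal.signal(signum, flag)


def solve_step(step, flag, iterations=3):
    """Pretend solve: checks the flag before every iteration."""
    for _ in range(iterations):
        if flag.raised:
            raise SolveInterrupted(f"signal seen during step {step}")
        # A real nonlinear iteration would run here.


def run(num_steps, flag, after_step=None):
    """Time loop: a failed solve keeps the last completed step, then FINAL runs."""
    completed = 0
    log = []
    for step in range(1, num_steps + 1):
        try:
            solve_step(step, flag)
        except SolveInterrupted:
            log.append(f"step {step} failed, keeping step {completed}")
            break
        completed = step
        log.append(f"step {step} done")
        if after_step is not None:
            after_step(step)  # stands in for output between solves
    log.append("FINAL")  # e.g. print the PerfGraph, close output files
    return completed, log
```

Because the flag is checked at the top of every solve iteration, a signal that arrives during output is picked up at the start of the next timestep without any extra logic.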
Sounds good to me.
As an addition here - once we can handle a signal and throw one of our parallel_exceptions... we should also implement the ability to write out a checkpoint file at that time.
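One way to keep a checkpoint written at termination time from being corrupted itself is to write to a temporary file and rename it into place. The sketch below is only an illustration in Python: the JSON payload and the names `write_checkpoint` and `read_checkpoint` are assumptions and do not reflect MOOSE's checkpoint format or API.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write `state` as JSON so that `path` is never left half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so the rename below
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # replaces any older checkpoint in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path):
    """Load a checkpoint written by write_checkpoint."""
    with open(path) as handle:
        return json.load(handle)
```

If the job is killed hard while the temporary file is being written, the previous checkpoint at `path` is still intact.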
@loganharbour thoughts on closing this now that:
I don't think either one of those would be required for this
Rationale
Gracefully terminating MOOSE calculations using a SIGNAL or flag file would be extremely useful for avoiding corrupted output files when a user is forced to use a hard kill to terminate a calculation. This would ensure proper checkpoint files and avoid corrupting any data files used for visualization. While sizing jobs to fit within a cluster allocation is desirable, the fact remains that nonlinear convergence rates can change as the calculation progresses and a best estimate for the end time may turn out to be inappropriate. This also helps users avoid timeouts which could erase data from local scratch directories.
I strongly encourage the use of a SIGNAL because both PBS (through the qsig command) and SLURM (through, e.g., the scancel command or #SBATCH --signal=... in the submit script) can deliver one to a running job. If MOOSE terminates cleanly, then the user's submit script can move files as needed before the job times out.
Description
This is an enhancement request. For comparison, other codes check for files or file modifications at each time step. Examples:
Impact
This should most strongly impact the internal API of MOOSE. The main execution loop would need to check for the termination mechanism and initiate the appropriate process. I am not sure how this would influence the multiapp infrastructure. Documentation of this new feature would be required on the wiki to indicate how the user can make use of it.
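For the flag-file variant, the per-time-step check in the main loop could look roughly like the Python sketch below. It is not MOOSE code; `stop_requested`, `run_until_flag`, and the `do_step` callback are hypothetical names chosen for illustration.

```python
import os


def stop_requested(flag_path):
    """Return True once the user has created the flag file."""
    return os.path.exists(flag_path)


def run_until_flag(num_steps, flag_path, do_step):
    """Run up to num_steps steps, checking for the flag file before each one."""
    completed = 0
    for step in range(1, num_steps + 1):
        if stop_requested(flag_path):
            break  # leave the loop cleanly so final output can still be written
        do_step(step)
        completed = step
    return completed
```

A user would then stop the run with something like `touch STOP` in the working directory. In a parallel run the check would presumably be done on one rank and the result shared so that all ranks stop at the same step; that part is not shown.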
Attachment
I have attached the relevant conversation from the moose-users list between myself, Jacob Bair, and Cody Permann. moose-users_thread_with_Cody_Permann.txt