openmc-dev / openmc

OpenMC Monte Carlo Code
https://docs.openmc.org
Other
778 stars 507 forks source link

Error behavior when used as a library #1778

Open paulromano opened 3 years ago

paulromano commented 3 years ago

One of OpenMC's "bad" behaviors right now is that when it encounters an error, it most circumstances it will call std::exit (or MPI_Abort when using MPI). This is not a desirable behavior if OpenMC is used as a library (e.g., as part of a multiphysics application). It also sucks when you're using the openmc.lib Python bindings to the C/C++ API and an error takes down the whole executable rather than bubbling up as an exception. We should be following community guidelines with respect to error handling. For example, one of the recommendations in the xSDK community package policies (R3) is:

It is recommended that each package adopt and document a consistent system for propagating/returning error conditions/exceptions and provide an API for changing the behavior. For example, all routines may, by default, return an error code with the option of changing it to generate an abort on the error (for running in the debugger). No package should have hardwired calls to abort, exit, or MPI_Abort(). Also, no package should have hardwired print statements for error conditions. Each package should document which error codes/exceptions are recoverable, which may result in lost resources (for example, unfreed memory), and which indicate that the process may be in an undefined or totally broken state (for example, after a segmentation violation). It is the responsibility of the calling routine not to simply continue the computation when a “hard” error is returned; the calling routine will likely, by default, call an abort, but again that should be possible to override.

What we have is currently a bit of a mess:

  1. Some errors will result in std::exit or MPI_Abort
  2. Some errors result in C++ exceptions being thrown
  3. Some C API functions will return an error code (these are indeed converted to exceptions in openmc.lib Python bindings)
gridley commented 3 years ago

This is something I'd like to help take on because error handling on GPU code is currently impossible. My current thinking is that we should add some new global variable like error_state that is an enum of possible error codes (or maybe each thread should own a copy?). The python API functions could work as-is and return that code. Then if running main() from C++, you'd just return back to main() and add some function for printing helpful error messages before exiting or possibly returning that error code.

If we took this approach on GPU, it would be nice, because this error flag could be set by some thread on the device, then the appropriate handling be done on the CPU after the kernel is done.

paulromano commented 3 years ago

I think that approach sounds pretty nice. One of the nice things about storing the state in a global as opposed to returning the value is that you don't need to pass the return value through nested function calls. Exceptions also solve that problem, but of course they are no-go for GPUs. Note that we already have a global openmc_err_msg variable that is used by openmc.lib to get the error message.

For simplicity, we might want to just have the error state be an int rather than an enum so that it is clearly visible through a C API.