ls1mardyn / ls1-mardyn

ls1-MarDyn is a massively parallel Molecular Dynamics (MD) code for large systems. Its main target is the simulation of thermodynamics and nanofluidics. ls1-MarDyn is designed with a focus on performance and easy extensibility.
http://www.ls1-mardyn.de
Other
28 stars 15 forks source link

Use proper exit codes #341

Open HomesGH opened 3 weeks ago

HomesGH commented 3 weeks ago

Description

For now, the exit codes are just randomly set. This PR makes the exit message more meaningful by adding the file, function and line where the exit code was called. In addition, it enforces an error message to be passed.

This was achieved by adding the wrapper MARDYN_EXIT to src/utils/mardyn_assert.h and calling the wrapper when exiting instead of the function mardyn_exit.

An example error output now looks like this (each rank prints the error messages, see comments below):

ERROR:  20241010T205703 0.001553 [0]    Exit called from function `getValue` in file `src/utils/xmlfile.cpp:554` with message:
ERROR:  20241010T205703 0.001561 [0]    ERROR parsing all chars of "100000." from tag "<steps>" in xml file
ERROR:  20241010T205703 0.001567 [0]    This might be the result of using a float while an integer is expected.

Resolved Issues

Questions

~Should we replace the current codes with something else? If so, what? e.g. that one:~ --> We decided to drop all exit codes and just print a meaningful error message and exit with EXIT_FAILURE. See comments below and edit history of this comment for details.

FG-TUM commented 2 weeks ago

Should we replace the current codes with something else?

I like the idea of the macro that prints the error coordinates. Maybe it could also have a mandatory string argument that has to contain an error message that is logged. With all that, I don't see a reason to have any cryptic exit codes at all. We could simply use EXIT_FAILURE.

HomesGH commented 5 days ago

Just for clarity please update the initial PR description with an example of a new error message.

I have set the output now to Log::global_log->error() instead of Log::global_log->error_always_output(). With the latter, each rank prints the message. That means, for large simulations, we would get a high number of (probably mainly) duplicated output.

FG-TUM commented 5 days ago

I have set the output now to Log::global_log->error() instead of Log::global_log->error_always_output(). With the latter, each rank prints the message. That means, for large simulations, we would get a high number of (probably mainly) duplicated output.

Doesn't this mean that if an error occurs on a rank other than 0 we get no error at all?

HomesGH commented 5 days ago

I have set the output now to Log::global_log->error() instead of Log::global_log->error_always_output(). With the latter, each rank prints the message. That means, for large simulations, we would get a high number of (probably mainly) duplicated output.

Doesn't this mean that if an error occurs on a rank other than 0 we get no error at all?

True… Before, the message was mainly only print by rank 0 (due to Log::global_log->error()) but exited anyway.

What do you suggest?

  1. Print with error_always_output() (duplicated error messages)
  2. Add a if (rank==0) to mardyn_exit
FG-TUM commented 5 days ago

There are different types of errors: Those that happen on every rank (e.g. errors while parsing the input file), there it would probably be fine if only one rank prints the message, and those where only one rank is affected, e.g. when a particle can not be placed properly after an MPI rebalance. In the latter case, it is crucial that each rank prints their error message.

For simplicity I think I'd just always print from each rank. But if you feel that is too spammy you could add a bool argument to MARDYN_EXIT like onlyPrintMsgFromMaster for errors of the first category described above. This, however, should be treated with care and probably should default to false.

HomesGH commented 5 days ago

There are different types of errors: Those that happen on every rank (e.g. errors while parsing the input file), there it would probably be fine if only one rank prints the message, and those where only one rank is affected, e.g. when a particle can not be placed properly after an MPI rebalance. In the latter case, it is crucial that each rank prints their error message.

For simplicity I think I'd just always print from each rank. But if you feel that is too spammy you could add a bool argument to MARDYN_EXIT like onlyPrintMsgFromMaster for errors of the first category described above. This, however, should be treated with care and probably should default to false.

The error messages are now printed with Log::global_log->error_always_output() so that every rank prints it. I am not a big fan of an additional argument, since it makes it more complicated. This would probably cause some trouble/confusion in the future.