nutofem / nuto

NuTo - yet another finite element library
https://nuto.readthedocs.io
Boost Software License 1.0
17 stars 5 forks source link

Exceptions #29

Closed Psirus closed 7 years ago

Psirus commented 7 years ago

I want to talk about some of the exceptions (and lack thereof) in NuTo. In particular, there are two instances that bug me most:

  1. The error code from the constitutive law.
  2. The try-catch blocks that do nothing (except making debugging unnecessarily difficult)

The return code problem

Every constitutive law Evaluate returns an error code, which can be of three values (success, not implemented and did not converge). Now why do I think this is bad?

First of all this is the prototypical application for exceptions: if everything goes well, return output values, if something goes wrong, throw an exception. It's not idiomatic C++.

Second, we're adding a lot of unnecessary code between the place where the error occurs and the place where we could possibly handle the error. So let's say MisesPlasticity returns eError::NO_CONVERGENCE, what could possibly be done (and this might have been the original intention) is to decrease the time step in the time integration scheme. But between them, there are a lot of functions. Specifically

All of them now need to pass on the error, and also have a suboptimal interface: the return values could be the output of the function, but instead the output is passed into the function via reference, while the return value is this error code.

And now, for the cherry on top: the Structure::Evaluate simply does if (error != SUCCESSFUL) throw Exception;, with no one handling it. So we might as well throw it in the ReturnMapping.

Most of these arguments are also explained in the ISO C++ FAQ, specifically this one and this one.

Now, enough moaning, this is how it could be done instead: The point of the error throws, and the somewhere down the stack, it is handled. For example:

MisesPlasticity::ReturnMapping(...)
{
    // ... do your thing ...
    // if it did not converge, throw
    if (iterations > 100)
    {
        throw NuTo::LawDidNotConverge();
    }
    // otherwise
    return {NewStress, NewTangent, NewStaticData};
    // notice how I don't have to allocate the outputs in the caller anymore
}

and your time integration catches it

TimeIntegration::Solve(...)
{
    // ... do your thing ...
    try
    {
        structure.Evaluate(); 
    }
    catch (NuTo::LawDidNotConverge& e)
    {
        TryAgainWithSmallerTimeStep;
    }
    // ...
}

The unnecessary catches

There are many unnecessary catch blocks in NuTo, prohibiting the effective use of the debugger. So for example a constitutive law throws an exception (e.g. a parameter is missing), you're probably going to get an output sort of like this

[ConstitutiveLaw::Evaluate]
    A parameter is missing.
[IPAdditiveOutput::AdditiveOutputEvaluate]
    Exception while evaluating constitutive law attached to an additive output.
[ElementBase::EvaluateConstitutiveLaw]
    Error evaluating the constitutive model.
[Structure::Evaluate]
    Error evaluating the structure.
[NewmarkDirect::Solve]
    Error performing Newton-Raphson iteration.

which is, admittedly, quite nice as a quick error log. But now if you want to find the error using a debugger, your backtrace is gone:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007ffff271640a in __GI_abort () at abort.c:89
#2  0x00007ffff302c04d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff302a016 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff302a061 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff302a2c9 in __cxa_rethrow () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff782e0cb in NuTo::NewmarkDirect::Solve (this=0x7fffffffd9a0, rTimeDelta=3600)
    at /home/cpohl/Code/Cpp/nuto/src/mechanics/timeIntegration/NewmarkDirect.cpp:446
#7  0x000000000040a19d in main () at /home/cpohl/Code/Cpp/nuto/myNutoExamples/DamageShell.cpp:142

You can't go back to the original error location and all the variables that were allocated there ("Which parameter was it trying to access?") are long gone. I think that if you can't handle the error (and none of these catches do), don't catch it at all.

If we do this, then the "quick error log" as it is now would be gone, but the stacktraces are so much more useful. The only issue I see with this is Python, but there are other ways to do handle this. We could for example print the stacktrace at the error site using something like this.


Edit: The question I've heard multiple times when this comes up is "Why don't you simply say catch throw in your debugger?". Here is why: the way exceptions work in NuTo is basically "This is what didn't work. Terminate program.". But this is not all exceptions can do. As hinted previously, you can handle errors ("if the file is already there, choose a different name; if the law did not converge, make the timestep in Newmark smaller; etc.") and then continue your program. If we want to handle the errors in some cases, than you may have multiple exceptions before the one that you can't handle. In that case you need to say continue, check if it now the exception you want, if not say continue, etc. The basic assumption should be that exceptions can occur and can be handled.

TTitscher commented 7 years ago

I experience the same problems with the backtrace. My solution (and many others do it as well) is to use the "break on throw" option of the debugger to get a meaningful stacktrace and the possibility to have a look at the current values. So I would agree with your suggestions.

On top of that, many exceptions may be obsolete when we use objects instead of IDs.

void DoStuffWithNodeGroup(int rNodeGroupID)
   // may throw "ID does not exist"
   // may throw "ID does not correspond to a Group<Node>"

void DoStuffWithNodeGroup(Group<Node>& rNodeGroup)
  // this method does not throw...
Psirus commented 7 years ago

I've tried to implement it, and the result is right now in the branch remove_error. The commit e35f88fd9e35191b7c1e4004336b69074bd55086 fixes the first part, and 688de77ff4d65b7b3a9427d3327ce0d1bc9c16b5 the second. Note that the interfaces for all the functions are not "better" yet, their return value type is simply void. Also, the second commit is so big because of all the reindenting. So let me know whether this would be fine to merge.

joergfunger commented 7 years ago

I perfectly agree. The original idea using the "not converged" exception was used that way (reduce the time step), but somehow got lost in the last updates. The only thing to consider is that there might be a problem when passing the exception directly from the constitutive law to the time integration scheme (and leaving out intermediate steps). In some cases, it is important to reset some values that might have changed (and now need to be reset to the initial values related to the old time step). Another even more serious issue is the requirement of exception safe code, which is especially important for parallel applications. Say one ip on one processor returns an exception, but all the others work fine. This now requires to communicate the exception to the other processors. In our case, this communication should be done in the assembly. If we now just throw in the constitutive law and catch in the time integration scheme, this would not work. Exceptions may not be thrown and caught across omp constructs. That is, if a code inside an omp for throws an exception, the exception must be caught before the end of the loop iteration; and an exception thrown inside a parallel section must be caught by the same thread before the end of the parallel section. This problem is even more difficult to handle when using MPI.

This is the reasons, why I currently do think that exceptions should not be used to avoid the need to pass return values (not_converged) along the whole chain from where it is happening to where it is processed.

TTitscher commented 7 years ago

I think we should use exceptions over error codes for the reasons in the links above. And all the problems @joergfunger mentioned, have to be solved with error codes as well. I do not see the advantage. But let's try to solve them:

The LawDitNotConverge exception is currently the only one worth catching. (We cannot/should not recover after exceptions like "Youngs modulus not set" or "Element has no Law assigned").

In some cases, it is important to reset some values that might have changed (and now need to be reset to the initial values related to the old time step).

In these cases, we could catch/rethrow the exception, wherever it is needed.

try 
 // ...
catch (NuTo::LawDidNotConverge& e) {
    ResetValues();
    throw;
}

If it is a LawDidNotConverge exception, this code should only be in the time integration.

parallel applications

How to catch exceptions in OPENMP?

So we should catch all exceptions inside the parallel region and rethrow in the serial region, as shown in the link above, and as said by Jörg. I see no problem with that. Additionally, we could put the parallel try/catch inside a #ifdef OPENMP.

MPI

The communication of the error can be the same as when using error codes. But instead of an error code, we catch the exception before a barrier or something.

joergfunger commented 7 years ago

I agree but there is more to be thought of. The current idea for using the parallel solvers from the Trinlinos-package are based on the idea to assemble all matrices on each processor in parallel (MPI). Only at the point, where the solver is called, a transfer to the Trilinos data structure is performed, solve and backpropagate the solution. In this case, all functions in the time integration scheme that evaluate the material law (internal gradient, hessian) that might throw have to have a barrier and communicate that everything worked well and no error occured (or in the opposite case that an error occured and the time step is reduced).

Psirus commented 7 years ago

I still vote to merge this as is. While I understand the objections, I'd prefer to make no further modifications at the moment for two reasons:

  1. The previous version with the error codes can't handle any of these cases either.
  2. The problems are hypothetical.

Why do I think they are hypothetical:

The requirement that any change must also solve any hypothetical problems in the future is just too hard. I'd rather solve a real problem now, than a hypothetical one later.

joergfunger commented 7 years ago

Maybe there is a misunderstanding. I strongly support the merge as is. My remarks were only to highlight the problems that I see as soon as we use exceptions and process them as a standard design strategy instead of considering exceptions only as an error that has to be fixed.