idaholab / moose

Multiphysics Object Oriented Simulation Environment
https://www.mooseframework.org
GNU Lesser General Public License v2.1
1.76k stars 1.05k forks source link

Norm of ADVariableValue near 0 returns hard to understand exception #22076

Open GiudGiud opened 2 years ago

GiudGiud commented 2 years ago

Bug description

Taking the norm of an ADVariableValue near 0 returns the message in the Discussion below. This is not ideal for user-facing output

Steps to reproduce

ADRealVectorValue(0).norm() should suffice to reproduce this.

Steps to fix

More floating point exceptions in moose to catch this Metaphysicl/libmesh exception

Impact

Better errors for users

Discussed in https://github.com/idaholab/moose/discussions/22072

Originally posted by **aarograh** September 12, 2022 I'm working on updating submodules and MOOSE environment for my app. The update is a small step forward (`moose-libmesh` version updated from `2022.03.18` to `2022.03.28`) but breaks several of my tests with this error: ``` Floating point exception signaled (floating point divide by zero)! To track this down, compile in debug mode, then in gdb do: break libmesh_handleFPE run ... bt ``` Attaching the debugger and following the instructions given by the error indicates that the error occurs at this line: ``` if (_velocity[_qp].norm() < 1.e-8) ``` `_velocity` is a `ADVectorVariableValue` object. It seems that in the update, the result of the `.norm()` function call has changed in some way. So I have 2 questions: 1. I cannot for the life of me determine how to print out the components of `_velocity[_qp]`. It looks like it's ultimately a templated `libMesh` object. I figure if I could just inspect the object a bit, I might be able to determine what changed and fix it. 2. Any ideas what may have changed with the norm function during that update?
lindsayad commented 2 years ago

This is not a libMesh/MetaPhysicL exception. This is a floating point exception which is not an exception in the C++ sense; it is a signal. You cannot catch it. Perhaps you want to add a libMesh exception to TypeVector::norm? And then if so, we can catch those and emit a mooseError?

GiudGiud commented 2 years ago

It's a good point. I think it's a good idea. We have more and more AD users, and these errors are quite sneaky. Earlier today, we got the same with @tanoret with std::pow(some_ad, 1.5)

lindsayad commented 2 years ago

@roystgnr do you think we should put in exception throwing for operations like TypeVector::norm if all coordinates are 0? This kind of thing is so tricky to make decisions on IMO since we would be performing LIBMESH_DIM new checks for every call to norm. But that's probably premature optimization thinking without any profiling to back up the concern

hugary1995 commented 2 years ago

What's the behavior of taking the norm of a non-AD RealVectorValue? Does it raise an exception or simply returns NaN? I'd prefer NaNs in this case, both for RealVectorValue and ADRealVectorValue.

lindsayad commented 2 years ago

Norm of a 0-vector is 0. The derivative is the only thing that is problematic

hugary1995 commented 2 years ago

Sorry, I meant the derivative, which is NaN.

hugary1995 commented 2 years ago

Reasoning: in tensor_mechanics we decide whether to compute the derivatives based on is_ad, and therefore I would hope we could get consistent behavior. Since for a vector v the hand-coded derivative of its norm w.r.t. itself is v/v.norm(), I would hope the dual numbers of v.norm() to be NaNs as well.

lindsayad commented 2 years ago

We enable the following floating point "exceptions" by default in dbg mode:

void enableFPE(bool on)
{
#if !defined(LIBMESH_HAVE_FEENABLEEXCEPT) && defined(LIBMESH_HAVE_XMMINTRIN_H)
  static int flags = 0;
#endif

  if (on)
    {
#ifdef LIBMESH_HAVE_FEENABLEEXCEPT
      feenableexcept(FE_DIVBYZERO | FE_INVALID);

So both the AD and non-AD code (if it is actually doing v/v.norm()) should raise floating point "exceptions" in dbg mode and according to IEEE indeed should result in NaNs for other compilation modes

roystgnr commented 2 years ago

do you think we should put in exception throwing for operations like TypeVector::norm if all coordinates are 0?

Thinking about it, I can't avoid the conclusion that at some point we ought to have two different behavioral options here:

In one category of use cases, it's entirely reasonable at the low level to return (0, NaN) for the std::norm(0) (value,derivatives). Calculations which then proceed to throw away the derivatives and only use the norm will work fine (albeit slower than they should, but that's what turning SIGFPE on and off helps us debug), and calculations which attempt to use the derivatives too will correctly propagate NaN rather than returning something that might be garbage. If I'm trying to use these derivatives to calculate a sensitivity solve or an adjoint solve, then the fact that I don't actually possess a two-side derivative and I can't calculate a precise sensitivity or adjoint solution is extremely important.

However, that's not what we do for e.g. std::abs(DualNumber) - where we return f'=0 for x=0 - and in the second category of use cases, which IIRC is pretty much all MOOSE code right now, that is clearly the right behavior. If we're just getting a Jacobian to use for a primal solve, we don't actually care if it's an exact Jacobian, we just care that it's close enough to a linearization for us to get as good a quasi-Newton step as we can get, and returning something that's "in the middle of" the set of one-sided derivatives is about as good an approximation as we can get for that purpose, and returning NaN would be something in between an annoying mess (if a kernel has to manually test for NaN around any operation that might not have well-defined derivatives) and an utter disaster (if user code is stuck running with SIGFPE enabled and so doesn't get the chance to test for NaN).

MOOSE in dbg mode is the "utter disaster" case, isn't it? Because enabling SIGFPE and treating it as always an error is a great way to hunt bugs? So we do need to do something about that.

I don't see exception-throwing as the solution, though. If we're not going to catch that exception in user code then we might as well just let the SIGFPE kill the program. And if we are going to catch the exception in user code, we might as well just replace the norm(v) call there with something along the lines of (raw_value(v) == 0 ? ADReal(0) : norm(v)) (with the typing fixed up, and encapsulated as regularized_norm(ADReal) or something. That'll be faster for everybody if we only do those tests where we expect to need them rather than in every single norm() call everywhere, and it should even be faster for the regularized_norm users if we just invoke a test-and-jump rather than of the whole C++ exception-throwing machinery.

The only way I could see an exception as being preferable is if we have cases where norm() is getting called at a low level, and the code calling it doesn't know whether norm(0) should be regularized or should scream-and-die, but code further up the stack does know. ... which doesn't sound entirely implausible, now that I think about it. Do we have cases like that?

dschwen commented 2 years ago

Isn't the limit if the derivative inifinity? Why not just use that instead of NaN?

roystgnr commented 2 years ago

norm((x,y,z)) in 3D is a cone surface in 4D ... it's easy to picture in 2D, with a 3D cone. If you look at the directional derivative at 0, d(norm)/d(direction) will always be 1, not infinity at all, but because there's no gradient vector which has a dot product of 1 with every possible direction there's no well-defined gradient. The "average" gradient over all direction vectors is 0, same as with the two one-sided derivatives of abs() (which is the same as norm() in 1D).

dschwen commented 2 years ago

D'oh, yeah, never mind me. Tip of the cone is definitely not a well defined derivative.

lindsayad commented 2 years ago

The only way I could see an exception as being preferable is if we have cases where norm() is getting called at a low level, and the code calling it doesn't know whether norm(0) should be regularized or should scream-and-die, but code further up the stack does know. ... which doesn't sound entirely implausible, now that I think about it. Do we have cases like that?

I feel like most calls to norm are in user code, or at least at the MOOSE framework level and above.

we might as well just replace the norm(v) call there with something along the lines of (raw_value(v) == 0 ? ADReal(0) : norm(v)) (with the typing fixed up, and encapsulated as regularized_norm(ADReal) or something.

I like that