trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

NOX: Treat Exception as Solve Failure #1608

Closed jmgate closed 4 years ago

jmgate commented 7 years ago

Charon has run into an issue where, in the midst of a LOCA continuation run, a preconditioner winds up throwing an exception. It would be ideal for this instance to be treated as a solve failure such that LOCA could back up, decrease the step size, and keep on going. We may be able to accomplish this by adding some exception handling logic to NOX::Thyra::Group::updateLOWS(), and then modifying that routine to return something that will eventually indicate a solve failure. @trilinos/nox

jmgate commented 7 years ago

Email from @jmgate to @etphipp:

Hey Eric,

Suzey Gao has a Charon test case she’s trying to run with a LOCA sweep. There’s a block LDU preconditioner operator getting built because her example uses a current constraint on some terminal of the device. In the midst of creating the preconditioner, it realizes the Schur complement is singular and throws an exception. Apparently at that point Charon quits and shows you the exception instead of cutting down the LOCA step size and trying again. Suzey’s current workaround is to restart the simulation using the solution from the last successful step as an initial guess, and then using the initial step size again (as opposed to the max step size). Do we need to be doing something such that we catch and handle this exception that’s getting thrown such that LOCA can cut the step size down and then keep on chugging?

Many thanks,

Jason

jmgate commented 7 years ago

Response:

Hi Jason,

So who is throwing the exception? Is it the preconditioning package (Teko, I guess)? The natural thing to do here is have LOCA catch the exception and treat it as a failed nonlinear solve step (in which case it would automatically reduce the step size and try again). However there is no logic in LOCA currently to do that. I would have to look at the code a little bit to see how easy/hard that would be to do in LOCA.

-Eric

jmgate commented 7 years ago

Response:

Teko is throwing the exception:

Teko: "rebuildInverse" could not construct the inverse operator using "Teko::AutoClone<charon::Schur2x2PreconditionerFactory, charon::Schur2x2PreconditionerFactory>"
THROWN EXCEPTION
/home/xngao/Program/Trilinos/packages/thyra/core/src/support/operator_solve/client_support/Thyra_DefaultSerialDenseLinearOpWithSolve_def.hpp:198:
Throw number = 2
Throw test that evaluated to true: (dim) != (rank)
Error, (dim = 1) != (rank = 0)!

Then Charon is catching that and rethrowing with a more useful message:

Teko: "rebuildInverse" could not construct the inverse operator using "Teko::AutoClone<charon::Schur2x2PreconditionerFactory, charon::Schur2x2PreconditionerFactory>"
THROWN EXCEPTION
/home/xngao/Program/Trilinos/tcad-charon/src/solver/Charon_Schur2x2PreconditionerFactory.cpp:164:
Throw number = 3
Throw test that evaluated to true: true
Schur2x2PreconditionerFactory::buildPreconditionerOperator(): I'm afraid it looks like S is not invertible.

Any other information you need?

Jason

jmgate commented 7 years ago

Response:

Well, you could try adding this patch to LOCA:

diff --git a/packages/nox/src-loca/src/LOCA_Stepper.C b/packages/nox/src-loca/src/LOCA_Stepper.C
index d31052626d..24301d2d7c 100644
--- a/packages/nox/src-loca/src/LOCA_Stepper.C
+++ b/packages/nox/src-loca/src/LOCA_Stepper.C
@@ -588,7 +588,13 @@ LOCA::Stepper::compute(LOCA::Abstract::Iterator::StepStatus stepStatus)
     printStartStep();
 
     // Compute next point on continuation curve
-    solverStatus = solverPtr->solve();
+    try {
+      solverStatus = solverPtr->solve();
+    }
+    catch(...) {
+      // Treat any un-caught exception as a solver failure
+      solverStatus = NOX::StatusTest::Failed;
+    }
 
     // Check solver status
     if (solverStatus == NOX::StatusTest::Failed) {

It adds a try { ... } block around the solve and catches any previously un-caught exception, treating it as a solver failure. That's a bit extreme, because some exceptions you might want to pass along, although I have no idea how you would determine that. It's also not clear to me if LOCA should be doing this, or if this is something that should be done inside NOX.

Roger, do you have any thoughts on that?

-Eric

jmgate commented 7 years ago

From @rppawlo:

I’m surprised that any steps are successful. Are you able to take multiple steps past the initial step before this failure occurs? The failure looks like a size check in Teko objects:

Throw test that evaluated to true: (dim) != (rank)

We should really dig into Teko and see what is happening here. It's possible that the LOCA augmented system is being seen by Teko as a block system and it is trying to invert the blocks.

My preference is to fix the preconditioner. If Teko can’t invert the blocks then that is an exceptional case. I don’t think LOCA or NOX should try to recover from this at the status test level.

Roger

jmgate commented 7 years ago

Response:

In Jason's original email it sounded like this was happening somewhere in the middle of a continuation run. I am guessing from the exception message that Teko has a 1x1 block that is zero, or zero to some tolerance in a rank calculation.

-Eric

I don't think it is possible that Teko is getting an augmented system created by LOCA, since LOCA can't create augmented systems in the form Teko would accept. It may be getting an augmented system created by Charon for the current constraint.

-Eric

jmgate commented 7 years ago

Response:

Yup, this is in the middle of a continuation run from 0 to 30 with an initial step size of 0.01. LOCA ramps up to a step size of 1 and gets as far as 10.5 for the continuation parameter before we run into this problem.

Specifically, Charon’s using a block LDU preconditioner in cases with a current constraint. If A = {{F, U}, {L, G}} is the blocked system, it computes S = G - L F^{-1} U, but then it tries to invert S when it’s singular. I was catching the exception thrown by Teko and rethrowing with a more useful error message so I could know what was going on. Any ideas as to what I should do instead? I suppose at the very least I could catch the Teko exception and return the identity as the preconditioner and we could just see what happens. Problem is this particular test case runs for a good eight hours before the problem manifests, so just trying things to see what works is very time consuming.
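To make the failure mode concrete, here is a minimal sketch of the Schur complement computation for a system with a single constraint row, so that S is a 1x1 block. The dense 2x2 F, the function names, and the tolerance are illustrative assumptions, not Charon's or Teko's actual code.

```cpp
#include <array>
#include <cmath>
#include <stdexcept>

// Illustrative only: F is a dense 2x2 block, U a 2x1 column, L a 1x2 row,
// and G a scalar, so S = G - L F^{-1} U is a single number.
using Mat2 = std::array<std::array<double, 2>, 2>;
using Vec2 = std::array<double, 2>;

// Solve F x = b for a 2x2 F by Cramer's rule; throws if F is singular.
Vec2 solve2x2(const Mat2& F, const Vec2& b) {
  const double det = F[0][0] * F[1][1] - F[0][1] * F[1][0];
  if (det == 0.0) throw std::runtime_error("F is not invertible");
  return {(b[0] * F[1][1] - F[0][1] * b[1]) / det,
          (F[0][0] * b[1] - b[0] * F[1][0]) / det};
}

// S = G - L F^{-1} U for the blocked system A = {{F, U}, {L, G}}.
double schurComplement(const Mat2& F, const Vec2& U, const Vec2& L, double G) {
  const Vec2 FinvU = solve2x2(F, U);
  return G - (L[0] * FinvU[0] + L[1] * FinvU[1]);
}

// A 1x1 block is invertible only if it is nonzero (here: above a tolerance,
// which is an assumed value).
bool schurIsInvertible(double S, double tol = 1e-14) {
  return std::fabs(S) > tol;
}
```

With F equal to the identity, U = (1, 2), L = (3, 4), and G = 11, the sketch gives S = 0, which is the kind of 1x1 block that cannot be inverted; checking schurIsInvertible() before attempting the inverse would let the caller decide how to react.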

Thanks for the help,

Jason

jmgate commented 7 years ago

At this point I'm waiting on Suzey Gao to return from vacation so I can get my hands on her actual example. Once I have it I can start experimenting with catching the exception in NOX::Thyra::Group::updateLOWS() and figuring out how we can use that to indicate a solve failure. I'll submit a pull request once I have something working so we can decide if it's the right way to go.

rppawlo commented 7 years ago

So after our discussion yesterday with the Teko team, I think what we really need is to add a generic exception in Thyra or Stratimikos that all preconditioners can throw if they encounter a catastrophic failure. Then we could put a catch for this particular exception into NOX or LOCA that allows LOCA to cut the continuation step and restart the solver. The reason I would like a specific exception is that we need to differentiate this from all other exceptions where the entire simulation should truly terminate (we don't want to waste compute cycles running garbage for other exceptions). Does this sound reasonable, @jmgate @eric-c-cyr @etphipp @egphill?
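A rough sketch of that idea, in plain C++ with no Trilinos dependency: only a dedicated exception type is turned into a failed solve, and the continuation loop reacts by halving the step. PreconditionerFailure, guardedSolve, continueTo, and the halving factor are all hypothetical names and choices made for illustration; they are not the NOX or LOCA API.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for the dedicated exception discussed above.
struct PreconditionerFailure : std::runtime_error {
  using std::runtime_error::runtime_error;
};

enum class SolveStatus { Converged, Failed };

// Run one nonlinear solve. Only the dedicated exception becomes a failed
// solve; any other exception propagates and terminates the run.
SolveStatus guardedSolve(const std::function<void(double)>& solveAt, double p) {
  try {
    solveAt(p);
    return SolveStatus::Converged;
  } catch (const PreconditionerFailure&) {
    return SolveStatus::Failed;
  }
}

// Toy continuation from p0 to pEnd: halve the step after a failed solve and
// retry from the last good point. Returns the parameter value reached.
double continueTo(const std::function<void(double)>& solveAt,
                  double p0, double pEnd, double step, double minStep) {
  double p = p0;
  while (p < pEnd && step >= minStep) {
    const double trial = (p + step < pEnd) ? p + step : pEnd;
    if (guardedSolve(solveAt, trial) == SolveStatus::Converged) {
      p = trial;    // accept the step
    } else {
      step *= 0.5;  // back up and cut the step size
    }
  }
  return p;
}
```

If the step shrinks below minStep before pEnd is reached, continueTo returns the last good parameter value, so a caller can tell that the run stalled rather than finished.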

jmgate commented 7 years ago

Yes, that sounds good to me.

jmgate commented 7 years ago

@rppawlo, do you have a guess as to how long it might take to build in this generic exception to Thyra/Stratimikos? Just need to figure out if I tell Charon to wait a week or give everyone a patch to tide them over.

jmgate commented 7 years ago

Is adding this generic exception to @trilinos/thyra or @trilinos/stratimikos something I should take on? I don't really have any experience contributing to either package.

bartlettroscoe commented 7 years ago

Is the issue that a preconditioner is applied as Thyra::LinearOpBase::apply()? That function has no way of failing (it is assumed to always pass). So you would need an exception called something like Thyra::LinearOpApplyFailed, which would mean that Thyra::LinearOpBase::apply() really should have been able to compute the application of the linear operator but could not for some reason (e.g. max num iterations exceeded). I think you want a specific name like this so you don't accidentally catch some other exception that should bring the program down.

It seems like that might not be too hard to add. But the problem is that various solver packages would need to be upgraded to respond to that exception in a logical way. But if you only upgrade LOCA to respond to this, then this would still provide value.

jmgate commented 6 years ago

@rppawlo, any ETA on a preconditioner catastrophic failure exception in Thyra or Stratimikos that NOX or LOCA can then catch?

bartlettroscoe commented 6 years ago

All that needs to be done on the Thyra side is to define the exception class Thyra::LinearOpApplyFailed and then document it in the Thyra::LinearOpBase::apply() function documentation. Then subclasses of Thyra::LinearOpBase need to throw an exception of that type and clients of Thyra::LinearOpBase need to catch exceptions of that type.
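The three pieces named here (an exception class, a subclass that throws it, and a client that catches it) could look roughly like the following. This is a simplified sketch: LinearOp, InverseDiagonalOp, and tryApply are made-up stand-ins, and the apply() signature is not Thyra's real one.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical exception type, named after the suggestion in this thread.
class LinearOpApplyFailed : public std::runtime_error {
public:
  explicit LinearOpApplyFailed(const std::string& what)
    : std::runtime_error(what) {}
};

// Simplified stand-in for an operator interface (not Thyra's real signature).
class LinearOp {
public:
  virtual ~LinearOp() = default;
  // Computes y = Op * x.
  // Throws LinearOpApplyFailed if the result cannot be computed.
  virtual void apply(const std::vector<double>& x,
                     std::vector<double>& y) const = 0;
};

// Subclass: applies the inverse of a diagonal matrix, which fails on a zero
// diagonal entry.
class InverseDiagonalOp : public LinearOp {
public:
  explicit InverseDiagonalOp(std::vector<double> diag)
    : diag_(std::move(diag)) {}
  void apply(const std::vector<double>& x,
             std::vector<double>& y) const override {
    y.resize(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
      if (diag_.at(i) == 0.0)
        throw LinearOpApplyFailed("zero pivot in row " + std::to_string(i));
      y[i] = x[i] / diag_[i];
    }
  }
private:
  std::vector<double> diag_;
};

// Client: reports a recoverable failure instead of letting it escape.
bool tryApply(const LinearOp& op, const std::vector<double>& x,
              std::vector<double>& y) {
  try {
    op.apply(x, y);
    return true;
  } catch (const LinearOpApplyFailed&) {
    return false;
  }
}
```

Documenting the exception on the base-class method is what lets a client written against the interface rely on it without knowing which subclass it holds.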

@jmgate, can you take a stab at updating the Thyra_OperatorVectorTypes.hpp file to add that new exception type (see other examples there) and then update the Thyra_LinearOpBase_decl.hpp file to document that exception class in the documentation for the function Thyra::LinearOpBase::apply()? Then I can review it.

jmgate commented 6 years ago

Yup, I'll give it a shot.

jmgate commented 6 years ago

Added the exception—working on documentation…

jmgate commented 6 years ago

This work is complete and ready to be merged in #2016. It has already been reviewed and approved by @rppawlo and @etphipp. Waiting on a review of the @trilinos/thyra changes from @bartlettroscoe and then it's good to go.

bartlettroscoe commented 6 years ago

Waiting on a review of the @trilinos/thyra changes from @bartlettroscoe and then it's good to go.

@jmgate,

I thought I submitted my review yesterday in #2016 but it looks like it did not complete. I completed it just now. Basically I requested a one-line change (needed to satisfy the interface).

jmgate commented 6 years ago

Yeah GitHub's been a little glitchy for me lately, which is why I'm pestering people in case they're missing things like I am.

bartlettroscoe commented 6 years ago

FYI: I pushed a commit in the branch:

that demonstrates how you might update the Thyra::LinearOpBase interface to allow for a runtime numerical (or other) failure when applying a linear op, and for NOX to not have a required dependency on Thyra. But I think this is not quite right and I would need to study the full NOX implementation and the Thyra adapter subclasses to really do the right thing. Anyway, it's not that big of a deal if the NOX use case works.

jmgate commented 6 years ago

Closed in #2016.

jmgate commented 6 years ago

@rppawlo, @etphipp, apparently this issue is still causing some problems for Charon. @suzeygao reports she's seeing the following some 30 hours into a LOCA continuation run:

************************************************************************
-- Nonlinear Solver Step 4 -- 
||F|| =       nan  step = 1.000e+00  dx = 1.194e+05
************************************************************************

       CALCULATING FORCING TERM
       Method: Constant
       Forcing Term: 1e-06
 Teko: "rebuildInverse" could not construct the inverse operator using "Thyra::AmesosLinearOpWithSolveFactory{solverType=Klu}"

 *** THROWN EXCEPTION ***
 /ascldap/users/xngao/Program/Trilinos/packages/stratimikos/adapters/amesos/src/Thyra_AmesosLinearOpWithSolveFactory.cpp:346:

 Throw number = 1

 Throw test that evaluated to true: 0!=err

 Error, NumericFactorization() on amesos solver of type 'Amesos_Klu'
 returned error code -22!
 ************************
 Teko: "rebuildInverse" could not construct the inverse operator using "Teko::AutoClone<charon::Schur2x2PreconditionerFactory, charon::Schur2x2PreconditionerFactory>"

 *** THROWN EXCEPTION ***
 /ascldap/users/xngao/Program/Trilinos/tcad-charon/src/solver/Charon_Schur2x2PreconditionerFactory.cpp:139:

 Throw number = 2

 Throw test that evaluated to true: true

 Schur2x2PreconditionerFactory::buildPreconditionerOperator():  I'm afraid it looks like F is not invertible.
 ************************
p=0 | SOLVE FAILURE: std::exception
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 14.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

This is triggered in Charon_Schur2x2PreconditionerFactory.cpp, where we have

...
else // if invF isn't null
{
  try
  {
    rebuildInverse(*invFactory_, F, invF);
  }
  catch (...)
  {
    // Rethrow any failure as a SolverFailure with a clearer message.
    TEUCHOS_TEST_FOR_EXCEPTION(true, SolverFailure,
      "Schur2x2PreconditionerFactory::buildPreconditionerOperator():  "
      "I'm afraid it looks like F is not invertible.");
  }
} // end if invF is null or not
...

That catch is what's throwing the NOX::Exceptions::SolverFailure, which I think should be caught here, such that LOCA can decrease the step size and try again. Any idea why that catch might not be working? Is MPI_Abort() somehow getting called before NOX can catch the exception?

jmgate commented 6 years ago

@rppawlo, @etphipp, any ideas as to what might be going on above?

etphipp commented 6 years ago

Are you sure SolverFailure is the exception being thrown by Teko? Your output has this just before the abort:

p=0 | SOLVE FAILURE: std::exception

which makes me think std::exception was thrown instead.

jmgate commented 6 years ago

I don't know. It looks like Stratimikos is throwing a CatastrophicSolveFailure in the midst of the Teko::rebuildInverse() in Charon_Schur2x2PreconditionerFactory.cpp. This is getting caught and rethrown as a NOX::Exceptions::SolverFailure, which inherits from std::logic_error. Is it possible the std::exception is getting thrown elsewhere and that's causing everything to bug out? I'm running this case myself now, but it'll take a while to get to the point of failure.
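One way to reason about which handler an exception reaches is a small standalone experiment like the one below. It is only a sketch: SolverFailure here is a hypothetical stand-in that mirrors the inheritance described above, and whichHandler and translate are made-up helpers, not NOX or Charon code.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for NOX::Exceptions::SolverFailure, which per the
// comment above derives from std::logic_error.
struct SolverFailure : std::logic_error {
  using std::logic_error::logic_error;
};

// Reports which handler an exception from `body` lands in. Handlers are tried
// in order, so the most derived type must come first.
std::string whichHandler(const std::function<void()>& body) {
  try {
    body();
    return "none";
  } catch (const SolverFailure&) {
    return "SolverFailure";
  } catch (const std::exception&) {
    return "std::exception";
  } catch (...) {
    return "unknown";
  }
}

// Mimics a catch-and-rethrow layer: whatever `inner` throws is replaced by a
// SolverFailure, as the Charon snippet earlier in the thread does.
void translate(const std::function<void()>& inner) {
  try {
    inner();
  } catch (...) {
    throw SolverFailure("translated failure");
  }
}
```

A plain std::exception thrown directly lands in the std::exception handler, while the same exception routed through translate() lands in the SolverFailure handler. So if a run reports a bare std::exception, it may be worth checking whether the translating layer was actually on the call path of that throw.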

bartlettroscoe commented 6 years ago

@jmgate, if these exceptions are thrown with TEUCHOS_TEST_FOR_EXCEPTION() or one of the related macros, then it will print a throwNumber. That can be used to set a breakpoint to stop right at the place where the exception is being thrown. For details on this see Section "5.11.7 Exception handling and debugging" in:

After you see where that std::exception is thrown from, you should be able to back up from there to trace what is happening by running the program again with the throwNumber decremented by one, and so forth.

rppawlo commented 6 years ago

We really need the stack trace to see what is going on. Can you reproduce from a restart?

bartlettroscoe commented 6 years ago

We really need the stack trace to see what is going on. Can you reproduce from a restart?

If you have a working version of the BinUtils library, you should be able to turn on stack tracing so that when that final exception is caught, you can see the stack trace. See:

jmgate commented 6 years ago

This problem may not actually exist. I was able to run the example in question and the entire continuation run completed successfully without throwing any exceptions. I've asked @suzeygao to do a clean build pointing to the same commits I'm looking at to see if we can reproduce the problem. Sorry to have bothered you all.

jmgate commented 6 years ago

Sorry it's taken me so long to get back to this. I was able to reproduce the problem Suzey's seeing, and was able to reproduce it from a restart so we don't have to wait 30 hours per run to debug. I obtained the stack traces below by running

$ gdb --args ../../driver/charon_mp.exe --i=input.xml --current
(gdb) set pagination off
(gdb) catch throw
(gdb) commands
>backtrace
>continue
>end
(gdb) run

This spits out a stacktrace any time throw is called. The ones below are the ones that appear at the end of the run before Charon quits.

Stacktraces:

Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x2acab130, tinfo=0xa053e68 , dest=0x732f0e2 ) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
75      {
#0  __cxxabiv1::__cxa_throw (obj=0x2acab130, tinfo=0xa053e68 , dest=0x732f0e2 ) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
#1  0x00000000083fe166 in Thyra::AmesosLinearOpWithSolveFactory::initializeOp (this=0x28a25fa0, fwdOpSrc=..., Op=0x29f62e10, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/stratimikos/adapters/amesos/src/Thyra_AmesosLinearOpWithSolveFactory.cpp:344
#2  0x0000000007e908fb in Teko::SolveInverseFactory::rebuildInverse (this=0x28a23ee0, source=..., dest=...) at /workspace/Trilinos/packages/teko/src/Teko_SolveInverseFactory.cpp:152
#3  0x0000000007e4e5a6 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:181
#4  0x0000000005876503 in charon::Schur2x2PreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, op=..., state=...) at /workspace/Trilinos/tcad-charon/src/solver/Charon_Schur2x2PreconditionerFactory.cpp:134
#5  0x0000000007e4bdf0 in Teko::BlockPreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, lo=..., state=...) at /workspace/Trilinos/packages/teko/src/Teko_BlockPreconditionerFactory.cpp:68
#6  0x0000000007e66cc8 in Teko::PreconditionerFactory::initializePrec (this=0x292f1c28, ASrc=..., prec=0x2acaac00, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerFactory.cpp:117
#7  0x0000000007e84593 in Teko::PreconditionerInverseFactory::rebuildInverse (this=0x28a21440, source=..., dest=...) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerInverseFactory.cpp:241
#8  0x0000000007e4e5a6 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:181
#9  0x0000000007e95270 in Teko::StratimikosFactory::initializePrec_Thyra (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:223
#10 0x0000000007e9486c in Teko::StratimikosFactory::initializePrec (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:139
#11 0x00000000075956c7 in NOX::Thyra::Group::updateLOWS (this=0x25ebb330) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:884
#12 0x0000000007593f43 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:739
#13 0x0000000007593a02 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:643
#14 0x000000000746dc6c in LOCA::BorderedSolver::JacobianOperator::applyInverse (this=0x262983c0, params=..., B=..., X=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_JacobianOperator.C:94
#15 0x00000000074f0edc in LOCA::BorderedSolver::LowerTriangularBlockElimination::solve (this=0x7fffffff8fd0, params=..., op=..., B=..., C=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_LowerTriangularBlockElimination.C:100
#16 0x000000000746edd1 in LOCA::BorderedSolver::Bordering::applyInverse (this=0x262a1820, params=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_Bordering.C:208
#17 0x000000000741e854 in LOCA::MultiContinuation::ConstrainedGroup::applyJacobianInverseMultiVector (this=0x26296e10, params=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:674
#18 0x000000000741d8fc in LOCA::MultiContinuation::ConstrainedGroup::computeNewton (this=0x26296e10, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:481
#19 0x0000000007427fb2 in LOCA::MultiContinuation::ExtendedGroup::computeNewton (this=0x262925f8, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ExtendedGroup.C:148
#20 0x00000000075db2b6 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:136
#21 0x00000000075f8e5e in NOX::Direction::Generic::compute (this=0x262b4630, d=..., g=..., s=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Generic.C:59
#22 0x00000000075db684 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:162
#23 0x000000000752678c in NOX::Solver::LineSearchBased::step (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:175
#24 0x0000000007526b3f in NOX::Solver::LineSearchBased::solve (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:234
#25 0x000000000738415f in LOCA::Stepper::start (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Stepper.C:372
#26 0x0000000007403f29 in LOCA::Abstract::Iterator::run (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:122
#27 0x0000000006955c74 in Piro::LOCASolver::evalModelImpl (this=0x25dc2aa0, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/piro/src/Piro_LOCASolver_Def.hpp:197
#28 0x00000000047dc033 in Thyra::ModelEvaluatorDefaultBase::evalModel (this=0x25dc2c30, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support/Thyra_ModelEvaluatorDefaultBase.hpp:685
#29 0x000000000463bfc9 in main (argc=3, argv=0x7fffffffc6b8) at /workspace/Trilinos/tcad-charon/driver/Charon_Main.cpp:775
 Teko: "rebuildInverse" could not construct the inverse operator using "Thyra::AmesosLinearOpWithSolveFactory{solverType=Klu}"

 *** THROWN EXCEPTION ***
 /workspace/Trilinos/packages/stratimikos/adapters/amesos/src/Thyra_AmesosLinearOpWithSolveFactory.cpp:346:

 Throw number = 1

 Throw test that evaluated to true: 0!=err

 Error, NumericFactorization() on amesos solver of type 'Amesos_Klu'
 returned error code -22!
 ************************
Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x2aa8aac0, tinfo=0xbd74320 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4637e20 <_ZNSt9exceptionD1Ev@plt>) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
75      {
#0  __cxxabiv1::__cxa_throw (obj=0x2aa8aac0, tinfo=0xbd74320 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4637e20 <_ZNSt9exceptionD1Ev@plt>) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
#1  0x0000000007e4e722 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:193
#2  0x0000000005876503 in charon::Schur2x2PreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, op=..., state=...) at /workspace/Trilinos/tcad-charon/src/solver/Charon_Schur2x2PreconditionerFactory.cpp:134
#3  0x0000000007e4bdf0 in Teko::BlockPreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, lo=..., state=...) at /workspace/Trilinos/packages/teko/src/Teko_BlockPreconditionerFactory.cpp:68
#4  0x0000000007e66cc8 in Teko::PreconditionerFactory::initializePrec (this=0x292f1c28, ASrc=..., prec=0x2acaac00, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerFactory.cpp:117
#5  0x0000000007e84593 in Teko::PreconditionerInverseFactory::rebuildInverse (this=0x28a21440, source=..., dest=...) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerInverseFactory.cpp:241
#6  0x0000000007e4e5a6 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:181
#7  0x0000000007e95270 in Teko::StratimikosFactory::initializePrec_Thyra (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:223
#8  0x0000000007e9486c in Teko::StratimikosFactory::initializePrec (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:139
#9  0x00000000075956c7 in NOX::Thyra::Group::updateLOWS (this=0x25ebb330) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:884
#10 0x0000000007593f43 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:739
#11 0x0000000007593a02 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:643
#12 0x000000000746dc6c in LOCA::BorderedSolver::JacobianOperator::applyInverse (this=0x262983c0, params=..., B=..., X=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_JacobianOperator.C:94
#13 0x00000000074f0edc in LOCA::BorderedSolver::LowerTriangularBlockElimination::solve (this=0x7fffffff8fd0, params=..., op=..., B=..., C=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_LowerTriangularBlockElimination.C:100
#14 0x000000000746edd1 in LOCA::BorderedSolver::Bordering::applyInverse (this=0x262a1820, params=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_Bordering.C:208
#15 0x000000000741e854 in LOCA::MultiContinuation::ConstrainedGroup::applyJacobianInverseMultiVector (this=0x26296e10, params=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:674
#16 0x000000000741d8fc in LOCA::MultiContinuation::ConstrainedGroup::computeNewton (this=0x26296e10, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:481
#17 0x0000000007427fb2 in LOCA::MultiContinuation::ExtendedGroup::computeNewton (this=0x262925f8, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ExtendedGroup.C:148
#18 0x00000000075db2b6 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:136
#19 0x00000000075f8e5e in NOX::Direction::Generic::compute (this=0x262b4630, d=..., g=..., s=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Generic.C:59
#20 0x00000000075db684 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:162
#21 0x000000000752678c in NOX::Solver::LineSearchBased::step (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:175
#22 0x0000000007526b3f in NOX::Solver::LineSearchBased::solve (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:234
#23 0x000000000738415f in LOCA::Stepper::start (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Stepper.C:372
#24 0x0000000007403f29 in LOCA::Abstract::Iterator::run (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:122
#25 0x0000000006955c74 in Piro::LOCASolver::evalModelImpl (this=0x25dc2aa0, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/piro/src/Piro_LOCASolver_Def.hpp:197
#26 0x00000000047dc033 in Thyra::ModelEvaluatorDefaultBase::evalModel (this=0x25dc2c30, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support/Thyra_ModelEvaluatorDefaultBase.hpp:685
#27 0x000000000463bfc9 in main (argc=3, argv=0x7fffffffc6b8) at /workspace/Trilinos/tcad-charon/driver/Charon_Main.cpp:775
Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x2ac9b0b0, tinfo=0x98089a8 , dest=0x587c132 ) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
75      {
#0  __cxxabiv1::__cxa_throw (obj=0x2ac9b0b0, tinfo=0x98089a8 , dest=0x587c132 ) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
#1  0x0000000005877074 in charon::Schur2x2PreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, op=..., state=...) at /workspace/Trilinos/tcad-charon/src/solver/Charon_Schur2x2PreconditionerFactory.cpp:137
#2  0x0000000007e4bdf0 in Teko::BlockPreconditionerFactory::buildPreconditionerOperator (this=0x292f1c28, lo=..., state=...) at /workspace/Trilinos/packages/teko/src/Teko_BlockPreconditionerFactory.cpp:68
#3  0x0000000007e66cc8 in Teko::PreconditionerFactory::initializePrec (this=0x292f1c28, ASrc=..., prec=0x2acaac00, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerFactory.cpp:117
#4  0x0000000007e84593 in Teko::PreconditionerInverseFactory::rebuildInverse (this=0x28a21440, source=..., dest=...) at /workspace/Trilinos/packages/teko/src/Teko_PreconditionerInverseFactory.cpp:241
#5  0x0000000007e4e5a6 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:181
#6  0x0000000007e95270 in Teko::StratimikosFactory::initializePrec_Thyra (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:223
#7  0x0000000007e9486c in Teko::StratimikosFactory::initializePrec (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:139
#8  0x00000000075956c7 in NOX::Thyra::Group::updateLOWS (this=0x25ebb330) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:884
#9  0x0000000007593f43 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:739
#10 0x0000000007593a02 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:643
#11 0x000000000746dc6c in LOCA::BorderedSolver::JacobianOperator::applyInverse (this=0x262983c0, params=..., B=..., X=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_JacobianOperator.C:94
#12 0x00000000074f0edc in LOCA::BorderedSolver::LowerTriangularBlockElimination::solve (this=0x7fffffff8fd0, params=..., op=..., B=..., C=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_LowerTriangularBlockElimination.C:100
#13 0x000000000746edd1 in LOCA::BorderedSolver::Bordering::applyInverse (this=0x262a1820, params=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_Bordering.C:208
#14 0x000000000741e854 in LOCA::MultiContinuation::ConstrainedGroup::applyJacobianInverseMultiVector (this=0x26296e10, params=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:674
#15 0x000000000741d8fc in LOCA::MultiContinuation::ConstrainedGroup::computeNewton (this=0x26296e10, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:481
#16 0x0000000007427fb2 in LOCA::MultiContinuation::ExtendedGroup::computeNewton (this=0x262925f8, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ExtendedGroup.C:148
#17 0x00000000075db2b6 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:136
#18 0x00000000075f8e5e in NOX::Direction::Generic::compute (this=0x262b4630, d=..., g=..., s=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Generic.C:59
#19 0x00000000075db684 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:162
#20 0x000000000752678c in NOX::Solver::LineSearchBased::step (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:175
#21 0x0000000007526b3f in NOX::Solver::LineSearchBased::solve (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:234
#22 0x000000000738415f in LOCA::Stepper::start (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Stepper.C:372
#23 0x0000000007403f29 in LOCA::Abstract::Iterator::run (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:122
#24 0x0000000006955c74 in Piro::LOCASolver::evalModelImpl (this=0x25dc2aa0, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/piro/src/Piro_LOCASolver_Def.hpp:197
#25 0x00000000047dc033 in Thyra::ModelEvaluatorDefaultBase::evalModel (this=0x25dc2c30, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support/Thyra_ModelEvaluatorDefaultBase.hpp:685
#26 0x000000000463bfc9 in main (argc=3, argv=0x7fffffffc6b8) at /workspace/Trilinos/tcad-charon/driver/Charon_Main.cpp:775
 Teko: "rebuildInverse" could not construct the inverse operator using "Teko::AutoClone"

 *** THROWN EXCEPTION ***
 /workspace/Trilinos/tcad-charon/src/solver/Charon_Schur2x2PreconditionerFactory.cpp:139:

 Throw number = 2

 Throw test that evaluated to true: true

 Schur2x2PreconditionerFactory::buildPreconditionerOperator():  I'm afraid it looks like F is not invertible.
 ************************
Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x2aca3ab0, tinfo=0xbd74320 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4637e20 <_ZNSt9exceptionD1Ev@plt>) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
75      {
#0  __cxxabiv1::__cxa_throw (obj=0x2aca3ab0, tinfo=0xbd74320 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4637e20 <_ZNSt9exceptionD1Ev@plt>) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
#1  0x0000000007e4e722 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:193
#2  0x0000000007e95270 in Teko::StratimikosFactory::initializePrec_Thyra (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:223
#3  0x0000000007e9486c in Teko::StratimikosFactory::initializePrec (this=0x228a53f0, fwdOpSrc=..., prec=0x2628cba0, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:139
#4  0x00000000075956c7 in NOX::Thyra::Group::updateLOWS (this=0x25ebb330) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:884
#5  0x0000000007593f43 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:739
#6  0x0000000007593a02 in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25ebb330, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:643
#7  0x000000000746dc6c in LOCA::BorderedSolver::JacobianOperator::applyInverse (this=0x262983c0, params=..., B=..., X=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_JacobianOperator.C:94
#8  0x00000000074f0edc in LOCA::BorderedSolver::LowerTriangularBlockElimination::solve (this=0x7fffffff8fd0, params=..., op=..., B=..., C=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_LowerTriangularBlockElimination.C:100
#9  0x000000000746edd1 in LOCA::BorderedSolver::Bordering::applyInverse (this=0x262a1820, params=..., F=0x262a0bf0, G=0x2629e340, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_Bordering.C:208
#10 0x000000000741e854 in LOCA::MultiContinuation::ConstrainedGroup::applyJacobianInverseMultiVector (this=0x26296e10, params=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:674
#11 0x000000000741d8fc in LOCA::MultiContinuation::ConstrainedGroup::computeNewton (this=0x26296e10, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:481
#12 0x0000000007427fb2 in LOCA::MultiContinuation::ExtendedGroup::computeNewton (this=0x262925f8, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ExtendedGroup.C:148
#13 0x00000000075db2b6 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:136
#14 0x00000000075f8e5e in NOX::Direction::Generic::compute (this=0x262b4630, d=..., g=..., s=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Generic.C:59
#15 0x00000000075db684 in NOX::Direction::Newton::compute (this=0x262b4630, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:162
#16 0x000000000752678c in NOX::Solver::LineSearchBased::step (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:175
#17 0x0000000007526b3f in NOX::Solver::LineSearchBased::solve (this=0x262a1f70) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:234
#18 0x000000000738415f in LOCA::Stepper::start (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Stepper.C:372
#19 0x0000000007403f29 in LOCA::Abstract::Iterator::run (this=0x2628efd0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:122
#20 0x0000000006955c74 in Piro::LOCASolver::evalModelImpl (this=0x25dc2aa0, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/piro/src/Piro_LOCASolver_Def.hpp:197
#21 0x00000000047dc033 in Thyra::ModelEvaluatorDefaultBase::evalModel (this=0x25dc2c30, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support/Thyra_ModelEvaluatorDefaultBase.hpp:685
#22 0x000000000463bfc9 in main (argc=3, argv=0x7fffffffc6b8) at /workspace/Trilinos/tcad-charon/driver/Charon_Main.cpp:775
SOLVE FAILURE: std::exception

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 14.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

[Thread 0x7fffed50e700 (LWP 16865) exited]
[Thread 0x7ffff7fde880 (LWP 16855) exited]
[Inferior 1 (process 16855) exited with code 016]
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-21.el7.x86_64 glibc-2.17-196.el7_4.2.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libcurl-7.29.0-42.el7_4.1.x86_64 libidn-1.28-4.el7.x86_64 libselinux-2.5-11.el7.x86_64 libssh2-1.4.3-10.el7_2.1.x86_64 nspr-4.13.1-1.0.el7_3.x86_64 nss-3.28.4-15.el7_4.x86_64 nss-softokn-freebl-3.28.3-8.el7_4.x86_64 nss-util-3.28.4-3.el7.x86_64 openldap-2.4.44-5.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 pcre-8.32-17.el7.x86_64
(gdb)


I'm afraid this leaves me puzzled. Where is this std::exception being thrown?

@rppawlo, @eric-c-cyr, @etphipp

jmgate commented 6 years ago

I think I see what's going on here. It looks like there are two other occurrences of solverPtr->solve() in LOCA_Stepper.C that should perhaps be wrapped in the same sort of try/catch block that was our original solution to this problem. I'll try that out, and if it works I'll submit a PR against Trilinos.
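
The idea of wrapping each solve() call so that a thrown failure becomes a rejected continuation step can be sketched roughly as follows. This is a minimal illustration and not LOCA code: `SolverFailure` is a local stand-in for NOX::Exceptions::SolverFailure, `guardedSolve` and `takeStep` are hypothetical names, and halving the step is an arbitrary rule chosen for the example rather than LOCA's actual step-size control.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for NOX::Exceptions::SolverFailure.
struct SolverFailure : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Runs one nonlinear solve. Returns true on success and false if the solve
// signalled failure by throwing SolverFailure. Any other exception type is
// deliberately NOT caught here and propagates to the caller.
inline bool guardedSolve(const std::function<void()>& solve) {
  try {
    solve();
    return true;
  } catch (const SolverFailure&) {
    return false;
  }
}

// Attempts a continuation step of size `step`, halving it after each failed
// solve until it drops below `minStep`. Returns the step size that
// succeeded, or 0.0 if every attempt failed.
inline double takeStep(double step, double minStep,
                       const std::function<void(double)>& solveWithStep) {
  while (step >= minStep) {
    if (guardedSolve([&] { solveWithStep(step); })) {
      return step;
    }
    step *= 0.5;  // illustrative step cut only
  }
  return 0.0;
}
```

Because only `SolverFailure` is caught, an exception of any other type still escapes `guardedSolve`, so every call site that can throw needs either the guard or a throw of the expected type.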

jmgate commented 6 years ago

@etphipp, whenever you're in, could you give me a rundown of how this Stepper class works?

etphipp commented 6 years ago

I’ll be back in the office next week. We can talk then.

-Eric

jmgate commented 6 years ago

Apparently I was running this case from a restart incorrectly. I fired off the restart run correctly, with what I'm hoping are the same parameters that would've existed at the end of the failed run, and this has been running fine for days now. I'm afraid I'm going to have to try the original run from scratch in gdb and see what happens, but that'll take another few days to get to the point of failure.

jmgate commented 6 years ago

I'm afraid I can't debug this case. Running gdb on eight cores, we get an MPI_ABORT almost immediately. Running gdb on a single core, the simulation never triggers the exception that's thrown when running on eight cores without gdb. Instead LOCA starts the continuation run, eventually starts decreasing the step size until it hits the minimum step size, and then gives up. Without the ability to reproduce this exception in a debugger, I don't think there's anything we can do about it.

jmgate commented 6 years ago

So I FINALLY was able to reproduce this problem in a debugger.

Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x2957c920, tinfo=0xc4e0340 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4a7a5c0 <_ZNSt9exceptionD1Ev@plt>) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
75      {
#0  __cxxabiv1::__cxa_throw (obj=0x2957c920, tinfo=0xc4e0340 <_ZTISt9exception@@GLIBCXX_3.4>, dest=0x4a7a5c0 <_ZNSt9exceptionD1Ev@plt>) at ../../../../../src/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:75
#1  0x000000000838c7a4 in Teko::rebuildInverse (factory=..., A=..., invA=...) at /workspace/Trilinos/packages/teko/src/Teko_InverseFactory.cpp:193
#2  0x00000000083d32f2 in Teko::StratimikosFactory::initializePrec_Thyra (this=0x22ec4e00, fwdOpSrc=..., prec=0x25b8bb20, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:223
#3  0x00000000083d28ee in Teko::StratimikosFactory::initializePrec (this=0x22ec4e00, fwdOpSrc=..., prec=0x25b8bb20, supportSolveUse=Thyra::SUPPORT_SOLVE_UNSPECIFIED) at /workspace/Trilinos/packages/teko/src/Teko_StratimikosFactory.cpp:139
#4  0x0000000007ad30aa in NOX::Thyra::Group::updateLOWS (this=0x25559df0) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:906
#5  0x0000000007ad17ed in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25559df0, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:752
#6  0x0000000007ad12ac in NOX::Thyra::Group::applyJacobianInverseMultiVector (this=0x25559df0, p=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-thyra/NOX_Thyra_Group.C:656
#7  0x00000000079aa668 in LOCA::BorderedSolver::JacobianOperator::applyInverse (this=0x29f0a450, params=..., B=..., X=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_JacobianOperator.C:94
#8  0x0000000007a2d8d8 in LOCA::BorderedSolver::LowerTriangularBlockElimination::solve (this=0x7fffffff9190, params=..., op=..., B=..., C=..., F=0x29f4d740, G=0x29f03e90, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_LowerTriangularBlockElimination.C:100
#9  0x00000000079ab7cd in LOCA::BorderedSolver::Bordering::applyInverse (this=0x29f4e4c0, params=..., F=0x29f4d740, G=0x29f03e90, X=..., Y=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_BorderedSolver_Bordering.C:208
#10 0x000000000795b250 in LOCA::MultiContinuation::ConstrainedGroup::applyJacobianInverseMultiVector (this=0x29f089d0, params=..., input=..., result=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:674
#11 0x000000000795a2f8 in LOCA::MultiContinuation::ConstrainedGroup::computeNewton (this=0x29f089d0, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ConstrainedGroup.C:481
#12 0x00000000079649ae in LOCA::MultiContinuation::ExtendedGroup::computeNewton (this=0x2ae3e088, params=...) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_MultiContinuation_ExtendedGroup.C:148
#13 0x0000000007b19330 in NOX::Direction::Newton::compute (this=0x25b941f0, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:136
#14 0x0000000007b36ed8 in NOX::Direction::Generic::compute (this=0x25b941f0, d=..., g=..., s=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Generic.C:59
#15 0x0000000007b196fe in NOX::Direction::Newton::compute (this=0x25b941f0, dir=..., soln=..., solver=...) at /workspace/Trilinos/packages/nox/src/NOX_Direction_Newton.C:162
#16 0x0000000007a63188 in NOX::Solver::LineSearchBased::step (this=0x2c402a60) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:175
#17 0x0000000007a6356f in NOX::Solver::LineSearchBased::solve (this=0x2c402a60) at /workspace/Trilinos/packages/nox/src/NOX_Solver_LineSearchBased.C:236
#18 0x00000000078c242a in LOCA::Stepper::compute (this=0x25b8dfb0, stepStatus=LOCA::Abstract::Iterator::Successful) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Stepper.C:594
#19 0x0000000007940a29 in LOCA::Abstract::Iterator::iterate (this=0x25b8dfb0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:150
#20 0x000000000794096a in LOCA::Abstract::Iterator::run (this=0x25b8dfb0) at /workspace/Trilinos/packages/nox/src-loca/src/LOCA_Abstract_Iterator.C:128
#21 0x0000000006e5472e in Piro::LOCASolver<double>::evalModelImpl (this=0x25461560, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/piro/src/Piro_LOCASolver_Def.hpp:197
#22 0x0000000004c1e84b in Thyra::ModelEvaluatorDefaultBase<double>::evalModel (this=0x254616f0, inArgs=..., outArgs=...) at /workspace/Trilinos/packages/thyra/core/src/support/nonlinear/model_evaluator/client_support/Thyra_ModelEvaluatorDefaultBase.hpp:685
#23 0x0000000004a7e769 in main (argc=3, argv=0x7fffffffc6b8) at /workspace/Trilinos/tcad-charon/driver/Charon_Main.cpp:775

It looks like Teko::rebuildInverse() is throwing the std::exception. Since LOCA_Stepper.C is only catching NOX::Exceptions::SolverFailures, this std::exception makes it through and winds up killing Charon. I could try the following:

diff --git a/packages/teko/src/Teko_StratimikosFactory.cpp b/packages/teko/src/Teko_StratimikosFactory.cpp
index 182819d..d3761c5 100644
--- a/packages/teko/src/Teko_StratimikosFactory.cpp
+++ b/packages/teko/src/Teko_StratimikosFactory.cpp
@@ -1,3 +1,5 @@
+#include "NOX_Exceptions.H"
+
 #include "Teko_StratimikosFactory.hpp"

 #include "Teuchos_Time.hpp"
@@ -220,7 +222,17 @@ void StratimikosFactory::initializePrec_Thyra(
      if(prec_Op==Teuchos::null)
         prec_Op = Teko::buildInverse(*invFactory_,fwdOp);
      else
-        Teko::rebuildInverse(*invFactory_,fwdOp,prec_Op);
+     {
+        try
+        {
+           Teko::rebuildInverse(*invFactory_,fwdOp,prec_Op);
+        }
+        catch (...)
+           TEUCHOS_TEST_FOR_EXCEPTION(true, NOX::Exceptions::SolverFailure,
+              "StratimikosFactory::initializePrec_Thyra():  I'm afraid "      \
+              "something went wrong in Teko::rebuildInverse().  Treating "    \
+              "this as a solver failure.")
+     }
   }

   // construct preconditioner

but that introduces a NOX dependency into Teko that isn't otherwise there. Alternatively, we could try the following in NOX:

diff --git a/packages/nox/src-thyra/NOX_Thyra_Group.C b/packages/nox/src-thyra/NOX_Thyra_Group.C
index 0fc4542..08225ec 100644
--- a/packages/nox/src-thyra/NOX_Thyra_Group.C
+++ b/packages/nox/src-thyra/NOX_Thyra_Group.C
@@ -68,6 +68,7 @@
 #include "NOX_Abstract_MultiVector.H"
 #include "NOX_Thyra_MultiVector.H"
 #include "NOX_Assert.H"
+#include "NOX_Exceptions.H"

 NOX::Thyra::Group::
 Group(const NOX::Thyra::Vector& initial_guess,
@@ -890,6 +890,7 @@ void NOX::Thyra::Group::updateLOWS() const

   this->scaleResidualAndJacobian();

+  try
   {
     NOX_FUNC_TIME_MONITOR("NOX Total Preconditioner Construction");

@@ -932,6 +933,10 @@ void NOX::Thyra::Group::updateLOWS() const
     }

   }
+  catch (...)
+    TEUCHOS_TEST_FOR_EXCEPTION(true, Exceptions::SolverFailure,
+      "NOX::Thyra::Group::updateLOWS():  I'm afraid something went wrong in " \
+      "creating the preconditioner.  Treating this as a solver failure.")

   this->unscaleResidualAndJacobian();

@etphipp, @eric-c-cyr, is that an acceptable solution?
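
The pattern both patches aim for, translating whatever the preconditioner build throws into the one exception type the stepper already handles, can be sketched in isolation. This is only an illustration under stated assumptions: `SolverFailure` is a local stand-in for NOX::Exceptions::SolverFailure and `buildOrFail` is a hypothetical helper, not part of NOX or Teko.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for NOX::Exceptions::SolverFailure.
struct SolverFailure : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Runs `build`. Any exception it throws is rethrown as SolverFailure,
// keeping the original message when one is available.
inline void buildOrFail(const std::function<void()>& build,
                        const std::string& where) {
  try {
    build();
  } catch (const SolverFailure&) {
    throw;  // already the type the caller expects
  } catch (const std::exception& e) {
    throw SolverFailure(where + ": " + e.what());
  } catch (...) {
    throw SolverFailure(where + ": unknown exception");
  }
}
```

Note that a blanket translation like this also converts genuine programming errors into solver failures, and any cleanup that follows the guarded block in the enclosing function is skipped once the translated exception is thrown.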

Failing that, at this point I think my time would be better spent writing a Python wrapper around Charon that'll detect failures in the midst of a LOCA run and restart. Probably should've done that months ago.

etphipp commented 6 years ago

The first approach I think you just can't do, because NOX (indirectly) depends on Teko and you can't have circular dependencies.

I can live with the second approach, but it seems far from ideal. What if a real failure happens that shouldn't be treated as just a solver failure? I think the best solution would be to have a set of exceptions that are independent of Thyra, NOX, Teko, ... that capture this case. Maybe such a thing could be in Teuchos?

jmgate commented 6 years ago

Seems reasonable. Who can we talk to in @trilinos/teuchos about this?

bartlettroscoe commented 6 years ago

@etphipp said:

I think the best solution would be to have a set of exceptions that are independent of Thyra, NOX, Teko, ... that capture this case. Maybe such a thing could be in Teuchos?

@jmgate said:

Seems reasonable. Who can we talk to in @trilinos/teuchos about this?

That seems reasonable. The question is which Teuchos subpackage would they go in? What is the set of exception classes being proposed?

etphipp commented 6 years ago

At this point I think we are talking about just one exception that represents a numerical solver failure (e.g., preconditioner applied to a singular matrix), although one could imagine others.

bartlettroscoe commented 6 years ago

I think the best option is to break out the solver interfaces currently in the TeuchosRemainder subpackage:

packages/teuchos/remainder/src/Trilinos_Details_LinearSolverFactory.cpp
packages/teuchos/remainder/src/Trilinos_Details_LinearSolverFactory.hpp
packages/teuchos/remainder/src/Trilinos_Details_LinearSolver.hpp

and create a new Teuchos subpackage TeuchosSolverInterfaces to move these interfaces into and then create the file:

packages/teuchos/solver_interfaces/src/Teuchos_SolverExceptions.hpp

This would kill two birds with one stone (i.e., moving these solver interfaces into a logical subpackage and providing a place for some generic solver exceptions).

We would then derive some of the exceptions in Thyra, NOX, and other packages from these exceptions.

@mhoemmen, what do you think about this idea?
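
The layering discussed above (a package-neutral exception in a low-level layer, with higher-level packages deriving from it or catching it) might look roughly like the sketch below. All names are hypothetical: the `base`, `precond`, and `nonlinear` namespaces merely stand in for a Teuchos-level layer, a preconditioner package, and a nonlinear solver package, and none of these classes exist in Trilinos under these names.

```cpp
#include <functional>
#include <stdexcept>

// Lowest layer (hypothetical): knows nothing about the packages above it.
namespace base {
struct NumericalSolverFailure : std::runtime_error {
  using std::runtime_error::runtime_error;
};
}  // namespace base

// Preconditioner layer (hypothetical): derives its own failure type.
namespace precond {
struct SingularBlock : base::NumericalSolverFailure {
  using base::NumericalSolverFailure::NumericalSolverFailure;
};
}  // namespace precond

// Nonlinear-solver layer (hypothetical): depends only on `base`, so it can
// recognise the failure without depending on the preconditioner package.
namespace nonlinear {
enum class Status { Converged, Failed };

inline Status solve(const std::function<void()>& buildPreconditioner) {
  try {
    buildPreconditioner();
    return Status::Converged;
  } catch (const base::NumericalSolverFailure&) {
    return Status::Failed;  // numerical failure: caller may cut the step
  }
  // Any other exception type propagates as a real error.
}
}  // namespace nonlinear
```

With this arrangement the catch site can stay narrow, which addresses the concern that a `catch (...)` would hide failures that are not numerical in nature.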

jmgate commented 5 years ago

Ping. Charon still has an open issue waiting on the resolution of this one.

What exactly is the procedure to follow for an application to request something from Trilinos and have it actually happen?

mhoemmen commented 5 years ago

@bartlettroscoe I didn't see that earlier message of yours. I'm OK with the plan that you proposed, to move the solver interface stuff in TeuchosRemainder into a new subpackage, TeuchosSolverInterfaces.

@jmgate wrote:

What exactly is the procedure to follow for an application to request something from Trilinos and have it actually happen?

Best practice currently is to have a Trilinos developer on your team.

mhoemmen commented 5 years ago

PR #3983 adds the new exception class.