sandialabs / Albany

Sandia National Laboratories' Albany multiphysics code
Other
282 stars 89 forks source link

Have at least 1 nightly build with -DENABLE_MESH_DEPENDS_ON_PARAMETERS=ON #413

Closed ikalash closed 5 years ago

ikalash commented 5 years ago

In this build, we should have only LandIce and DemoPDEs enabled, as the option is not guaranteed to work with other physics. A good candidate for such a build is Albany64BitClang which, starting tonight, will only have LandIce and DemoPDEs enables. Lets get LandIce fixed first and make sure all tests run correctly without -DENABLE_MESH_DEPENDS_ON_PARAMETERS=ON before enabling that option.

bartgol commented 5 years ago

There are actually 3 of these funky options: ENABLE_MESH_DEPENDS_ON_PARAMETERS, ENABLE_MESH_DEPENDS_ON_SOLUTION, and ENABLE_PARAMETERS_DEPEND_ON_SOLUTION. I think we should enable all of them. Just to be safe.

mperego commented 5 years ago

As of now I think the most important option to be tested is ENABLE_MESH_DEPENDS_ON_PARAMETERS . Then we could have a build with ENABLE_MESH_DEPENDS_ON_SOLUTION and ENABLE_PARAMETERS_DEPENDS_ON_SOLUTION on at the same time.

ikalash commented 5 years ago

Are you thinking ultimately 1 build with all these options, or separate builds?

mperego commented 5 years ago

Separate. One build with ENABLE_MESH_DEPENDS_ON_PARAMETERS on. Another build with ENABLE_MESH_DEPENDS_ON_SOLUTION and ENABLE_PARAMETERS_DEPENDS_ON_SOLUTION on simultaneously.

ikalash commented 5 years ago

I see. In this case we probably will need to add a build/builds. This would be a good thing to discuss while I'm in town in January.

mperego commented 5 years ago

Yes. But for now I would modify the Albany64BitClang to have ENABLE_MESH_DEPENDS_ON_PARAMETERS on.

ikalash commented 5 years ago

I agree, I'll do that once Luca's PR is in master and we have ascertained that the build is clean w/o the option.

mperego commented 5 years ago

@ikalash, Thanks!

ikalash commented 5 years ago

I think things have stabilized enough where we can try to turn on the

-DENABLE_MESH_DEPENDS_ON_PARAMETERS=ON

In the Albany64BitClang build and see what happens. @mperego , @bartgol : can you please confirm?

mperego commented 5 years ago

go for it! (fingers crossed)

bartgol commented 5 years ago

Yes. Enbaling MESH_DEPENDS_ON_SOLUTION should cause one LandIce test to fail, so we should hold off that. This one should be fine.

ikalash commented 5 years ago

OK, I'll hold off enabling it. Let me know when the test is fixed.

When you say "This one should be fine", what does "one" refer to?

bartgol commented 5 years ago

I meant the option you mentioned, ENABLE_MESH_DEPEND_ON_PARAMETERS.

Coming to think of it, I am not 100% sure it will work. In my tests, the X_DEPEND_ON_Y options were either all ON or all OFF. I'm curious to see if enabling only one is stable.

ikalash commented 5 years ago

It looks like 1 test failed after turning on this option, FO_GIS_GisAdjointSensitivityBasalFrictionTpetra:

http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4312117&build=81385

The issue is FPEs. @bartgol , I remember you had opened an issue on FPEs - is this related? I can try to look at it. Most likely you are using std::pow or something operation like this that does not play nicely with Sacado.

bartgol commented 5 years ago

Yes, this seems to be related to that. Unfortunately I have no idea how to address this FPE issue in general. This is affecting Albany as a whole, and we should discuss how to address it (I won't be able to make the conf call tomorrow, but we should discuss it there).

One solution is to roll our own pow for Fad, where we don't use the if_then_else machinery that Sacado uses (and ultimately causes the throw), but we simply do a normal if/else. The issue in Sacado is moving slowly at best, and I'm afraid a fix may complicate Sacado's design, so I'm not sure if they will address it the way we need it. At least not in the short term. Hence, the need to roll our own pow.

etphipp commented 5 years ago

Please don't roll your own pow operation. No one indicated to me this was any kind of a priority, so I haven't attempted to fix it. It is straightforward to either overload pow() or provide a partial specialization of the expression template for non-simd types that replaces if_then_else() with the ternary operator, which I can do in the next few days.

bartgol commented 5 years ago

@etphipp sorry, I assumed it was a bit of a hassle for you to adjust the code to fix this behavior. I don't like to pressure people to fix things so that my code works, especially when the fix on my side is "relatively simple to implement". So I didn't quite follow up with you.

However, if you think you can get to it in the next few days, I'm more than happy to hold off.

ikalash commented 5 years ago

@bartgol: I just fixed the FPE for this particular example by replacing pow with multiplication, which was possible.

@etphipp : it would be great if you can fix pow() in Sacado. We had close to 100 tests failing due to FPEs with Clang for LCM, and I narrowed it down to someplace deep within the material model, and all related to operations like pow not playing well with Sacado. I ended up just turning LCM off on that build b/c we do not have the time/resources/manpower to fix the problem right now. It would be great to have the issue be resolved on the Sacado side so we can turn it back on.

ikalash commented 5 years ago

@mperego: per your request to have:

Another build with ENABLE_MESH_DEPENDS_ON_SOLUTION and ENABLE_PARAMETERS_DEPENDS_ON_SOLUTION on simultaneously.

it looks like I already have -DENABLE_PARAMETERS_DEPEND_ON_SOLUTION:BOOL=ON on in the builds on my workstation camobap. We can probably just add the option -DENABLE_MESH_DEPENDS_ON_SOLUTION:BOOL=ON to the same build, once @bartgol says it's OK to do so (see above). Please let me know if you disagree.

bartgol commented 5 years ago

I think we could turn that option on, but turn off the test FO_GIS_GisCoupledThicknessShapeOpt. That's the one that fails right now (it didn't before due to multiple evaluators issues). @mperego do you think we can fix that test quickly or should we turn it off for now, so we can start testing the mesh depending on solution right away? I'm for the latter.

mperego commented 5 years ago

I'm not sure when I'll get to fix that test. We might just wait turning on that option until the test is fixed. But I'm fine also disabling the test.

ikalash commented 5 years ago

Either way, just let me know how you want to proceed.

mperego commented 5 years ago

Thanks. Let's wait for now. I'll let you know when the test is fixed.

ikalash commented 5 years ago

OK sounds good.

etphipp commented 5 years ago

Btw, I added a PR in Trilinos that should address the FPE issue:

https://github.com/trilinos/Trilinos/pull/4380

bartgol commented 5 years ago

Thanks @etphipp !!

ikalash commented 5 years ago

@etphipp : I see your PR went in.

Unfortunately we are still encountering the FPEs for the FO_GIS problems, and I don't think it is b/c of pow, as I actually removed the pow a few days ago: http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=81604. Here is a backtrace of the error from gdb:

Program received signal SIGSEGV, Segmentation fault.
0x0000000002fba779 in LandIce::L2ProjectedBoundaryLaplacianResidualBase<PHAL::AlbanyTraits::DistParamDeriv, PHAL::AlbanyTraits, Sacado::Fad::DFad<double> >::evaluateFields(PHAL::Workset&) ()
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 glibc-2.12-1.209.el6_9.1.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-23.el6.x86_64 libcurl-7.19.7-53.el6_9.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 libssh2-1.4.2-2.el6_7.1.x86_64 nspr-4.13.1-1.el6.x86_64 nss-3.28.4-1.el6_9.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.28.4-1.el6_9.x86_64 numactl-2.0.9-2.el6.x86_64 openldap-2.4.40-16.el6.x86_64 openssl-1.0.1e-57.el6.x86_64
#0  0x0000000002fba779 in LandIce::L2ProjectedBoundaryLaplacianResidualBase<PHAL::AlbanyTraits::DistParamDeriv, PHAL::AlbanyTraits, Sacado::Fad::DFad<double> >::evaluateFields(PHAL::Workset&) ()
#1  0x00000000019ec275 in PHX::DagManager<PHAL::AlbanyTraits>::evaluateFields(PHAL::Workset&) ()
#2  0x00000000019cf345 in void PHX::FieldManager<PHAL::AlbanyTraits>::evaluateFields<PHAL::AlbanyTraits::DistParamDeriv>(PHAL::Workset&) ()
#3  0x00000000019c12e9 in Albany::Application::applyGlobalDistParamDerivImpl(double, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::Array<Sacado::ScalarParameterVector<SPL_Traits> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, Teuchos::RCP<Thyra::MultiVectorBase<double> const> const&, Teuchos::RCP<Thyra::MultiVectorBase<double> > const&) ()
#4  0x0000000001a8234c in Albany::DistributedParameterDerivativeOp::Apply(Epetra_MultiVector const&, Epetra_MultiVector&) const ()
#5  0x000000000378b1f0 in Piro::Epetra::NOXSolver::evalModel(EpetraExt::ModelEvaluator::InArgs const&, EpetraExt::ModelEvaluator::OutArgs const&) const ()
#6  0x0000000005648234 in Thyra::EpetraModelEvaluator::evalModelImpl(Thyra::Mode---Type <return> to continue, or q <return> to quit---
lEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const ()
#7  0x00000000018fc0ca in Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const ()
#8  0x000000000180fcc9 in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::Array<bool> const&, bool, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::RCP<Piro::SolutionObserverBase<double, Thyra::VectorBase<double> const> >) ()
#9  0x000000000180f0f7 in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::ParameterList&, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::RCP<Piro::SolutionObserverBase<double, Thyra::VectorBase<double> const> >) ()
#10 0x0000000001802e97 in main ()

It looks like the problem is with the DistParamDeriv specialization. @mperego, @bartgol: any thoughts?

This is in a non-debug Clang build. Interestingly, when I make the build a debug build, the tests do not seem to fail.

If you decide this is not worth tracking down, we should turn off the tests.

@etphipp : I will turn LCM on in the Clang build that had the FPEs, and see if your PR fixed any of them.

ikalash commented 5 years ago

I turned on LCM in the Clang build that had the FPEs, and it looks like there are still a lot of FPEs: http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=81604 .

Here is some sample output from gdb:

Program received signal SIGFPE, Arithmetic exception.
0x000000000406dd07 in Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >::operator=<Sacado::Fad::MultiplicationOp<Sacado::Fad::Expr<Sacado::Fad::MultiplicationOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 glibc-2.12-1.209.el6_9.1.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-23.el6.x86_64 libcurl-7.19.7-53.el6_9.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 libssh2-1.4.2-2.el6_7.1.x86_64 nspr-4.13.1-1.el6.x86_64 nss-3.28.4-1.el6_9.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.28.4-1.el6_9.x86_64 numactl-2.0.9-2.el6.x86_64 openldap-2.4.40-16.el6.x86_64 openssl-1.0.1e-57.el6.x86_64
---Type <return> to continue, or q <return> to quit---
Sacado::Fad::SubtractionOp<Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::DivisionOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> > > ()

I think here Sacado is the culprit, but maybe it's not the pow operation?

bartgol commented 5 years ago

I'm at home, so I can't open cdash. From the error message though, I only see a division op, which could be the problem, no pows. If the problem is indeed the division, then it might not be on sacado. Divisions by zero are on us, I think. I don't see anything in sacado's logic that could generate errors even when the inputs are legit (like it happened before with pow).

As for the LandIce errors, those are definitely not FPEs. I'll look into them.

ikalash commented 5 years ago

Yes, LCM could be all sorts of things like division by 0. The issue is it's all in the material models which are incredibly complex. I think we do not have the resources currently to look into this (@lxmota please correct me if I am wrong), so I will likely just turn off LCM again in that build.

ikalash commented 5 years ago

@bartgol that would be great if you could look into the LandIce errors - thanks!

etphipp commented 5 years ago

@ikalash can you post the whole gdb stacktrace up to the line in LCM triggering the error (or post the line number in LCM)? I would just like to see what the expression is. If there is a division by 0, there isn't anything Sacado can do about it.

ikalash commented 5 years ago

@etphipp yes, I intended to post the whole stacktrace originally but somehow only a snippet got pasted. Here it is:

************************************************************************
-- Nonlinear Solver Step 0 -- 
||F|| = 0.000e+00  step = 0.000e+00  dx = 0.000e+00
************************************************************************

 Phalanx writing graphviz file for graph of Jacobian fill (detail =1)
 Process using 'dot -Tpng -O phalanx_graph' 

Program received signal SIGFPE, Arithmetic exception.
0x000000000405caa7 in Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >::operator=<Sacado::Fad::MultiplicationOp<Sacado::Fad::Expr<Sacado::Fad::MultiplicationOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::SubtractionOp<Sacado::Fad::Expr<Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 glibc-2.12-1.209.el6_9.1.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-23.el6.x86_64 libcurl-7.19.7-53.el6_9.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 libssh2-1.4.2-2.el6_7.1.x86_64 nspr-4.13.1-1.el6.x86_64 nss-3.28.4-1.el6_9.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.28.4-1.el6_9.x86_64 numactl-2.0.9-2.el6.x86_64 openldap-2.4.40-16.el6.x86_64 openssl-1.0.1e-57.el6.x86_64
---Type <return> to continue, or q <return> to quit---c
Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::DivisionOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> > > ()

This is a non-debug build so there are no line numbers. There is more information in issue #406, which includes some line numbers but they are not particularly helpful - gdb was pointing to a LCM::ConstitutiveModelInterface call, which all the material models go through, rather than a particular line in the material model. Likely this is the same problem, but I could try a debug build again if you think the output would be helpful.

ikalash commented 5 years ago

Looks like again things got cut off... the attachment has all the output - sorry about this. gdb_out.txt

etphipp commented 5 years ago

I guess what I was really hoping for was the line number in LCM to see the original expression. But it looks like the only things in the expression are *, -, and /. So division by zero has to be the culprit.

Often these things arise because the initial guess for the nonlinear solver is all zeros, and the residual isn't exactly differentiable there. One thing people have done in the past is put in a simple conditional in their evaluator to check if something is zero before dividing it.

ikalash commented 5 years ago

@etphipp : I changed the nightly build scripts to do a debug build tonight, from which I can get the line numbers tomorrow morning. I agree that it is most likely division by 0.

lxmota commented 5 years ago

Sorry I’m late to this discussion. I’m on travel until Wednesday.

I was under the impression that material models in LCM had guards against division by zero. I need to look into this and see how pervasive this problem is.

ikalash commented 5 years ago

@lxmota : that would be great. The FPEs only show up in the Clang build on the CEE. I can provide you with instructions on how to reproduce the issue. After I've done the backtrace for Eric, I am going to turn the tests off again in the build, as having a large number of failures inevitably leads to new issues getting overlooked in the nightly harness.

ikalash commented 5 years ago

Separate. One build with ENABLE_MESH_DEPENDS_ON_PARAMETERS on. Another build with ENABLE_MESH_DEPENDS_ON_SOLUTION and ENABLE_PARAMETERS_DEPENDS_ON_SOLUTION on simultaneously.

I'm restarting this discussion since @bartgol reminded me of it in an email. We still do not have any nightlies that have ENABLE_MESH_DEPENDS_ON_SOLUTION and ENABLE_PARAMETERS_DEPENDS_ON_SOLUTION on simultaneously (or separately for that matter). Do these options work in master now? If so, maybe we should activate them in one of the nightly builds?

bartgol commented 5 years ago

IIRC, there's only one test that fails (well, two if you count both E and T versions), and it is triggered by ENABLE_MESH_DEPENDS_ON_SOLUTION. The other funky options should not yield any failure.

mperego commented 5 years ago

@ikalash I still think we should wait untill I have time to look into that test, or, if you prefer, we can disable that test. I'm fine either ways.

ikalash commented 5 years ago

@mperego : ok that's fine with me. I bring this up b/c Luca and I are discussing whether to port the funky mesh options to Thyra. The question is: do we wait till the test is fixed in master, or do we port everything, then try to fix it in the new Thyra code once the unpetrify_branch gets merged in.

mperego commented 5 years ago

@ikalash I chatted with Luca. It's probably best to port everything. Thanks!

ikalash commented 5 years ago

@mperego : ok sounds good to me! I'll hold off turning on the options where there is still a failure until that's all fixed.

bartgol commented 5 years ago

We do have a build that test this option now (see #497 for a test that failed precisely because of this option).

ikalash commented 5 years ago

No - see discussion above. A decision was made a few months ago to hold off on turning these options. If the code is ready to turn them on, I'm happy to do that.

bartgol commented 5 years ago

The Clang build does have this option (that's what generated the tests failures in the issue I referenced). We may not have a build with MESH_DEPENDS_ON_SOLUTION though.

ikalash commented 5 years ago

Sorry for the confusion. Yes - I agree. What I was referring to is we have no separate build with ENABLE_MESH_DEPENDS_ON_SOLUTION and ENABLE_PARAMETERS_DEPENDS_ON_SOLUTION on simultaneously, as @mperego wanted to have (see discussion above). That was deferred b/c the option was not expected to work correctly at the time.

bartgol commented 5 years ago

Ah, I see. Yes, that option is currently not working (the tests were passing before #412 went in because the wrong evaluator was computing a field). @mperego what are your plans on the coupled FOThickness problem? I don't like to have issues open related to stuff we are not planning on using/fixing anytime soon, so if there's a path forward, we can keep the issue open, and work on it, otherwise, we should simply disallow using the option altogether (we can leave it, but throw an error at config time, to prevent users from inadvertently enable it and potentially cause errors).

mperego commented 5 years ago

I would simply disable that test.