Closed ikalash closed 5 years ago
There are actually 3 of these funky options: ENABLE_MESH_DEPENDS_ON_PARAMETERS
, ENABLE_MESH_DEPENDS_ON_SOLUTION
, and ENABLE_PARAMETERS_DEPEND_ON_SOLUTION
. I think we should enable all of them. Just to be safe.
As of now I think the most important option to be tested is ENABLE_MESH_DEPENDS_ON_PARAMETERS
.
Then we could have a build with ENABLE_MESH_DEPENDS_ON_SOLUTION
and ENABLE_PARAMETERS_DEPENDS_ON_SOLUTION
on at the same time.
Are you thinking ultimately 1 build with all these options, or separate builds?
Separate.
One build with ENABLE_MESH_DEPENDS_ON_PARAMETERS
on.
Another build with ENABLE_MESH_DEPENDS_ON_SOLUTION
and ENABLE_PARAMETERS_DEPENDS_ON_SOLUTION
on simultaneously.
I see. In this case we probably will need to add a build/builds. This would be a good thing to discuss while I'm in town in January.
Yes. But for now I would modify the Albany64BitClang to have ENABLE_MESH_DEPENDS_ON_PARAMETERS
on.
I agree, I'll do that once Luca's PR is in master and we have ascertained that the build is clean w/o the option.
@ikalash, Thanks!
I think things have stabilized enough where we can try to turn on the
-DENABLE_MESH_DEPENDS_ON_PARAMETERS=ON
In the Albany64BitClang build and see what happens. @mperego , @bartgol : can you please confirm?
go for it! (fingers crossed)
Yes. Enbaling MESH_DEPENDS_ON_SOLUTION should cause one LandIce test to fail, so we should hold off that. This one should be fine.
OK, I'll hold off enabling it. Let me know when the test is fixed.
When you say "This one should be fine", what does "one" refer to?
I meant the option you mentioned, ENABLE_MESH_DEPEND_ON_PARAMETERS.
Coming to think of it, I am not 100% sure it will work. In my tests, the X_DEPEND_ON_Y options were either all ON or all OFF. I'm curious to see if enabling only one is stable.
It looks like 1 test failed after turning on this option, FO_GIS_GisAdjointSensitivityBasalFrictionTpetra:
http://cdash.sandia.gov/CDash-2-3-0/testDetails.php?test=4312117&build=81385
The issue is FPEs. @bartgol , I remember you had opened an issue on FPEs - is this related? I can try to look at it. Most likely you are using std::pow or something operation like this that does not play nicely with Sacado.
Yes, this seems to be related to that. Unfortunately I have no idea how to address this FPE issue in general. This is affecting Albany as a whole, and we should discuss how to address it (I won't be able to make the conf call tomorrow, but we should discuss it there).
One solution is to roll our own pow for Fad, where we don't use the if_then_else machinery that Sacado uses (and ultimately causes the throw), but we simply do a normal if/else. The issue in Sacado is moving slowly at best, and I'm afraid a fix may complicate Sacado's design, so I'm not sure if they will address it the way we need it. At least not in the short term. Hence, the need to roll our own pow.
Please don't roll your own pow operation. No one indicated to me this was any kind of a priority, so I haven't attempted to fix it. It is straightforward to either overload pow() or provide a partial specialization of the expression template for non-simd types that replaces if_then_else() with the ternary operator, which I can do in the next few days.
@etphipp sorry, I assumed it was a bit of a hassle for you to adjust the code to fix this behavior. I don't like to pressure people to fix things so that my code works, especially when the fix on my side is "relatively simple to implement". So I didn't quite follow up with you.
However, if you think you can get to it in the next few days, I'm more than happy to hold off.
@bartgol: I just fixed the FPE for this particular example by replacing pow with multiplication, which was possible.
@etphipp : it would be great if you can fix pow() in Sacado. We had close to 100 tests failing due to FPEs with Clang for LCM, and I narrowed it down to someplace deep within the material model, and all related to operations like pow not playing well with Sacado. I ended up just turning LCM off on that build b/c we do not have the time/resources/manpower to fix the problem right now. It would be great to have the issue be resolved on the Sacado side so we can turn it back on.
@mperego: per your request to have:
Another build with ENABLE_MESH_DEPENDS_ON_SOLUTION
and ENABLE_PARAMETERS_DEPENDS_ON_SOLUTION
on simultaneously.
it looks like I already have -DENABLE_PARAMETERS_DEPEND_ON_SOLUTION:BOOL=ON
on in the builds on my workstation camobap. We can probably just add the option -DENABLE_MESH_DEPENDS_ON_SOLUTION:BOOL=ON
to the same build, once @bartgol says it's OK to do so (see above). Please let me know if you disagree.
I think we could turn that option on, but turn off the test FO_GIS_GisCoupledThicknessShapeOpt
. That's the one that fails right now (it didn't before due to multiple evaluators issues). @mperego do you think we can fix that test quickly or should we turn it off for now, so we can start testing the mesh depending on solution right away? I'm for the latter.
I'm not sure when I'll get to fix that test. We might just wait turning on that option until the test is fixed. But I'm fine also disabling the test.
Either way, just let me know how you want to proceed.
Thanks. Let's wait for now. I'll let you know when the test is fixed.
OK sounds good.
Btw, I added a PR in Trilinos that should address the FPE issue:
Thanks @etphipp !!
@etphipp : I see your PR went in.
Unfortunately we are still encountering the FPEs for the FO_GIS problems, and I don't think it is b/c of pow, as I actually removed the pow a few days ago: http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=81604. Here is a backtrace of the error from gdb:
Program received signal SIGSEGV, Segmentation fault.
0x0000000002fba779 in LandIce::L2ProjectedBoundaryLaplacianResidualBase<PHAL::AlbanyTraits::DistParamDeriv, PHAL::AlbanyTraits, Sacado::Fad::DFad<double> >::evaluateFields(PHAL::Workset&) ()
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 glibc-2.12-1.209.el6_9.1.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-23.el6.x86_64 libcurl-7.19.7-53.el6_9.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 libssh2-1.4.2-2.el6_7.1.x86_64 nspr-4.13.1-1.el6.x86_64 nss-3.28.4-1.el6_9.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.28.4-1.el6_9.x86_64 numactl-2.0.9-2.el6.x86_64 openldap-2.4.40-16.el6.x86_64 openssl-1.0.1e-57.el6.x86_64
#0 0x0000000002fba779 in LandIce::L2ProjectedBoundaryLaplacianResidualBase<PHAL::AlbanyTraits::DistParamDeriv, PHAL::AlbanyTraits, Sacado::Fad::DFad<double> >::evaluateFields(PHAL::Workset&) ()
#1 0x00000000019ec275 in PHX::DagManager<PHAL::AlbanyTraits>::evaluateFields(PHAL::Workset&) ()
#2 0x00000000019cf345 in void PHX::FieldManager<PHAL::AlbanyTraits>::evaluateFields<PHAL::AlbanyTraits::DistParamDeriv>(PHAL::Workset&) ()
#3 0x00000000019c12e9 in Albany::Application::applyGlobalDistParamDerivImpl(double, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::RCP<Thyra::VectorBase<double> const> const&, Teuchos::Array<Sacado::ScalarParameterVector<SPL_Traits> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, Teuchos::RCP<Thyra::MultiVectorBase<double> const> const&, Teuchos::RCP<Thyra::MultiVectorBase<double> > const&) ()
#4 0x0000000001a8234c in Albany::DistributedParameterDerivativeOp::Apply(Epetra_MultiVector const&, Epetra_MultiVector&) const ()
#5 0x000000000378b1f0 in Piro::Epetra::NOXSolver::evalModel(EpetraExt::ModelEvaluator::InArgs const&, EpetraExt::ModelEvaluator::OutArgs const&) const ()
#6 0x0000000005648234 in Thyra::EpetraModelEvaluator::evalModelImpl(Thyra::Mode---Type <return> to continue, or q <return> to quit---
lEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const ()
#7 0x00000000018fc0ca in Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const ()
#8 0x000000000180fcc9 in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::Array<bool> const&, bool, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::RCP<Piro::SolutionObserverBase<double, Thyra::VectorBase<double> const> >) ()
#9 0x000000000180f0f7 in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::ParameterList&, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::RCP<Piro::SolutionObserverBase<double, Thyra::VectorBase<double> const> >) ()
#10 0x0000000001802e97 in main ()
It looks like the problem is with the DistParamDeriv specialization. @mperego, @bartgol: any thoughts?
This is in a non-debug Clang build. Interestingly, when I make the build a debug build, the tests do not seem to fail.
If you decide this is not worth tracking down, we should turn off the tests.
@etphipp : I will turn LCM on in the Clang build that had the FPEs, and see if your PR fixed any of them.
I turned on LCM in the Clang build that had the FPEs, and it looks like there are still a lot of FPEs: http://cdash.sandia.gov/CDash-2-3-0/viewTest.php?onlyfailed&buildid=81604 .
Here is some sample output from gdb:
Program received signal SIGFPE, Arithmetic exception.
0x000000000406dd07 in Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >::operator=<Sacado::Fad::MultiplicationOp<Sacado::Fad::Expr<Sacado::Fad::MultiplicationOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 glibc-2.12-1.209.el6_9.1.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-23.el6.x86_64 libcurl-7.19.7-53.el6_9.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 libssh2-1.4.2-2.el6_7.1.x86_64 nspr-4.13.1-1.el6.x86_64 nss-3.28.4-1.el6_9.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.28.4-1.el6_9.x86_64 numactl-2.0.9-2.el6.x86_64 openldap-2.4.40-16.el6.x86_64 openssl-1.0.1e-57.el6.x86_64
---Type <return> to continue, or q <return> to quit---
Sacado::Fad::SubtractionOp<Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::DivisionOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> > > ()
I think here Sacado is the culprit, but maybe it's not the pow operation?
I'm at home, so I can't open cdash. From the error message though, I only see a division op, which could be the problem, no pows. If the problem is indeed the division, then it might not be on sacado. Divisions by zero are on us, I think. I don't see anything in sacado's logic that could generate errors even when the inputs are legit (like it happened before with pow).
As for the LandIce errors, those are definitely not FPEs. I'll look into them.
Yes, LCM could be all sorts of things like division by 0. The issue is it's all in the material models which are incredibly complex. I think we do not have the resources currently to look into this (@lxmota please correct me if I am wrong), so I will likely just turn off LCM again in that build.
@bartgol that would be great if you could look into the LandIce errors - thanks!
@ikalash can you post the whole gdb stacktrace up to the line in LCM triggering the error (or post the line number in LCM)? I would just like to see what the expression is. If there is a division by 0, there isn't anything Sacado can do about it.
@etphipp yes, I intended to post the whole stacktrace originally but somehow only a snippet got pasted. Here it is:
************************************************************************
-- Nonlinear Solver Step 0 --
||F|| = 0.000e+00 step = 0.000e+00 dx = 0.000e+00
************************************************************************
Phalanx writing graphviz file for graph of Jacobian fill (detail =1)
Process using 'dot -Tpng -O phalanx_graph'
Program received signal SIGFPE, Arithmetic exception.
0x000000000405caa7 in Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >::operator=<Sacado::Fad::MultiplicationOp<Sacado::Fad::Expr<Sacado::Fad::MultiplicationOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::SubtractionOp<Sacado::Fad::Expr<Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-15.el6_6.2.x86_64 glibc-2.12-1.209.el6_9.1.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-65.el6.x86_64 libcom_err-1.41.12-23.el6.x86_64 libcurl-7.19.7-53.el6_9.x86_64 libidn-1.18-2.el6.x86_64 libselinux-2.0.94-7.el6.x86_64 libssh2-1.4.2-2.el6_7.1.x86_64 nspr-4.13.1-1.el6.x86_64 nss-3.28.4-1.el6_9.x86_64 nss-softokn-freebl-3.14.3-23.3.el6_8.x86_64 nss-util-3.28.4-1.el6_9.x86_64 numactl-2.0.9-2.el6.x86_64 openldap-2.4.40-16.el6.x86_64 openssl-1.0.1e-57.el6.x86_64
---Type <return> to continue, or q <return> to quit---c
Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::DivisionOp<Sacado::Fad::ConstExpr<double>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double const, Sacado::Fad::ViewStorage<double const, 0u, 1u, Sacado::Fad::DFad<double> > >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault> > > ()
This is a non-debug build so there are no line numbers. There is more information in issue #406, which includes some line numbers but they are not particularly helpful - gdb was pointing to a LCM::ConstitutiveModelInterface
call, which all the material models go through, rather than a particular line in the material model. Likely this is the same problem, but I could try a debug build again if you think the output would be helpful.
Looks like again things got cut off... the attachment has all the output - sorry about this. gdb_out.txt
I guess what I was really hoping for was the line number in LCM to see the original expression. But it looks like the only things in the expression are *, -, and /. So division by zero has to be the culprit.
Often these things arise because the initial guess for the nonlinear solver is all zeros, and the residual isn't exactly differentiable there. One thing people have done in the past is put in a simple conditional in their evaluator to check if something is zero before dividing it.
@etphipp : I changed the nightly build scripts to do a debug build tonight, from which I can get the line numbers tomorrow morning. I agree that it is most likely division by 0.
Sorry I’m late to this discussion. I’m on travel until Wednesday.
I was under the impression that material models in LCM had guards against division by zero. I need to look into this and see how pervasive this problem is.
@lxmota : that would be great. The FPEs only show up in the Clang build on the CEE. I can provide you with instructions on how to reproduce the issue. After I've done the backtrace for Eric, I am going to turn the tests off again in the build, as having a large number of failures inevitably leads to new issues getting overlooked in the nightly harness.
Separate. One build with
ENABLE_MESH_DEPENDS_ON_PARAMETERS
on. Another build withENABLE_MESH_DEPENDS_ON_SOLUTION
andENABLE_PARAMETERS_DEPENDS_ON_SOLUTION
on simultaneously.
I'm restarting this discussion since @bartgol reminded me of it in an email. We still do not have any nightlies that have ENABLE_MESH_DEPENDS_ON_SOLUTION
and ENABLE_PARAMETERS_DEPENDS_ON_SOLUTION
on simultaneously (or separately for that matter). Do these options work in master now? If so, maybe we should activate them in one of the nightly builds?
IIRC, there's only one test that fails (well, two if you count both E and T versions), and it is triggered by ENABLE_MESH_DEPENDS_ON_SOLUTION
. The other funky options should not yield any failure.
@ikalash I still think we should wait untill I have time to look into that test, or, if you prefer, we can disable that test. I'm fine either ways.
@mperego : ok that's fine with me. I bring this up b/c Luca and I are discussing whether to port the funky mesh options to Thyra. The question is: do we wait till the test is fixed in master, or do we port everything, then try to fix it in the new Thyra code once the unpetrify_branch gets merged in.
@ikalash I chatted with Luca. It's probably best to port everything. Thanks!
@mperego : ok sounds good to me! I'll hold off turning on the options where there is still a failure until that's all fixed.
We do have a build that test this option now (see #497 for a test that failed precisely because of this option).
No - see discussion above. A decision was made a few months ago to hold off on turning these options. If the code is ready to turn them on, I'm happy to do that.
The Clang build does have this option (that's what generated the tests failures in the issue I referenced). We may not have a build with MESH_DEPENDS_ON_SOLUTION
though.
Sorry for the confusion. Yes - I agree. What I was referring to is we have no separate build with ENABLE_MESH_DEPENDS_ON_SOLUTION and ENABLE_PARAMETERS_DEPENDS_ON_SOLUTION on simultaneously, as @mperego wanted to have (see discussion above). That was deferred b/c the option was not expected to work correctly at the time.
Ah, I see. Yes, that option is currently not working (the tests were passing before #412 went in because the wrong evaluator was computing a field). @mperego what are your plans on the coupled FOThickness problem? I don't like to have issues open related to stuff we are not planning on using/fixing anytime soon, so if there's a path forward, we can keep the issue open, and work on it, otherwise, we should simply disallow using the option altogether (we can leave it, but throw an error at config time, to prevent users from inadvertently enable it and potentially cause errors).
I would simply disable that test.
In this build, we should have only LandIce and DemoPDEs enabled, as the option is not guaranteed to work with other physics. A good candidate for such a build is Albany64BitClang which, starting tonight, will only have LandIce and DemoPDEs enables. Lets get LandIce fixed first and make sure all tests run correctly without -DENABLE_MESH_DEPENDS_ON_PARAMETERS=ON before enabling that option.