sandialabs / Albany

Sandia National Laboratories' Albany multiphysics code

Fix failing nightly tests in dashboards! #398

Closed ikalash closed 5 years ago

ikalash commented 5 years ago

This issue is a reincarnation of issue #61.

Per the discussion at yesterday's Albany meeting, I have compiled a spreadsheet with a list of all the tests currently failing in the Albany dashboards (attached). There is a fair bit of information here, namely for each test you can see 1.) what nightly it is failing in, and 2.) how it is failing. It is interesting that not all the tests fail everywhere, and not all the tests fail in the same way across all architectures. Here is a list of the failing tests.

ATO:RegHeaviside_3D ATOT:RegHeaviside_3D CrystalPlasticity_DislocationDensityHardening CrystalPlasticity_MinisolverStep_Newton CrystalPlasticity_MinisolverStep_NewtonLineSearch CrystalPlasticity_MiniSolverStep_TrustRegion CrystalPlasticity_MultiFamily CrystalPlasticity_MultiSlipHard_Implicit CrystalPlasticity_MultiSlipHard_Implicit_Active_Sets CrystalPlasticity_OrientationNotOnMesh CrystalPlasticity_OrientationNotOnMesh_np4 CrystalPlasticity_OrientationOnMesh CrystalPlasticity_OrientationOnMesh_np4 CrystalPlasticity_QuadSlipDislocationDensityTraction CrystalPlasticity_SchwarzBar_modified_np1 CrystalPlasticity_SingleElement2d_ElasticShear2d CrystalPlasticity_SingleElement2d_PlasticShear2d CrystalPlasticity_SingleElement3d_ElasticShear3d CrystalPlasticity_SingleElement3d_ElasticShearRotated3d CrystalPlasticity_SingleSlip_Explicit CrystalPlasticity_SingleSlip_Implicit CrystalPlasticity_SingleSlipHard_Explicit CrystalPlasticity_SingleSlipHard_Implicit CrystalPlasticity_SingleSlipSaturation CrystalPlasticity_ThermallyActivatedSlip Dynamic_ClampedSDBC_NewmarkExplicitAForm_BLMesh_Tempus Dynamics Dynamics_SCOREC_Adapt_Tpetra Dynamics_SCOREC_Tpetra Elasticity3DPressureBC Enthalpy FO_GIS_GisCoupledThicknessTpetra FO_GIS_GisSensSMBwrtBetaTpetra Heat3DPUMI_Tpetra_RegressFail HeliumODEs_HeBubbles HeliumODEs_HeBubblesDecay HydrogenKfieldBC LinComprNS_2DUnvteadyInvPressPulse Mechanics_PlasticityJ2_2D_Traction Mechanics_PlasticityJ2_3D_Traction Mechanics_PorePressureParallelFlow_Serial Mechanics_PorePressureSimple_Serial Mechanics2D_J2 MechanicsPorePressureLocalized_Serial MechanicsTensileCT MechanicsWithHelium_JustMechanics MechanicsWithHelium_MechanicsAndHelium MechanicsWithHelium_MechanicsAndHeliumV2 MechanicsWithHelium_MechanicsAndHydrogen MechanicsWithHelium_MechanicsAndHydrogenV2 MechanicsWithHydrogen_SERIAL MechanicsWithHydrogenBar_no_stabilization MechanicsWithHydrogenBar_requires_stabilization MechanicsWithHydrogenOrthogonal_SERIAL MechanicsWithHydrogenParallel_SERIAL 
MechanicsWithTemperatureLinearThermalExpansion MechWithHydrogenFastPath_channel_diffusion NSVortexShedding2D_TransIRK_Tpetra Parallel_Dynamic_Cubes_Newmark_Piro Pressure_hex8 Pressure_hex8_tip Pressure_hex8_trac Pressure_tetra10 Pressure_tetra10_tip Pressure_tetra10_trac Pressure_tetra4 Pressure_tetra4_tip Pressure_tetra4_trac RigidBody Schwarz_Alternating_Dynamics_CubesInelastic SCOREC_BimetallicStrip_Traction_Tpetra SCOREC_ElastAdapt_Necking_SERIAL_Necking_SERIAL_Tpetra SCOREC_ElastAdapt_Necking_Tpetra SCOREC_ElastAdapt_SPR_Tpetra_postParma SCOREC_ElastAdapt_SPR_Tpetra_postZoltan SCOREC_ElastAdaptSPR_Tpetra SCOREC_Elasticity_Necking_Tpetra SCOREC_Elasticity_NeckT_SM SCOREC_Elasticity_Rename_Tpetra SCOREC_Elasticity_TracT_SM SCOREC_Elasticity_Traction_Tpetra SCOREC_J2Adapt_Tpetra SCOREC_J2Adapt_Verification_Tpetra SCOREC_J2Tet10_Tpetra SCOREC_MechWithTemp_Tpetra SCOREC_MechWithTemp_Unif_Tpetra SCOREC_Restart_NoRestartT SCOREC_Restart_RestartFromFileT SCOREC_Restart_WriteRestartT SCOREC_ThermoMechanicalCan_mech_tpetra SCOREC_ThermoMechanicalCan_thermomech_tpetra SCOREC_ThermoMechanicalCan_timedep_thermomech_tpetra Serial_Dynamic_Cubes_Newmark_Piro StaticElasticity2D_Traction StaticElasticity3D_Traction SteadyHeat2D SteadyHeat2DRobin_Tpetra SteadyHeat2DSS_dudxdudy_Tpetra SteadyHeatConstrainedOpt2D_Dirichlet_Mixed_ParamsT StrongDBC ThermoMechanicalCan_mech ThermoMechanicalCan_thermomech TimeDependentSDBC

It is a lot... you can see with this many tests failing it is difficult to keep track of new things that might get broken in the code.

Some notes summarizing the attached spreadsheet:

In addition to the above failures, the Peridigm build itself fails. @mperego is aware of this, and a decision to fix or drop Peridigm will be made sometime in January.

How should we proceed? Should people check the spreadsheet and claim some tests to fix? The good news is I think if one issue is fixed, a lot of the tests will be fixed (for instance, I suspect the LCM FPE tests all suffer from the same problem).


ikalash commented 5 years ago

A few more things:

jewatkins commented 5 years ago

Thanks for creating this list Irina. I have a few suggestions for preventing this sort of thing from happening again:

  1. Is it possible to narrow down the nightly tests to a few specific builds? (e.g. gcc, intel, clang, cuda)
  2. Can we make these builds easily accessible to developers (at least to sandians) through sems modules or cee machines?
  3. Can we create a list of active developers for each package so that we can ping them when something requires a large commitment?

It'd be nice to create a single wiki page for all of this.

gahansen commented 5 years ago

I suspect that the AlbanyIntel test is testing options that other tests do not. I see a variety of issues - this test enables FPE checking so some tests are failing due to that. It also tests the RPI SCOREC adaptive meshing capabilities - some of these are failing with an error like:

Throw test that evaluated to true: (lclNumRows != A.getLocalLength ())

The Crystal Plasticity tests are diffing.

I'm not sure what to suggest here - I should probably turn my tests over to others if they still have value, as my available time to stay on top of these has vanished. I like @jewatkins's idea to combine similar tests into single tests with a superset of the features that are of value to the overall project. I'll volunteer to remove mine if one of the other tests would like to absorb any features of value in them.

Do we want to check for FPE's in our test suite? I personally would like to have the option to debug using FPE's, but some tests have so many that this ability is no longer there. A couple of us cleaned up all the FPEs in the code a while back - but many have crept back in.

bartgol commented 5 years ago

I think we can give it a try to turn on FPEs in our nightlies. If we get too many red herrings, then we turn them off, otherwise we keep them. Looking at the FPEs enabled in main, I think we should only catch meaningful ones.
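For readers unfamiliar with what "turning on FPEs" means in practice, here is a minimal sketch, assuming a Linux/glibc toolchain. It is not Albany's actual setup code: the helper names `enable_fpe_traps` and `fpe_traps_enabled` are hypothetical, and the choice of trapped exceptions (divide-by-zero, invalid, overflow) is only one guess at what "meaningful ones" could be. `feenableexcept` and `fegetexcept` are GNU extensions, so the sketch falls back to doing nothing on other platforms.

```cpp
#include <cfenv>

#if defined(__GLIBC__)
#include <fenv.h>  // feenableexcept / fegetexcept are GNU extensions
#endif

// Hypothetical helper: ask the FPU to deliver SIGFPE for the exceptions
// that usually indicate a real bug. Inexact and underflow are left
// untrapped because they occur in ordinary, correct arithmetic.
// Returns true if the traps were enabled, false if unsupported.
bool enable_fpe_traps() {
#if defined(__GLIBC__)
  const int wanted = FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW;
  // Clear flags recorded earlier so only new exceptions are reported.
  std::feclearexcept(FE_ALL_EXCEPT);
  return feenableexcept(wanted) != -1;
#else
  return false;  // no portable way to enable traps on this platform
#endif
}

// Hypothetical helper: report whether all three traps are currently active.
bool fpe_traps_enabled() {
#if defined(__GLIBC__)
  const int wanted = FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW;
  return (fegetexcept() & wanted) == wanted;
#else
  return false;
#endif
}
```

Calling `enable_fpe_traps()` once at program start would make the first offending operation stop the run with SIGFPE, which is what lets a debugger such as gdb point at the exact source line.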

ikalash commented 5 years ago

Thanks for the comments thus far. A few replies / follow on comments:

1.) I think it's probably a good idea for someone to take over @gahansen 's tests and monitor / police them regularly. I am somewhat wary about volunteering for this as I already own / monitor 10 nightly tests of Albany... when things get broken, trying to figure out what broke and trying to fix it or finding who to contact to fix it does get time consuming.

2.) I am wary about sweeping the FPE failures under the rug for two reasons:

3.) I could temporarily turn on FPE checking in one of my nightly tests to see if the behavior obtained is similar to Glen's tests.

4.) I see some value in keeping a debug build on the dashboard. I believe @lxmota used to have some debug builds but it appears they are gone - Alejandro, are those gone b/c they were on some of the machines you used to have that are no longer up (procyon, antares, etc.)?

5.) When I spoke with Jerry this morning, he mentioned that for every failure there are 3 people who could try to fix it: the owner of the dashboard, the owner of the package, and/or the author of the commit that broke the tests. I would say the author of the commit is the natural person to fix things in general, who is likely to be identified by the owner of the dashboard - this is fairly straightforward if bugs are caught shortly after being pushed. In the case that we have so many failures that have been happening for months, and the failures could be due to a number of large Albany refactors and/or Trilinos changes, I am not sure how to best proceed with fixing the failures... I'm afraid the package owners may not have the bandwidth to do this in a timely fashion given the nature / number of the failures.

gahansen commented 5 years ago

One other unique feature of the CEE tests that I own is the use of MKL for Blas/Lapack. That could account for some of the diffs seen there. I'd be glad to walk through these tests for anyone considering harvesting them (or just taking them over).

The last dashboard "greenification" might be instructive for us - here is the closed issue

Here is one where we tracked down a few of the FPEs at the time

I have had FPE checking on in these tests for much of my tenure on the Albany project - it was active last summer. Many of these new FPEs have snuck in since the summer. The Albany64BitDbg test, however, is a new test added Feb 1.

lxmota commented 5 years ago

I stopped running the debug builds because they were never clean and flagged bugs that were in either Trilinos or non-LCM-Albany that remained there for years.

Yes, I used to run them from other machines, but they could be easily turned on again on Algol && Proxima.

Antares ran nightly tests on Ubuntu, but that build is no longer used, so I stopped running those as well.

bartgol commented 5 years ago

@ikalash I fixed some errors that were due to the merge of #356. The fixes will be merged with #396. I would suggest merging that asap, to see how much gets fixed.

Note: that PR still has 1 test failing, but I think it is due to something else (perhaps changes in trilinos?). The error looks like

p=0: *** Caught standard std::exception of type 'std::logic_error' :


 Throw number = 1 

 Throw test that evaluated to true: Teuchos::is_null(dec)

 Underlying model in trapezoid decorator does not cast to a Piro::TransientDecorator<Scalar, LocalOrdinal, GlobalOrdinal, Node>

ikalash commented 5 years ago

Which test is this? It looks like it's a test using the trapezoidal rule time integrator for 2nd order in time ODEs written by @gahansen originally in Piro. The thing to do now is to probably switch to the Tempus integrators instead of going through this code path, but I think some tests still use it (SCOREC ones perhaps). I find it unlikely that anyone has touched that Piro code recently.

bartgol commented 5 years ago

It's Dynamics, from LCM.

ikalash commented 5 years ago

Hmmm. I would say (and @amota will agree) that we don't care about the 2nd order integrators in Piro and should just switch over all the tests to Tempus that do not use it already. The RPI folks might have some need for the Piro integrators but I would say in core LCM applications, we do not. I'd still be interested to understand why this is failing all of a sudden.

I think let's merge your stuff in, @bartgol, and see how much it fixes.

I talked with @amota today and thought about it, and I think it is worthwhile to understand why the FPE and other failures started, and to try to fix them. I guess I am volunteering to look at this since I do not think anyone else will... I can start by adding an FPE check on build to my own nightlies. I am hoping some of the issues can be thrown to Trilinos folks based on the errors within Trilinos, and that gdb can point to the locations of the FPEs... but maybe it is harder than this.

@bartgol, I would propose that you push all your remaining refactor stuff to master when it's ready; then I can look at debugging the tests. Maybe it is best to wait until the meeting next week so we can prioritize tests / builds.

ikalash commented 5 years ago

I created a build on my Fedora 28 workstation which uses gcc-8.2.1 with FPE check = ON in Albany. It is interesting that only 3 tests fail:

Elasticity3DPressureBC Pressure_tetra4 Pressure_tetra10

So it seems the numerous failures in @gahansen 's build are somewhat dependent on the compiler (Clang, Intel)?

bartgol commented 5 years ago

That's bad, since it forces us to debug with specific compilers. And by the way, I care about clang and intel builds more: the former because it's the most standard-compliant compiler, and the latter because it's the one you would use on a cpu performance run.

ikalash commented 5 years ago

Right. I agree that clang and intel should not be overlooked. I think a lot of folks care about Intel in particular. @gahansen are you able to call in to the meeting next Monday? I think it would be good for you to be there to discuss / make a decision about how to proceed with the testing (what testing to do, who to pass your tests off to, what to try to fix).

ikalash commented 5 years ago

I have created an all-debug build on my machine with FPEs enabled. It will start nightly tomorrow. For one of the tests that failed with FPEs, I ran gdb and here is the error:

Thread 1 "AlbanyT" received signal SIGFPE, Arithmetic exception.
0x00007ffff6a2d53d in Intrepid2::Kernels::inv_scalar_mult_mat<Kokkos::DynRankView<double, Kokkos::LayoutStride, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0> >, double, Kokkos::DynRankView<double, Kokkos::LayoutStride, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0> > > (B=..., alpha=0, A=...)
    at /home/ikalash/nightlyAlbanyTests/Results/Trilinos/build-dbg/install/include/Intrepid2_Kernels.hpp:514
514           A(i,j) = B(i,j)/alpha;

I think the problem is alpha can be 0, so there is a division by 0 in Intrepid2. This may explain why in Glen's debug build, an exception was being thrown in Intrepid2. @mperego can you please look into this / help to get it fixed? Should I open a Trilinos issue?
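To make the failure mode concrete, here is a minimal sketch of a caller-side guard, under stated assumptions. `inv_scalar_mult` below is a hypothetical stand-in written with `std::vector`; it is not the Intrepid2 kernel or its API, and it is not the fix that was eventually committed. It only illustrates rejecting `alpha == 0` before the element-wise division shown in the trace is reached.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an "A = B / alpha" kernel on flat storage.
// Instead of dividing by zero (which raises SIGFPE when traps are on and
// silently produces inf/NaN when they are off), it fails loudly up front.
void inv_scalar_mult(std::vector<double>& A,
                     const std::vector<double>& B,
                     double alpha) {
  if (alpha == 0.0) {
    throw std::domain_error("inv_scalar_mult: alpha is zero");
  }
  if (A.size() != B.size()) {
    throw std::invalid_argument("inv_scalar_mult: size mismatch");
  }
  // Both checks pass, so A is only modified for valid input.
  for (std::size_t k = 0; k < B.size(); ++k) {
    A[k] = B[k] / alpha;
  }
}
```

Whether such a check belongs in the library or in the calling code depends on whether a zero `alpha` is ever a legitimate input, which is exactly the question that needs to be settled before filing a Trilinos issue.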

ikalash commented 5 years ago

Here is another cause of FPEs, this time in Sacado:

Thread 1 "Albany" received signal SIGFPE, Arithmetic exception.
0x00007fffef89374c in Sacado::Fad::Expr<Sacado::Fad::PowerOp<Sacado::Fad::Expr<Sacado::Fad::SubtractionOp<Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::Expr<Sacado::Fad::GeneralFad<double, Sacado::Fad::DynamicStorage<double, double> >, Sacado::Fad::ExprSpecDefault> >, Sacado::Fad::ExprSpecDefault>, Sacado::Fad::ConstExpr<double> >, Sacado::Fad::ExprSpecDefault>::fastAccessDx (this=0x7fffffff2e30, i=0)
    at /home/ikalash/nightlyAlbanyTests/Results/Trilinos/build-dbg/install/include/Sacado_Fad_Ops.hpp:655

Looks like there is another possible division by 0. Probably I should open a Trilinos issue for this too.

bartgol commented 5 years ago

Does this have to do with us feeding bad inputs to Intrepid2/Sacado or is it an internal error in those packages? In the former case, we should check why our inputs are buggy, while in the latter case it's a Trilinos issue.

By the way, how can the test be fine in RELEASE mode if we have a division by 0?!?

ikalash commented 5 years ago

@bartgol : I agree with you, we should investigate more if it's Trilinos or our usage of Trilinos. Regarding the Intrepid2 issue in particular: from the dashboard it appears the problem started on 11/9. We need to check if anything was pushed to Albany that day that would have started the issue.

Regarding what happens in release mode: the tests DO die in some builds if you look at my spreadsheet - the Intel and Clang ones in particular. I think behavior with FPEs can depend on the compiler. Depending on where the NaN is and what is done with it, the code can actually run to completion with some compilers despite there being an FPE.
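The point about running to completion can be illustrated with a small sketch. It assumes IEEE 754 arithmetic with exception trapping left off (the usual default) and uses only the standard `<cfenv>` flag interface; `checked_divide` and `DivisionReport` are hypothetical names, not anything in Albany or Trilinos. The division by zero does not stop the program: it sets a status flag and yields inf or NaN, which then propagates until something notices.

```cpp
#include <cfenv>
#include <cmath>

// Hypothetical result record: the quotient plus which FP flags were raised.
struct DivisionReport {
  double value;
  bool div_by_zero;
  bool invalid;
};

// Divide without trapping, then inspect the sticky exception flags.
// The volatile locals keep the compiler from folding the division at
// compile time or moving it across the flag calls.
DivisionReport checked_divide(double numerator, double denominator) {
  volatile double n = numerator;
  volatile double d = denominator;
  std::feclearexcept(FE_ALL_EXCEPT);
  volatile double q = n / d;
  const int raised = std::fetestexcept(FE_DIVBYZERO | FE_INVALID);
  return {q, (raised & FE_DIVBYZERO) != 0, (raised & FE_INVALID) != 0};
}
```

With traps off, `checked_divide(1.0, 0.0)` returns infinity with the divide-by-zero flag set, and `checked_divide(0.0, 0.0)` returns NaN with the invalid flag set; in both cases execution simply continues, which is consistent with a test passing in one build and dying in another.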

bartgol commented 5 years ago

Makes sense.

mperego commented 5 years ago

@ikalash is the Intrepid2 issue still there? The one associated with this error message:

Thread 1 "AlbanyT" received signal SIGFPE, Arithmetic exception.
0x00007ffff6a2d53d in Intrepid2::Kernels::inv_scalar_mult_mat<Kokkos::DynRankView<double, Kokkos::LayoutStride, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0> >, double, Kokkos::DynRankView<double, Kokkos::LayoutStride, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, Kokkos::MemoryTraits<0> > > (B=..., alpha=0, A=...)
    at /home/ikalash/nightlyAlbanyTests/Results/Trilinos/build-dbg/install/include/Intrepid2_Kernels.hpp:514
514           A(i,j) = B(i,j)/alpha;

ikalash commented 5 years ago

@mperego : I believe my commit last night fixed that error.

mperego commented 5 years ago

@ikalash thanks!

ikalash commented 5 years ago

Things are reasonably clean finally, so closing this issue. There are still a couple failing tests in some platforms, but separate issues to address those exist.