sandialabs / Albany

Sandia National Laboratories' Albany multiphysics code

Failing Tempus adjoint tests #1008

Open ikalash opened 1 year ago

ikalash commented 1 year ago

The demoPDEs tests that use adjoints from Tempus started failing yesterday 11/7:

demoPDEs_Advection1D_Scalar_Param_Adjoint_Sens_Explicit
demoPDEs_Advection1D_with_Source_Dist_Param_Adjoint_Sens_Explicit_ConsistentM
demoPDEs_Thermal1D_with_Source_Dist_Param_Adjoint_Sens_Explicit

https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=54052

It looks like there is an Amesos2 KLU2 error that occurs after the time integration is complete, apparently due to a malformed matrix being passed to the solver:

p=0: *** Caught standard std::exception of type 'std::runtime_error' :

 /projects/albany/nightlyAlbanyCDash/repos/Trilinos/packages/amesos2/src/Amesos2_KLU2_def.hpp:222:

 Throw number = 1

 Throw test that evaluated to true: info > 0

 KLU2 numeric factorization failed

p=3: *** Caught standard std::exception of type 'std::runtime_error' :

 /projects/albany/nightlyAlbanyCDash/repos/Trilinos/packages/amesos2/src/Amesos2_KLU2_def.hpp:222:

 Throw number = 1

 Throw test that evaluated to true: info > 0

 KLU2 numeric factorization failed

p=1: *** Caught standard std::exception of type 'std::runtime_error' :

 /projects/albany/nightlyAlbanyCDash/repos/Trilinos/packages/amesos2/src/Amesos2_KLU2_def.hpp:222:

 Throw number = 1

 Throw test that evaluated to true: info > 0

 KLU2 numeric factorization failed

p=2: *** Caught standard std::exception of type 'std::runtime_error' :

 /projects/albany/nightlyAlbanyCDash/repos/Trilinos/packages/amesos2/src/Amesos2_KLU2_def.hpp:222:

 Throw number = 1

 Throw test that evaluated to true: info > 0

I am wondering if this is related to recent changes to Tempus. Tagging @ccober6, who might have ideas about this theory.

I will investigate further.

mperego commented 1 year ago

@ikalash I don't think it's the Tempus changes, as those only touched BDF tests. @cgcgcg could it be the changes in MueLu you pushed two days ago? In these tests we are using MueLu's default options (via Stratimikos).

cgcgcg commented 1 year ago

@mperego Yes. I'm trying to track down what exactly happened. EMPIRE is seeing the same issue. I get different results depending on where the factorization is called...

mperego commented 1 year ago

OK. Thanks for looking into that.

ikalash commented 1 year ago

Yes, thanks @cgcgcg ! If you know what the problem is and are working on it, I will not test this further and will wait for your fix.

cgcgcg commented 1 year ago

Could you pull Trilinos develop and check that it works again?

mperego commented 1 year ago

Our nightly tests are based on Trilinos develop. So if it's OK to wait, we'll know tomorrow morning whether the problem has been fixed.

ikalash commented 1 year ago

@cgcgcg : I can test it today. It's easy enough to do. Please stay tuned.

ikalash commented 1 year ago

I have verified that the tests pass now with a new develop Trilinos. Thanks @cgcgcg ! I will close this issue tomorrow once our CDash is clean.

cgcgcg commented 1 year ago

Nice! For now I just reverted the offending commit. We will try to get this change in again at a later date once we understand what went wrong.

cgcgcg commented 12 months ago

With help from @mperego I was able to build Albany and run demoPDEs_Advection1D_Scalar_Param_Adjoint_Sens_Explicit.

Here is what caused the failure. I printed a stack trace from the point where the factorization fails:

  *******************************************************
  ***** Belos Iterative Solver:  Block Gmres 
  ***** Maximum Iterations: 3
  ***** Block Size: 1
  ***** Residual Test: 
  *****   Test 1 : Belos::StatusTestImpResNorm<>: (2-Norm Res Vec) / (2-Norm Prec Res0), tol = 0.01
  *******************************************************
  Iter 0, [ 1] :    1.000000e+00
  Iter 1, [ 1] :    1.181981e-16
5000  1.000e+00  2.000e-04  0.000e+00  0.000e+00  1.0    0      1.477e-03  
STKDiscretization::writeSolution: writing time 1.000e+00 to index 501 in file advection1D_scalar_param_adjoint_sens_explicit_out.exo
Time = 1.000e+00
         Response[0] = -4.66293670e-16 
         Response[1] = 3.57106495e+00  
============================================================================
  Total runtime = 8.00135953e+00 sec = 1.33355992e-01 min
Fri Nov 10 19:22:59 2023
Time integration complete.

 Traceback (most recent call last):
   File unknown, in _start()
   File unknown, in __libc_start_main()
   File unknown, in main()
   File unknown, in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::ParameterList&, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&)
   File unknown, in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::Array<bool> const&, bool, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::RCP<Piro::SolutionObserverBase<double, Thyra::VectorBase<double> const> >)
   File unknown, in Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
   File unknown, in Piro::TempusSolver<double>::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
   File unknown, in Tempus::IntegratorAdjointSensitivity<double>::advanceTime(double)
   File unknown, in Piro::InvertMassMatrixDecorator<double>::create_W() const
   File unknown, in void Thyra::initializeOp<double>(Thyra::LinearOpWithSolveFactoryBase<double> const&, Teuchos::RCP<Thyra::LinearOpBase<double> const> const&, Teuchos::Ptr<Thyra::LinearOpWithSolveBase<double> > const&, Thyra::ESupportSolveUse)
   File unknown, in Thyra::BelosLinearOpWithSolveFactory<double>::initializeOp(Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Thyra::LinearOpWithSolveBase<double>*, Thyra::ESupportSolveUse) const
   File unknown, in Thyra::BelosLinearOpWithSolveFactory<double>::initializeOpImpl(Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Teuchos::RCP<Thyra::PreconditionerBase<double> const> const&, bool, Thyra::LinearOpWithSolveBase<double>*, Thyra::ESupportSolveUse) const
   File unknown, in Thyra::MueLuPreconditionerFactory<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::initializePrec(Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Thyra::PreconditionerBase<double>*, Thyra::ESupportSolveUse) const
   File unknown, in Teuchos::RCP<MueLu::Hierarchy<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > > MueLu::CreateXpetraPreconditioner<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >(Teuchos::RCP<Xpetra::Matrix<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >, Teuchos::ParameterList const&)
   File unknown, in MueLu::HierarchyManager<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::SetupHierarchy(MueLu::Hierarchy<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >&) const
   File unknown, in MueLu::Hierarchy<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Setup(int, Teuchos::RCP<MueLu::FactoryManagerBase const>, Teuchos::RCP<MueLu::FactoryManagerBase const>, Teuchos::RCP<MueLu::FactoryManagerBase const>)
   File unknown, in MueLu::TopSmootherFactory<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Build(MueLu::Level&) const
   File unknown, in Teuchos::RCP<MueLu::SmootherBase<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >& MueLu::Level::Get<Teuchos::RCP<MueLu::SmootherBase<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, MueLu::FactoryBase const*)
   File unknown, in MueLu::SingleLevelFactoryBase::CallBuild(MueLu::Level&) const
   File unknown, in MueLu::SmootherFactory<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::BuildSmoother(MueLu::Level&, MueLu::PreOrPost) const
   File unknown, in MueLu::DirectSolver<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Setup(MueLu::Level&)
   File unknown, in MueLu::Amesos2Smoother<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Setup(MueLu::Level&)

The matrix is 25x25 and has 75 stored entries, all identically zero.

ikalash commented 12 months ago

Thanks for digging into this @cgcgcg . I think it makes sense to reopen the issue, do you agree? Or we could open a separate Trilinos issue instead.

cgcgcg commented 12 months ago

Sure, let's reopen.

mperego commented 11 months ago

We need to understand why these tests set up a MueLu preconditioner for a singular matrix and then never use it. @ikalash do you have time to look into this?

ikalash commented 11 months ago

> We need to understand why these tests set up a MueLu preconditioner for a singular matrix and then never use it. @ikalash do you have time to look into this?

Perhaps I misunderstood what @cgcgcg wrote, but it seems that it is the matrix at the coarsest grid level that is singular. Is that right? If so, would that suggest that something is wrong with the matrix problem being solved by the AMG?

cgcgcg commented 11 months ago

Sorry, I should have explained better. The problem is so small that this is a one-level method. The matrix is supplied by Albany.

ikalash commented 11 months ago

> Sorry, I should have explained better. The problem is so small that this is a one-level method. The matrix is supplied by Albany.

That's interesting. How was it working before? Was it because an iterative solve was used rather than a direct solve? Is there a branch/fork of Trilinos I can use to reproduce the singularity/failure?

cgcgcg commented 11 months ago

The failure was triggered by MueLu switching the factorization of the coarse grid from first solve to setup. So it seems like Albany is constructing the preconditioner, but then doesn't use it to solve a system. I can provide a patch against Trilinos tomorrow morning that triggers the behavior.

ikalash commented 11 months ago

> The failure was triggered by MueLu switching the factorization of the coarse grid from first solve to setup. So it seems like Albany is constructing the preconditioner, but then doesn't use it to solve a system. I can provide a patch against Trilinos tomorrow morning that triggers the behavior.

That would be great. I won't get to this until next week, so there is no rush.

ikalash commented 10 months ago

I am very sorry, but I still haven't had a chance to work on this. Unfortunately I am really swamped right now, getting ready for two all-hands meetings after the shutdown and working on a few other time-critical things. Does someone else have time to look at this issue? I can pass along instructions from @cgcgcg on how to reproduce it. Maybe we can discuss this at the Albany meeting tomorrow.

ikalash commented 10 months ago

I forgot to say: I am not sure when I will have a chance to look at this.

ikalash commented 10 months ago

Ok, per the discussion at today's Albany meeting, I switched the problematic tests to use Ifpack2 to avoid this issue, allowing @cgcgcg to merge his PR. We can look more into the cause of the issue in the new year, when I or others have more time.

@cgcgcg : very sorry for the delay! You should now be able to merge the change you had reverted earlier due to these test failures.

cgcgcg commented 10 months ago

No problem! Thanks for letting me know!

ikalash commented 10 months ago

Sure! Again, my apologies that it took so long!