Open ikalash opened 1 year ago
@ikalash I don't think it's the Tempus changes, as they only touched BDF tests. @cgcgcg could it be the changes in MueLu that you pushed two days ago? In these tests we are using MueLu (Stratimikos) default options.
@mperego Yes. I'm trying to track down what exactly happened. EMPIRE is seeing the same issue. I get different results depending on where the factorization is called...
OK. Thanks for looking into that.
Yes, thanks @cgcgcg ! If you know what the problem is and are working on it, I will not test this further but will wait for your fix.
Could you pull Trilinos develop and check that it works again?
Our nightly tests are based on Trilinos develop. So if it's OK to wait, we'll know tomorrow morning whether the problem has been fixed.
@cgcgcg : I can test it today. It's easy enough to do. Please stay tuned.
I have verified that the tests pass now with a new develop Trilinos. Thanks @cgcgcg ! I will close this issue tomorrow once our CDash is clean.
Nice! For now I just reverted the offending commit. We will try to get this change in again at a later date once we understand what went wrong.
With help from @mperego I was able to build Albany and run demoPDEs_Advection1D_Scalar_Param_Adjoint_Sens_Explicit.
Here is what caused the failure:
I printed a stacktrace from the point where the factorization fails:
*******************************************************
***** Belos Iterative Solver: Block Gmres
***** Maximum Iterations: 3
***** Block Size: 1
***** Residual Test:
***** Test 1 : Belos::StatusTestImpResNorm<>: (2-Norm Res Vec) / (2-Norm Prec Res0), tol = 0.01
*******************************************************
Iter 0, [ 1] : 1.000000e+00
Iter 1, [ 1] : 1.181981e-16
5000 1.000e+00 2.000e-04 0.000e+00 0.000e+00 1.0 0 1.477e-03
STKDiscretization::writeSolution: writing time 1.000e+00 to index 501 in file advection1D_scalar_param_adjoint_sens_explicit_out.exo
Time = 1.000e+00
Response[0] = -4.66293670e-16
Response[1] = 3.57106495e+00
============================================================================
Total runtime = 8.00135953e+00 sec = 1.33355992e-01 min
Fri Nov 10 19:22:59 2023
Time integration complete.
Traceback (most recent call last):
File unknown, in _start()
File unknown, in __libc_start_main()
File unknown, in main()
File unknown, in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::ParameterList&, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&)
File unknown, in void Piro::Detail::PerformSolveImpl<double, Thyra::VectorBase<double> const, Thyra::MultiVectorBase<double> const>(Thyra::ModelEvaluator<double> const&, Teuchos::Array<bool> const&, bool, Teuchos::Array<Teuchos::RCP<Thyra::VectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> >&, Teuchos::Array<Teuchos::Array<Teuchos::RCP<Thyra::MultiVectorBase<double> const> > >&, Teuchos::RCP<Piro::SolutionObserverBase<double, Thyra::VectorBase<double> const> >)
File unknown, in Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
File unknown, in Piro::TempusSolver<double>::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
File unknown, in Tempus::IntegratorAdjointSensitivity<double>::advanceTime(double)
File unknown, in Piro::InvertMassMatrixDecorator<double>::create_W() const
File unknown, in void Thyra::initializeOp<double>(Thyra::LinearOpWithSolveFactoryBase<double> const&, Teuchos::RCP<Thyra::LinearOpBase<double> const> const&, Teuchos::Ptr<Thyra::LinearOpWithSolveBase<double> > const&, Thyra::ESupportSolveUse)
File unknown, in Thyra::BelosLinearOpWithSolveFactory<double>::initializeOp(Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Thyra::LinearOpWithSolveBase<double>*, Thyra::ESupportSolveUse) const
File unknown, in Thyra::BelosLinearOpWithSolveFactory<double>::initializeOpImpl(Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Teuchos::RCP<Thyra::PreconditionerBase<double> const> const&, bool, Thyra::LinearOpWithSolveBase<double>*, Thyra::ESupportSolveUse) const
File unknown, in Thyra::MueLuPreconditionerFactory<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::initializePrec(Teuchos::RCP<Thyra::LinearOpSourceBase<double> const> const&, Thyra::PreconditionerBase<double>*, Thyra::ESupportSolveUse) const
File unknown, in Teuchos::RCP<MueLu::Hierarchy<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > > MueLu::CreateXpetraPreconditioner<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >(Teuchos::RCP<Xpetra::Matrix<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >, Teuchos::ParameterList const&)
File unknown, in MueLu::HierarchyManager<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::SetupHierarchy(MueLu::Hierarchy<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >&) const
File unknown, in MueLu::Hierarchy<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Setup(int, Teuchos::RCP<MueLu::FactoryManagerBase const>, Teuchos::RCP<MueLu::FactoryManagerBase const>, Teuchos::RCP<MueLu::FactoryManagerBase const>)
File unknown, in MueLu::TopSmootherFactory<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Build(MueLu::Level&) const
File unknown, in Teuchos::RCP<MueLu::SmootherBase<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > >& MueLu::Level::Get<Teuchos::RCP<MueLu::SmootherBase<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > > >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, MueLu::FactoryBase const*)
File unknown, in MueLu::SingleLevelFactoryBase::CallBuild(MueLu::Level&) const
File unknown, in MueLu::SmootherFactory<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::BuildSmoother(MueLu::Level&, MueLu::PreOrPost) const
File unknown, in MueLu::DirectSolver<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Setup(MueLu::Level&)
File unknown, in MueLu::Amesos2Smoother<double, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::Setup(MueLu::Level&)
The matrix is 25x25 and has 75 entries, all of which are identically zero.
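To illustrate why a direct factorization has to fail on such a matrix, here is a minimal sketch in plain Python. It is not the KLU2 algorithm, only dense Gaussian elimination with partial pivoting. The periodic tridiagonal sparsity pattern is an assumption chosen because it gives exactly 75 stored entries for 25 rows; the actual pattern in the Albany test may differ. The name `first_zero_pivot` is hypothetical.

```python
def first_zero_pivot(a, tol=0.0):
    """Return the first column where partial pivoting finds no usable pivot.

    Returns None if elimination completes, i.e. the matrix is
    numerically nonsingular with respect to `tol`.
    """
    n = len(a)
    m = [row[:] for row in a]  # work on a copy
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k at or below row k.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if abs(m[p][k]) <= tol:
            return k  # no pivot available, so the matrix is singular
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            if f != 0.0:
                for j in range(k, n):
                    m[i][j] -= f * m[k][j]
    return None


n = 25
# ASSUMPTION: periodic tridiagonal pattern, 3 stored entries per row (75 total).
pattern = [(i, (i + d) % n) for i in range(n) for d in (-1, 0, 1)]

# Every stored entry is explicitly zero, as reported for the failing matrix.
zero_matrix = [[0.0] * n for _ in range(n)]
for i, j in pattern:
    zero_matrix[i][j] = 0.0

identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
```

With these definitions, `first_zero_pivot(zero_matrix)` reports a missing pivot in the very first column, while `first_zero_pivot(identity)` returns `None`.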
Thanks for digging into this @cgcgcg . I think it makes sense to reopen the issue - do you agree? Unless we want to open a separate Trilinos one.
Sure, let's reopen.
We need to understand why these tests set up a MueLu preconditioner for a singular matrix and then do not use the preconditioner. @ikalash do you have time to look into it?
Perhaps I misunderstood what @cgcgcg wrote, but it seems that it is the matrix at the coarsest grid level that is singular. Is that right? If so, would that suggest that there is something wrong with the matrix problem being solved using the AMG?
Sorry, I should have explained better. The problem is so small that this is a one-level method. The matrix is supplied by Albany.
That's interesting. How was it working before? Was it because an iterative solve rather than a direct solve was done? Is there a branch/fork of Trilinos I can use to see the singularity / failure?
The failure was triggered by MueLu switching the factorization of the coarse grid from first solve to setup. So it seems like Albany is constructing the preconditioner, but then doesn't use it to solve a system. I can provide a patch against Trilinos tomorrow morning that triggers the behavior.
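The effect of moving the factorization can be sketched in a few lines of Python. This is not MueLu or Amesos2 code: the class names `LazyCoarseSolver` and `EagerCoarseSolver`, the `factorize` helper, and the restriction to diagonal matrices are all assumptions made to keep the example self-contained. It only shows that a preconditioner which is set up but never applied hides a singular matrix when factorization is deferred to the first solve, and exposes it when factorization happens during setup.

```python
class SingularMatrixError(RuntimeError):
    """Raised when the stand-in factorization meets a zero pivot."""


def factorize(diagonal):
    # Stand-in for a sparse direct factorization; handles diagonal matrices only.
    if any(d == 0.0 for d in diagonal):
        raise SingularMatrixError("zero pivot encountered")
    return [1.0 / d for d in diagonal]


class LazyCoarseSolver:
    """Defers factorization until the first solve."""

    def __init__(self):
        self._diagonal = None
        self._inverse = None

    def setup(self, diagonal):
        # Only stores the matrix, so a singular matrix goes unnoticed here.
        self._diagonal = list(diagonal)
        self._inverse = None

    def solve(self, rhs):
        if self._inverse is None:
            self._inverse = factorize(self._diagonal)
        return [inv * b for inv, b in zip(self._inverse, rhs)]


class EagerCoarseSolver:
    """Factorizes during setup."""

    def __init__(self):
        self._inverse = None

    def setup(self, diagonal):
        # A singular matrix raises here, even if solve() is never called.
        self._inverse = factorize(diagonal)

    def solve(self, rhs):
        return [inv * b for inv, b in zip(self._inverse, rhs)]
```

Calling `LazyCoarseSolver().setup([0.0] * 25)` succeeds as long as `solve` is never called, whereas `EagerCoarseSolver().setup([0.0] * 25)` raises `SingularMatrixError` immediately.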
That would be great. I won't get to this until next week so it is no rush.
I am very sorry but I still haven't had a chance to work on this. Unfortunately I am really swamped right now getting ready for 2 all-hands meetings after the shutdown and working on a few other time-critical things. Does someone else have the time to look at this issue? I can pass along instructions on how to reproduce it from @cgcgcg . Maybe we can discuss this at the Albany meeting tomorrow.
I forgot to say, I am not sure when I would have a chance to look at this.
Ok, per the discussion at today's Albany meeting, I switched the problematic tests so that they use Ifpack2 to avoid this issue, allowing @cgcgcg to merge his PR. We can look more at the cause of the issue in the new year when I or others have more time.
@cgcgcg : very sorry for the delay! You should now be able to merge the code that you had reverted earlier due to these test failures.
No problem! Thanks for letting me know!
Sure! Again, my apologies that it took so long!
The demoPDEs tests that use adjoints from Tempus started failing yesterday 11/7:
demoPDEs_Advection1D_Scalar_Param_Adjoint_Sens_Explicit
demoPDEs_Advection1D_with_Source_Dist_Param_Adjoint_Sens_Explicit_ConsistentM
demoPDEs_Thermal1D_with_Source_Dist_Param_Adjoint_Sens_Explicit
https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=54052
It looks like there is an Amesos2 KLU2 error that happens after the time integration is complete, apparently due to a messed-up matrix that the solver is given:
I am wondering if this is related to recent changes to Tempus. Tagging @ccober6 who might have ideas about this theory.
I will investigate further.