trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

New Nalu diffs; New Belos CGPositiveDefiniteFailure #12255

Closed spdomin closed 1 year ago

spdomin commented 1 year ago

Description

We have O(50) new diffs showing up in our Nalu nightly gcc testing. One of the tests in our suite has a singular linear system (this is not uncommon in a subset of low-Mach flows, where the number of expected linear solver iterations is <<< number_of_dofs), and I noted the following error:

terminate called after throwing an instance of 'Belos::CGPositiveDefiniteFailure'

This might suggest that some Ifpack2 or Belos work was recently pushed?

I am running a bisect now and will report back as soon as I have the particular push (Nalu was unchanged over this time period; therefore, this is only a Trilinos/develop bisect that requires ~5 steps).

Steps to Reproduce

Good: Trilinos/develop SHA1 cb704921daf0dc1dc00c0d700ec16d1ab32892a9
Bad: Trilinos/develop SHA1 db4a41e9368abd57b8df356453bfd38e8ecc262a

Config: https://github.com/NaluCFD/Nalu/blob/master/build/do-configTrilinos_release
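
For reference, a minimal sketch of how a bisect between these two points might be started (the actual Trilinos/Nalu build and test steps are whatever the nightly harness does and are omitted here):

  cd Trilinos
  git bisect start
  git bisect bad  db4a41e9368abd57b8df356453bfd38e8ecc262a   # first known-bad nightly state
  git bisect good cb704921daf0dc1dc00c0d700ec16d1ab32892a9   # last known-good nightly state
  # git then checks out a midpoint commit; build Trilinos + Nalu and run the diffing test there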

spdomin commented 1 year ago

Hmmm... On my first bisect attempt,

commit 1212aebdf8bd9dfcf2b371d9aef4c4fef09f5516
Merge: 729ea6d 6ed7fb3
Author: trilinos-autotester <trilinos@sandia.gov>
Date:   Thu Sep 7 17:02:51 2023 -0500

I am seeing these types of errors:

/home/spdomin/gitHubWork/scratch_build/install/gcc10.3.0/Trilinos_nightly_release/include/stk_util/parallel/DataExchangeUnknownPatternNonBlocking.hpp:217:39: error: invalid conversion from 'int' to 'MPI_Comm' {aka 'ompi_communicator_t*'} [-fpermissive]
  217 |     MPI_Iprobe(MPI_ANY_SOURCE, m_tag, m_comm, &flag, &status);
/home/spdomin/gitHubWork/scratch_build/install/gcc10.3.0/Trilinos_nightly_release/include/Teuchos_DefaultMpiComm.hpp:844:20: error: invalid 'static_cast' from type 'void*' to type 'int'
  844 |     *rawMpiComm == MPI_COMM_NULL, std::invalid_argument,
      |                    ^~~~~~~~~~~~~

I suppose this suggests a bad Trilinos state?

cgcgcg commented 1 year ago

@spdomin The bad SHA doesn't seem to resolve to a Trilinos commit. We had some history rewriting yesterday (see https://github.com/trilinos/Trilinos/issues/12250). Can you pull a fresh develop and try with that? As for the errors that you posted: those are build errors, not runtime errors. How do they relate to the issue above?

spdomin commented 1 year ago

@cgcgcg, after noting the fails and diffs this morning, I started a git bisect using the "good" and "bad" SHA1s noted above. The particular commit that I shared above was provided to me as part of the standard bisect process.

I can try pulling on a fresh develop to see what happens.

cgcgcg commented 1 year ago

@spdomin I suggested pulling a fresh develop to make it easier on our end to actually see where in the commit history you are.

spdomin commented 1 year ago

@cgcgcg, I ran the following in my current clone, per the Framework issue suggestion:

git fetch origin                 # assuming that origin points at https://github.com/trilinos/Trilinos.git
git checkout develop
git reset --hard origin/develop
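
To confirm that the reset actually picked up the rewritten history, a quick sanity check might look like:

  git rev-parse HEAD             # the local develop tip
  git rev-parse origin/develop   # should print the same SHA1
  git log --oneline -1           # eyeball the tip commit message
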
sebrowne commented 1 year ago

@spdomin please reach out to me via email (or a Teams call) if you need more help confirming that you have resolved the history issue. Apologies for the confusion that this has caused.

spdomin commented 1 year ago

@sebrowne, are these valid SHA1s?

Good: Trilinos/develop SHA1 https://github.com/trilinos/Trilinos/commit/cb704921daf0dc1dc00c0d700ec16d1ab32892a9
Bad: Trilinos/develop SHA1 db4a41e9368abd57b8df356453bfd38e8ecc262a

After all, that is what my regression test summary provided. However, if they are no longer valid because of the rewritten history, how would I proceed with the bisect?

sebrowne commented 1 year ago

Honestly, the best way is probably to compare the commit messages. Can you provide the summary from that bad commit? I can't find it in my backup for some reason (ominous, but hopefully not a big deal). That way we can see which commit it corresponds to in the rewritten history and allow you to proceed with the bisection.

spdomin commented 1 year ago

I suppose the obvious question is: why are we rewriting history?

Otherwise, I am having zero luck with my bisect. I followed your instructions on the other issue and provided the "good" and "bad" SHA1s from my regression test output summary.

I am still seeing errors:

/home/spdomin/gitHubWork/scratch_build/install/gcc10.3.0/Trilinos_nightly_release/include/Teuchos_DefaultMpiComm.hpp: In function 'Teuchos::RCP<Teuchos::MpiComm > Teuchos::createMpiComm(const Teuchos::RCP<const Teuchos::OpaqueWrapper >&, int)':
/home/spdomin/gitHubWork/scratch_build/install/gcc10.3.0/Trilinos_nightly_release/include/Teuchos_DefaultMpiComm.hpp:1849:48: error: invalid 'static_cast' from type 'void*' to type 'int'
 1849 |     if( rawMpiComm.get()!=NULL && *rawMpiComm != MPI_COMM_NULL )

and I am on this Trilinos/develop:

commit eab6d5d32a7ec09901cdbc26b590c59312c2bd84
Author: Brian Kelley <bmkelle@sandia.gov>
Date:   Tue Sep 5 17:58:54 2023 -0600

Stokhos: fix KokkosKernels #1959

PR #12190 actually failed to fix KokkosSparse::spmv for Sacado scalar
types, when building with Kokkos/KokkosKernels develop branch.

This actually fixes that issue (tested with develop and master KokkosKernels)
and is quite a bit cleaner (though it uses version macros that can be
taken out for the 4.2 release).

cgcgcg commented 1 year ago

@spdomin Does the build error appear when using the "bad" commit?

spdomin commented 1 year ago

Okay, using a new clone of Trilinos, I was able to find a reasonable "good" version and today's "bad" version, both of which actually map to the current history. The previous issue was that my SHA1s were not valid.

I will start my bisect now and hope for the best :)

Good: 1dc04e5ac5df41ffa4825a39171a79cc7652d885 (a good state from September 11th, 2023)

Bad (today): 5a91a8852fd21e6e23928d1f1b7b63df46f50183 (a bad state from September 13th, 2023)

spdomin commented 1 year ago

Ugh... Some days, bisects just do not work out well...

This version of Trilinos, committed by @csiefer2, does not build:

commit 2af83a29aa4ef57ef138c801ec6cd3d2a44427f1
Author: Chris Siefert <csiefer@sandia.gov>
Date:   Thu Aug 31 15:13:27 2023 -0600

Tpetra: Getting random pools correct

/home/spdomin/gitHubWork/nightlyBuildAndTest/Trilinos/packages/tpetra/core/src/Tpetra_Details_StaticView.cpp: In static member function 'static Kokkos::Random_XorShift64_Pool& Tpetra::Details::Impl::Static_Random_XorShift64_Pool::getPool(unsigned int)':
/home/spdomin/gitHubWork/nightlyBuildAndTest/Trilinos/packages/tpetra/core/src/Tpetra_Details_StaticView.cpp:449:35: error: 'finalize_cuda_memory' was not declared in this scope
  449 |   Kokkos::push_finalize_hook (finalize_cuda_memory);
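
When a bisect step lands on a commit that does not build, it can be marked as untestable rather than good or bad; a minimal sketch:

  git bisect skip                                            # skip the currently checked-out commit
  git bisect skip 2af83a29aa4ef57ef138c801ec6cd3d2a44427f1   # or name the unbuildable commit explicitly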

cgcgcg commented 1 year ago

Are you bisecting only on PR merges?
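
(For what it's worth, if the goal is to test only the autotester merge commits, newer git can restrict a bisect to first-parent history; a sketch, assuming git 2.29 or later, with placeholder SHA1s:)

  git bisect start --first-parent
  git bisect good <last-good-merge-sha1>
  git bisect bad  <first-bad-merge-sha1>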

spdomin commented 1 year ago

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
2af83a29aa4ef57ef138c801ec6cd3d2a44427f1
96c2ff34a9db098d1bbf3538de4b5b5e0cf94f35
We cannot bisect more!

spdomin commented 1 year ago

@cgcgcg, yes - as far as I know. Really, the process is simple and something I have done for years now on the Trilinos product:

  1. Notice a diff in the nightly test suite that is using a particular Trilinos/develop version.
  2. Backtrack to the last "good" SHA1.
  3. git bisect start
  4. For each Trilinos version that bisect checks out, re-configure and re-build Trilinos.
  5. Re-build Nalu using the current Trilinos version provided by bisect.
  6. git bisect good || bad
  7. Proceed until the culprit is found (a minimal sketch of steps 3-7 follows below).
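
A minimal sketch of steps 3-7, assuming the do-configTrilinos_release script from the Nalu repository is used to configure Trilinos; the Nalu rebuild and test commands are placeholders for the actual nightly scripts:

  git bisect start
  git bisect good <last-good-sha1>
  git bisect bad  <first-bad-sha1>
  # at each step git checks out a candidate commit; then:
  ./do-configTrilinos_release && make -j install   # re-configure and re-build Trilinos (placeholder paths)
  # re-build Nalu against this Trilinos install and run the diffing test (placeholder commands)
  git bisect good      # if the test passed at this commit
  git bisect bad       # if it diffed or failed
  # repeat until git reports "<sha1> is the first bad commit"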

I guess the question should be posed to @csiefer2: Do you feel that any of the code changes you provided in your push contributed to a set of new Nalu diffs?

spdomin commented 1 year ago

Manually, I found that commit ff92fc9d6f1181524e62e3a518ff5e1bb3c363f4 is GOOD, while 96c2ff34a9db098d1bbf3538de4b5b5e0cf94f35 is BAD... However, 2af83a29aa4ef57ef138c801ec6cd3d2a44427f1 does not build (lineage below).

As such, it looks like the diffs were caused by the top two below. I will wait to hear from @csiefer2 - hopefully, someone else has an opinion.

commit 96c2ff34a9db098d1bbf3538de4b5b5e0cf94f35
Author: Chris Siefert <csiefer2@users.noreply.github.com>
Date:   Wed Sep 6 08:20:56 2023 -0600

Update Tpetra_Details_StaticView.cpp

commit 2af83a29aa4ef57ef138c801ec6cd3d2a44427f1
Author: Chris Siefert <csiefer@sandia.gov>
Date:   Thu Aug 31 15:13:27 2023 -0600

Tpetra: Getting random pools correct

commit ff92fc9d6f1181524e62e3a518ff5e1bb3c363f4
Merge: 4614011 1aab00c
Author: trilinos-autotester <trilinos@sandia.gov>
Date:   Mon Sep 11 19:46:15 2023 -0500
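
For reference, the candidate commits between the good and bad points above can be listed with something like:

  git log --oneline ff92fc9d6f1181524e62e3a518ff5e1bb3c363f4..96c2ff34a9db098d1bbf3538de4b5b5e0cf94f35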

cgcgcg commented 1 year ago

That commit does change what some random samples will be, so it can lead to diffs. But the diffs shouldn't be at a significant level.

spdomin commented 1 year ago

The diffs look "reasonable" - while the:

terminate called after throwing an instance of 'Belos::CGPositiveDefiniteFailure'

is more concerning new behavior.

As with past inquiries with diffs, I am happy to accept the diffs if:

  1. We agree that they are due to newly pushed "correct code" - with a full description of why diffs occurred, and
  2. The new code implementation offers lower long-term technical debt.

Best,

cgcgcg commented 1 year ago

Yeah, the exception looks more concerning to me. I don't think that @csiefer2's changes should cause that, but I'll let him chime in. @spdomin: Do you have multigrid logs for one of the failing tests so that we can do a comparison between good and bad?

spdomin commented 1 year ago

I can generate a set of logs (will communicate this via email). Can I achieve this by simply modifying the verbosity option in the XML?

cgcgcg commented 1 year ago

Yes, <Parameter name="verbosity" type="string" value="extreme"/> would be great!

spdomin commented 1 year ago

Bumping up the Chebyshev max iterations from 15 to 100 resolved the one particular throw, at the cost of a somewhat increased preconditioner setup time (20% slower - not ideal for our changing linear systems). Picking an intermediate value of 25 also resulted in a non-catastrophic test, without too much increase in simulation time.

I suppose that I am still waiting to hear the words "yes, we understand this change, and want this new implementation as production capability". Perhaps @jhux2 has an opinion?

best,

spdomin commented 1 year ago

I found another test that is throwing:

Throw number = 68
Throw test that evaluated to true: info > 0
KLU2 numeric factorization failed

This case is a vortex advecting through a periodic domain using a higher-order element type (polynomial order = 2, or a quadratic element that provides, nominally, third-order spatial accuracy).

cgcgcg commented 1 year ago

This typically suggests that KLU2 is failing to factor a singular matrix. Periodic BCs require some special treatment to remove constant modes. Are we somehow failing to propagate that correctly to the coarse grid?

spdomin commented 1 year ago

Yes, this is another one of our problematic systems with a pure set of Neumann BCs. That's just life in the low-Mach world; as you note, there are several methods to circumvent the issue. I will pursue a few of those options for this case.

rppawlo commented 1 year ago

We just started seeing the KLU factorization failures in EMPIRE yesterday too.

csiefer2 commented 1 year ago

@rppawlo Ugh. Lovely. Who knew random numbers were so fickle :)

rppawlo commented 1 year ago

I need to follow up with the team. This may have snuck in earlier. I don't know if we got a new Trilinos sync this week or if the person reporting this error used their own Trilinos.

jhux2 commented 1 year ago

Bumping up the Chebyshev max iterations from 15 to 100 resolved the one particular throw, at the cost of a somewhat increased preconditioner setup time (20% slower - not ideal for our changing linear systems). Picking an intermediate value of 25 also resulted in a non-catastrophic test, without too much increase in simulation time.

@spdomin Are you increasing the value of "chebyshev: eigenvalue max iterations" from 15 to 100? That would indicate that perhaps MueLu's eigenvalue estimate is too low. In turn, a too-low estimate would cause Chebyshev smoothing to let some error modes blow up. Why you wouldn't have seen this previously is a little strange, however.

rppawlo commented 1 year ago

Just wanted to mention: in the past, when we saw this error in EMPIRE, it was related to a nondeterministic algorithm used by MueLu. The failures were random; sometimes the tests passed and sometimes they didn't. To avoid this we used the following in our main solver (and on the sub-blocks):

// force a deterministic (serial) coloring for the multithreaded Gauss-Seidel relaxation
refMaxwell.sublist("smoother: pre params").set("relaxation: mtgs coloring algorithm","serial");
refMaxwell.sublist("smoother: post params").set("relaxation: mtgs coloring algorithm","serial");

Not sure if this is the same problem this time around.

I just checked with the EMPIRE team and it seems to have started showing up using a Trilinos build from Monday, Sept 11. So it is new.

csiefer2 commented 1 year ago

@rppawlo As per git, the random pools change that seems to be causing @spdomin's issues merged on Tuesday, Sept 12, so EMPIRE's problem is likely something different.

spdomin commented 1 year ago

Nalu accepted the diffs. I can report back if we see trouble in future production runs.