Closed by spdomin 1 year ago
Hmmm... On my first bisect attempt,
commit 1212aebdf8bd9dfcf2b371d9aef4c4fef09f5516
Merge: 729ea6d 6ed7fb3
Author: trilinos-autotester <trilinos@sandia.gov>
Date:   Thu Sep 7 17:02:51 2023 -0500
I am seeing these types of errors:
//home/spdomin/gitHubWork/scratch_build/install/gcc10.3.0/Trilinos_nightly_release/include/stk_util/parallel/DataExchangeUnknownPatternNonBlocking.hpp:217:39: error: invalid conversion from 'int' to 'MPI_Comm' {aka 'ompi_communicator_t*'} [-fpermissive]
217 | MPI_Iprobe(MPI_ANY_SOURCE, m_tag, m_comm, &flag, &status);
//home/spdomin/gitHubWork/scratch_build/install/gcc10.3.0/Trilinos_nightly_release/include/Teuchos_DefaultMpiComm.hpp:844:20: error: invalid 'static_cast' from type 'void*' to type 'int'
844 | *rawMpiComm == MPI_COMM_NULL, std::invalid_argument,
| ^~~~~~~~~~~~~
I suppose this suggests a bad Trilinos state?
@spdomin The bad SHA doesn't seem to resolve to a Trilinos commit. We had some history rewriting yesterday. (See https://github.com/trilinos/Trilinos/issues/12250.) Can you pull a fresh develop and try with that? For the errors that you posted: those are build errors, not runtime errors. How does that relate to the issue above?
@cgcgcg, after noting the fails and diffs this morning, I ran a "git bisect" using the "good" and "bad" SHA1s noted above. The particular commit that I shared above was provided to me as part of the standard bisect process.
I can try pulling on a fresh develop to see what happens.
@spdomin I suggested pulling a fresh develop to make it easier on our end to actually see where in the commit history you are.
@cgcgcg, I processed the following, as per the Framework issue suggestion, in my current clone:
git fetch origin (assuming that your origin is https://github.com/trilinos/Trilinos.git)
git checkout develop
git reset --hard origin/develop
@spdomin please reach out to me via email (or a Teams call) if you need more help confirming that you have resolved the history issue. Apologies for the confusion that this has caused.
@sebrowne, are these valid SHA1s?
Good: Trilinos/develop SHA1: https://github.com/trilinos/Trilinos/commit/cb704921daf0dc1dc00c0d700ec16d1ab32892a9
Bad: Trilinos/develop SHA1: db4a41e9368abd57b8df356453bfd38e8ecc262a
After all, that is what my regression test summary provided. However, if they are not valid since the history was confused, then how would I proceed with the bisect?
Honestly the best way is probably to compare the commit messages. Can you provide the summary from that bad commit? I can't find it in my backup for some reason (ominous but hopefully not a big deal). That way we can see what the rewritten history version is and allow you to proceed with bisection on the changed history.
I suppose the obvious question is: why are we re-writing history?
Otherwise, I am having zero luck on my bisect. I followed your instructions on the other issue, and provided the "good" and "bad" SHA1s from my regression test output summary.
I am still seeing errors:
/home/spdomin/gitHubWork/scratch_build/install/gcc10.3.0/Trilinos_nightly_release/include/Teuchos_DefaultMpiComm.hpp: In function 'Teuchos::RCP<Teuchos::MpiComm
and I am on this Trilinos/develop:
commit eab6d5d32a7ec09901cdbc26b590c59312c2bd84
Author: Brian Kelley <bmkelle@sandia.gov>
Date:   Tue Sep 5 17:58:54 2023 -0600
Stokhos: fix KokkosKernels #1959
PR #12190 actually failed to fix KokkosSparse::spmv for Sacado scalar
types, when building with Kokkos/KokkosKernels develop branch.
This actually fixes that issue (tested with develop and master KokkosKernels)
and is quite a bit cleaner (though it uses version macros that can be
taken out for the 4.2 release).
@spdomin Does the build error appear when using the "bad" commit?
Okay, using a new clone of Trilinos, I was able to find a reasonable "good" and today's "bad" version that actually maps to the current history. The previous issue was that my SHA1s were not valid.
I will start my bisect now and hope for the best :)
Try this to find a good state:
Today's bad is:
Ugh... Some days, bisects just do not work out well...
This version of Trilinos, which was committed by @csiefer2, is not building:
commit 2af83a29aa4ef57ef138c801ec6cd3d2a44427f1
Author: Chris Siefert <csiefer@sandia.gov>
Date:   Thu Aug 31 15:13:27 2023 -0600
Tpetra: Getting random pools correct
/home/spdomin/gitHubWork/nightlyBuildAndTest/Trilinos/packages/tpetra/core/src/Tpetra_Details_StaticView.cpp: In static member function 'static Kokkos::Random_XorShift64_Pool
Are you bisecting only on PR merges?
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
2af83a29aa4ef57ef138c801ec6cd3d2a44427f1
96c2ff34a9db098d1bbf3538de4b5b5e0cf94f35
We cannot bisect more!
@cgcgcg, yes - as far as I know. Really, the process is simple and something I have done for years now on the Trilinos product.
git bisect start
I guess the question should be posed to @csiefer2: Do you feel that any of the code changes you provided in your push contributed to a set of new Nalu diffs?
Manually, I found that commit ff92fc9d6f1181524e62e3a518ff5e1bb3c363f4 is GOOD, while 96c2ff34a9db098d1bbf3538de4b5b5e0cf94f35 is BAD... However, 2af83a29aa4ef57ef138c801ec6cd3d2a44427f1 does not build (lineage below).
As such, it looks like the diffs were caused by the top two below. I will wait to hear from @csiefer2 - hopefully, someone else has an opinion.
commit 96c2ff34a9db098d1bbf3538de4b5b5e0cf94f35
Author: Chris Siefert <csiefer2@users.noreply.github.com>
Date:   Wed Sep 6 08:20:56 2023 -0600

Update Tpetra_Details_StaticView.cpp

commit 2af83a29aa4ef57ef138c801ec6cd3d2a44427f1
Author: Chris Siefert <csiefer@sandia.gov>
Date:   Thu Aug 31 15:13:27 2023 -0600

Tpetra: Getting random pools correct

commit ff92fc9d6f1181524e62e3a518ff5e1bb3c363f4
Merge: 4614011 1aab00c
Author: trilinos-autotester <trilinos@sandia.gov>
Date:   Mon Sep 11 19:46:15 2023 -0500
That commit does change what some random samples will be. So it can lead to diffs. But it shouldn't be at a significant level.
The diffs look "reasonable" - while the:
terminate called after throwing an instance of 'Belos::CGPositiveDefiniteFailure'
is more concerning new behavior.
As with past inquiries with diffs, I am happy to accept the diffs if:
Best,
Yeah, the exception looks more concerning to me. I don't think that @csiefer2's changes should cause that, but I'll let him chime in. @spdomin : Do you have multigrid logs for one of the failing tests so that we can do a comparison between good and bad?
I can generate a set of logs (will communicate this via email). Can I achieve this by simply modifying the verbosity option in the xml?
Yes, <Parameter name="verbosity" type="string" value="extreme"/>
would be great!
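For reference, a minimal sketch of the programmatic equivalent, in case the MueLu parameter list is built in C++ rather than read from the XML file (the function name is just illustrative):

#include <Teuchos_ParameterList.hpp>

// Illustrative sketch only: the same setting as the XML parameter above, for a
// MueLu parameter list assembled in code instead of read from a .xml file.
Teuchos::ParameterList makeVerboseMueLuParams() {
  Teuchos::ParameterList mueluParams;
  mueluParams.set("verbosity", "extreme");  // prints per-level setup detail for the good/bad comparison
  return mueluParams;
}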
Bumping up the Chebyshev max iterations from 15 to 100 resolved the one particular throw, at the cost of a somewhat increased preconditioner setup time (20% slower, which would not be ideal for our changing linear systems). Picking an intermediate value of 25 also resulted in a non-catastrophic test, with not too much increase in simulation time.
I suppose that I am still waiting to hear the words "yes, we understand this change, and want this new implementation as production capability". Perhaps @jhux2 has an opinion?
best,
I found another test that is throwing:
Throw number = 68
Throw test that evaluated to true: info > 0
KLU2 numeric factorization failed
This case is a vortex advecting through a periodic domain using a higher-order element type (polynomial order = 2, or a quadratic element that provides, nominally, third-order spatial accuracy).
This typically suggests that KLU2 is failing to factor a singular matrix. Periodic BCs require some special treatment to remove constant modes. Are we failing to somehow propagate that correctly to the coarse grid?
Yes, this is another one of our problematic systems that offers a pure set of Neumann bcs. It's kind of life in the low-Mach world with, as you note, several methods to circumvent the issue. I will pursue a few of those options for this case.
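For reference, a minimal sketch of one such workaround (not necessarily what Nalu does): subtract the global mean from the right-hand side so the pure-Neumann system is solved in the space orthogonal to the constant mode. The helper name and the default Tpetra template parameters are assumptions for illustration.

#include <Teuchos_Array.hpp>
#include <Tpetra_Vector.hpp>

// Hypothetical helper: project the constant mode out of the RHS of a
// singular, pure-Neumann pressure system before handing it to the solver.
using Vec = Tpetra::Vector<double>;

void removeConstantMode(Vec& b) {
  Teuchos::Array<double> mean(1);
  b.meanValue(mean());            // global mean of b across all ranks
  Vec ones(b.getMap());           // the constant vector
  ones.putScalar(1.0);
  b.update(-mean[0], ones, 1.0);  // b <- b - mean*1, i.e., orthogonal to the constant mode
}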
We just started seeing the KLU factorization failures in EMPIRE yesterday too.
@rppawlo Ugh. Lovely. Who knew random numbers were so fickle :)
I need to follow up with the team. This may have snuck in earlier. I don't know if we got a new trilinos sync this week or if the one reporting this error used their own trilinos.
Bumping up the Chebyshev max iterations from 15 to 100 resolved the one particular throw, at the cost of a somewhat increased preconditioner setup time (20% slower, which would not be ideal for our changing linear systems). Picking an intermediate value of 25 also resulted in a non-catastrophic test, with not too much increase in simulation time.
@spdomin Are you increasing the value of "chebyshev: eigenvalue max iterations" from 15 to 100? That would indicate that perhaps MueLu's estimate is too low. In turn, that would cause Chebyshev smoothing to cause some error modes to blow up. Why you wouldn't have seen this previously is a little strange, however.
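For concreteness, a sketch of where that parameter would sit if the smoother options were set programmatically; the actual sublist layout in the Nalu XML may differ, and the value here is just the intermediate one discussed above.

#include <Teuchos_ParameterList.hpp>

// Illustrative sketch: raise the power-method iteration count that Ifpack2's
// Chebyshev smoother uses to estimate the largest eigenvalue, via MueLu's
// "smoother: params" sublist.
Teuchos::ParameterList makeChebyshevSmootherParams() {
  Teuchos::ParameterList mueluParams;
  mueluParams.set("smoother: type", "CHEBYSHEV");
  Teuchos::ParameterList& cheby = mueluParams.sublist("smoother: params");
  cheby.set("chebyshev: eigenvalue max iterations", 25);
  return mueluParams;
}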
Just wanted to mention: in the past, when we saw this error in EMPIRE, it was related to a nondeterministic algorithm used by MueLu. The failures were random; sometimes the tests passed and sometimes they didn't. To avoid this we used this in our main solver (and on the sub blocks):
refMaxwell.sublist("smoother: pre params").set("relaxation: mtgs coloring algorithm","serial");
refMaxwell.sublist("smoother: post params").set("relaxation: mtgs coloring algorithm","serial");
Not sure if this is the same problem this time around.
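For context, a self-contained sketch of how those two lines fit into a Teuchos::ParameterList; the surrounding list structure (and the RELAXATION smoother type) is an assumption for this sketch.

#include <Teuchos_ParameterList.hpp>

// Illustrative sketch around the two lines quoted above: pin the multithreaded
// Gauss-Seidel (MTGS) coloring used by the pre/post relaxation smoothers to the
// deterministic "serial" algorithm, removing one source of run-to-run variation.
Teuchos::ParameterList makeRefMaxwellParams() {
  Teuchos::ParameterList refMaxwell;
  refMaxwell.set("smoother: type", "RELAXATION");  // assumed smoother type for this sketch
  refMaxwell.sublist("smoother: pre params").set("relaxation: mtgs coloring algorithm", "serial");
  refMaxwell.sublist("smoother: post params").set("relaxation: mtgs coloring algorithm", "serial");
  return refMaxwell;
}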
I just checked with empire team and it seems to have started showing up using a Trilinos build from Monday Sept 11. So it is new.
@rppawlo As per git, the random pools change, which seems to be causing @spdomin's issues, merged on Tuesday Sept. 12. So EMPIRE's problem is likely something different.
Nalu accepted the diffs. I can report back if we have trouble in the future in production runs.
Description
We have O(50) new diffs showing up in our Nalu nightly gcc testing. One of the tests in our suite has a singular linear system (this is not uncommon in a subset of low-Mach flows, where the number of expected linear solver iterations is <<< number_of_dofs), and I noted the following error:
terminate called after throwing an instance of 'Belos::CGPositiveDefiniteFailure'
This might suggest some Ifpack2 or Belos work being pushed?
I am running a bisect now and will report back as soon as I have the particular push (Nalu was unchanged over this time period; therefore, this is only a Trilinos/develop bisect that requires ~5 steps).
Steps to Reproduce
Good: Trilinos/develop SHA1: cb704921daf0dc1dc00c0d700ec16d1ab32892a9
Bad: Trilinos/develop SHA1: db4a41e9368abd57b8df356453bfd38e8ecc262a
Config: https://github.com/NaluCFD/Nalu/blob/master/build/do-configTrilinos_release