This randomly failing test has killed 5 PR build/test iterations in the last 3 months. It also broke the ATDM Trilinos builds 4 times in the last 3 months. That is not super high, but it is something worth looking into.
@hkthorn
@srajama1 @bartlettroscoe The Anasazi code is deterministic, but most of its tests (as with Belos) use random vectors, and those are not deterministic. I will look into this test.
CC: @fryeguy52, @srajama1, @hkthorn
FYI: This just broke the Trilinos-atdm-sems-rhel6-gnu-debug-openmp nightly build this morning, as shown here:
...
projectGen(): testing [X2 Y1]-range multivector against P_{X2,X2} P_{Y1,Y1}
|| <S,Y1> || before : 139287
|| <S,Y2> || before : 4.50888
0|| S_in - X1*C1 - X2*C2 - S_out || : 0
1|| S_in - X1*C1 - X2*C2 - S_out || : 1.08075e-11
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv tolerance exceeded! test failed!
1|| <Y[0],S> || after : 2.94507e-10
2|| S_in - X1*C1 - X2*C2 - S_out || : 1.08075e-11
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv tolerance exceeded! test failed!
2|| <Y[0],S> || after : 2.94507e-10
3|| S_in - X1*C1 - X2*C2 - S_out || : 1.08075e-11
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv tolerance exceeded! test failed!
...
NOTE: We are seeing this in other ATDM Trilinos builds as well; for example, the query shown here shows the failure in the following builds on these days:
Site | Build Name | Test Name | Status | Time | Details | Build Time |
---|---|---|---|---|---|---|
cee-rhel6 | Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt | Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 | Failed | 17s 930ms | Completed (Failed) | 2018-11-26T09:04:41 UTC |
sems-rhel6 | Trilinos-atdm-sems-rhel6-gnu-debug-serial | Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 | Failed | 630ms | Completed (Failed) | 2018-11-15T04:28:01 UTC |
FYI: This test failed in Trilinos-atdm-sems-rhel6-gnu-opt-serial on 12/3/2018 and in Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt on 11/26/2018 (see here).
@hkthorn This seems to be failing in serial as well, which makes me worry about this one.
@bartlettroscoe: Which app in ATDM is enabling Anasazi? Do they really need Anasazi?
Which app in ATDM is enabling Anasazi? Do they really need Anasazi?
@srajama1, currently SPARC explicitly enables and pulls in Anasazi. Whether they actually need it or not is a question for @micahahoward or perhaps @mhoemmen.
@srajama1, also, EMPIRE implicitly enables Anasazi through Panzer. @rppawlo should know if EMPIRE is actually using Anasazi.
EMPIRE is not using Anasazi. I don't believe there are any tests in Panzer that use Anasazi either. It is probably coming in as an optional dependency of LOCA/NOX.
FYI: The Trilinos CI build just got bit by this again just now, as shown here. Therefore, this will also be breaking Trilinos PR test iterations randomly going forward. As shown in this query, you can see this breaking PR builds as recently as today and on 2018-11-29.
And as @fryeguy52 noted above, this fairly recently broke some (promoted) ATDM builds.
Once we can implement support for the "randomly_failing" and "ok_to_fail" fields in the CDash analysis tool being developed in #2933, we will classify this test as randomly_failing=2 (i.e., very rarely randomly fails) and ok_to_fail=1 (don't trigger a global FAIL). That will remove this failing test as a problem for ATDM.
@bartlettroscoe I think I have a fix for this. It is a test that generates a biorthogonal basis. The dimensions used add up to something close enough to the dimension of the operator that generating a degenerate orthogonal subspace from random vectors is much more likely than with the two other orthogonalization tests. I have experimented with this test on the CEE, where I have a binary that fails often enough for me to explore the issue. I have reduced the subspace sizes for the orthogonalization test and that failure seems to stop occurring on this particular platform. I would like to check in this modification.
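For illustration only, here is a minimal NumPy sketch of the effect described above. This is not the Anasazi test code, and the operator dimension and block sizes below are made-up assumptions; the point is just that when the existing bases plus the random block nearly fill the whole space, the block that remains after projection can be far worse conditioned, so a strict orthogonalization tolerance will occasionally be exceeded.

```python
# Minimal NumPy sketch (illustrative only; not the Anasazi/Epetra test code).
# Assumption: the dimensions below are invented to show the effect described
# above, not the actual block sizes used by the OrthoManagerGenTester test.
import numpy as np

def smallest_sv_after_projection(n, dim_X, dim_S, rng):
    """Project a random n x dim_S block against an orthonormal n x dim_X basis
    and return the smallest singular value of what is left."""
    X, _ = np.linalg.qr(rng.standard_normal((n, dim_X)))   # orthonormal basis
    S = rng.standard_normal((n, dim_S))                     # random block
    S_proj = S - X @ (X.T @ S)                              # one Gram-Schmidt sweep
    return np.linalg.svd(S_proj, compute_uv=False).min()

rng = np.random.default_rng(0)
n = 100
for dim_X, dim_S, label in [(90, 10, "tight fit (dim_X + dim_S == n)"),
                            (60, 10, "plenty of slack")]:
    worst = min(smallest_sv_after_projection(n, dim_X, dim_S, rng)
                for _ in range(200))
    print(f"{label}: worst smallest singular value over 200 trials = {worst:.2e}")

# The "tight fit" case occasionally produces a projected block that is orders
# of magnitude worse conditioned than the "slack" case; that is the kind of
# random near-degeneracy that reducing the subspace sizes is meant to avoid.
```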
I have reduced the subspace sizes for the orthogonalization test and that failure seems to stop occurring on this particular platform. I would like to check in this modification.
@hkthorn, excellent! Thanks for looking into this. Please create a PR for this and I will review and approve it (unless you want another Belos developer like @mhoemmen to review it).
@bartlettroscoe This is such a minor change, I don't think we need to bother @mhoemmen with it.
@bartlettroscoe @srajama1 Thanks for your patience on this one. I finally found a platform that displayed the issue with some regularity so that I could explore it with some confidence.
PR #4052 with likely fix merged to 'develop' on 12/18/2018. Next: Let it run for two months and then close if no failures by 2/18/2019 ...
@hkthorn Thanks for taking care of this. @bartlettroscoe If it is ok I recommend closing this and reopening if it still happens again.
@srajama1 said
@hkthorn Thanks for taking care of this. @bartlettroscoe If it is ok I recommend closing this and reopening if it still happens again.
Okay, I set up a calendar reminder to check on this. If we find there were more random failures by then (or find out sooner than that), we will re-open this issue.
@hkthorn, thanks for fixing this!
(We are still trying to figure out the best way to address these only very occasionally randomly failing tests. Hopefully this process of optimistically closing the issue but setting up a reminder to check back is a good one.)
Closing the issue ...
From: Bartlett, Roscoe A
Sent: Friday, December 14, 2018 3:16 PM
To: Frye, Joe
Subject: #3585: Verify that Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 has not failed since 12/14/2018
When: Monday, February 11, 2019 10:00 AM-10:30 AM (UTC-05:00) Eastern Time (US & Canada).
Where: Reminder
This calendar reminder is to check the query:
and verify that there were no more random failures like those seen in:
https://github.com/trilinos/Trilinos/issues/3585
NOTE: The issue is already closed so just add a comment in the closed issue that indeed there have been no more random failures. If you do see any new random failures like this, please reopen the issue and add a comment about the failures.
@hkthorn Y'all let me have a couple days off ;-)
This test failed again in the build Trilinos-atdm-sems-rhel6-intel-opt-openmp yesterday (1/6/2019), as shown here:
projectAndNormalizeGen() returned rank 3
|| <S,S> - I || after : 2.05042e-12
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv tolerance exceeded! test failed!
19|| S_in - X1*C1 - X2*C2 - S_out*B || : 1.01754e-08
19|| <Y[0],S> || after : 3.04764e-12
And looking at this query, there have been 5 failures since 12/14/2018, including one in a PR test build.
Reopening this issue :-(
This test failed 3 times in the last month in the build Trilinos-atdm-sems-rhel7-gnu-7.2.0-openmp-complex-shared-release-debug (see here).
Looking closer, this test has failed over a dozen times in 8 different Trilinos builds in the last month, as shown here.
Anasazi is not used by ATDM, so failures in Anasazi tests are not an ATDM issue. Therefore, I am removing the 'client: ATDM' label to get this off of our list of active issues for ATDM. (Really, we should not even be running Anasazi tests in ATDM Trilinos builds.)
FYI: These tests have been disabled in ATDM Trilinos testing as per:
Adding "Disabled Tests" label.
@trilinos/framework, this test took out a PR testing iteration as shown in https://github.com/trilinos/Trilinos/pull/6641#issuecomment-579044999. It showed:
projectAndNormalizeGen() returned rank 3
|| <S,S> - I || after : 1.05981e-12
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv tolerance exceeded! test failed!
This test should be removed from all PR iterations.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
@trilinos/anasazi
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
This issue was closed due to inactivity for 395 days.
CC: @trilinos/framework, @trilinos/anasazi, @srajama1 (Trilinos Linear Solver Product Area Lead)
Next Action Status
PR #4052 merged to 'develop' on 12/18/2018 but still failing after that. Next: Try to fix again?
Description
It would seem that the test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 is very occasionally randomly failing in various builds. As shown in this query, this test failed 10 times since 7/1/2018 in the builds:
- Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_OPENMP_CI (post-push CI build): 1 time (today)
- PR-XXXX-test-Trilinos_pullrequest_gcc_4.9.3-YYYY (standard PR build): 4 times
- PR-XXXX-test-Trilinos_pullrequest_gcc_4.8.4-YYYY (standard PR build): 1 time
- Trilinos-atdm-chama-intel-debug-openmp (standard ATDM build): 1 time
- Trilinos-atdm-rhel6-gnu-opt-openmp (standard ATDM build): 2 times
- Trilinos-atdm-waterman-cuda-9.2-debug (standard ATDM build): 1 time

In each of these 10 failures in the last 3 months, such as the CI failure today shown here, it shows failures like:
The location of these failures within the test seems to change, but all of the failures appear to be of the form "tolerance exceeded! test failed!"
Is there some type of non-deterministic behavior in this test or in the underlying Anasazi code that allows for these types of random failures?
Steps to Reproduce
Given that this test seems to be failing randomly only very occasionally, it might be hard to reproduce locally. But given that it has failed in the post-push GCC 4.8.4 CI build and the GCC 4.9.3 PR build, one might be able to use one of those configurations.
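One hedged option for attempting a local reproduction from an existing build tree of one of those configurations is simply to rerun the test many times and stop at the first failure. The helper below is hypothetical (not an official Trilinos script); it assumes it is run from the build directory and that the test name matches exactly.

```python
# Hypothetical helper, not an official Trilinos script: rerun the test many
# times from an existing build tree and stop at the first failure.
import subprocess

TEST = "Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4"  # assumed exact ctest name

for i in range(1, 201):
    # -R selects tests by regex; --output-on-failure prints the failing log.
    result = subprocess.run(
        ["ctest", "-R", TEST, "--output-on-failure"],
        capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Failure reproduced on iteration {i}:")
        print(result.stdout)
        break
else:
    print("No failure reproduced in 200 iterations.")
```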