This randomly failing test has killed 5 PR build/test iterations in the last 3 months. It also broke the ATDM Trilinos builds 4 times in the last 3 months. That is not super high, but it is something worth looking into.
@hkthorn
@srajama1 @bartlettroscoe The Anasazi code is deterministic, but most of its tests (as with Belos) use random vectors, and those are not deterministic. I will look into this test.
CC: @fryeguy52, @srajama1, @hkthorn
FYI: This just broke the Trilinos-atdm-sems-rhel6-gnu-debug-openmp nightly build this morning, as shown here:
...
projectGen(): testing [X2 Y1]-range multivector against P_{X2,X2} P_{Y1,Y1}
|| <S,Y1> || before : 139287
|| <S,Y2> || before : 4.50888
0|| S_in - X1*C1 - X2*C2 - S_out || : 0
1|| S_in - X1*C1 - X2*C2 - S_out || : 1.08075e-11
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv tolerance exceeded! test failed!
1|| <Y[0],S> || after : 2.94507e-10
2|| S_in - X1*C1 - X2*C2 - S_out || : 1.08075e-11
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv tolerance exceeded! test failed!
2|| <Y[0],S> || after : 2.94507e-10
3|| S_in - X1*C1 - X2*C2 - S_out || : 1.08075e-11
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv tolerance exceeded! test failed!
...
NOTE: We are seeing this in other ATDM Trilinos builds as well; for example, the query shown here shows the failure in the following builds on these days:
Site | Build Name | Test Name | Status | Time | Details | Build Time |
---|---|---|---|---|---|---|
cee-rhel6 | Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt | Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 | Failed | 17s 930ms | Completed (Failed) | 2018-11-26T09:04:41 UTC |
sems-rhel6 | Trilinos-atdm-sems-rhel6-gnu-debug-serial | Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 | Failed | 630ms | Completed (Failed) | 2018-11-15T04:28:01 UTC |
FYI: This test failed in Trilinos-atdm-sems-rhel6-gnu-opt-serial on 12/3/2018 and in Trilinos-atdm-cee-rhel6-gnu-7.2.0-openmpi-1.10.2-serial-static-opt on 11/26/2018 (see here).
@hkthorn This seems to be failing in serial as well, which makes me worry about this one.
@bartlettroscoe: Which app in ATDM is enabling Anasazi? Do they really need Anasazi?
Which app in ATDM is enabling Anasazi? Do they really need Anasazi?
@srajama1, currently SPARC explicitly enables and pulls in Anasazi. Whether they actually need it or not is a question for @micahahoward or perhaps @mhoemmen.
@srajama1, also, EMPIRE implicitly enables Anasazi through Panzer. @rppawlo should know if EMPIRE is actually using Anasazi.
EMPIRE is not using Anasazi. I don't believe there are any tests in Panzer that use Anasazi either. It is probably coming in as an optional dependency of LOCA/NOX.
FYI: The Trilinos CI build just got bit by this again just now, as shown here. Therefore, this will also be breaking Trilinos PR test iterations randomly going forward. As shown in this query, you can see this breaking PR builds as recently as today and on 2018-11-29.
And as @fryeguy52 noted above, this fairly recently broke some (promoted) ATDM builds.
Once we can implement support for the "randomly_failing" and "ok_to_fail" fields in the CDash analysis tool being developed in #2933, we will classify this test as randomly_failing=2 (i.e., very rarely randomly fails) and ok_to_fail=1 (don't trigger a global FAIL). That will remove this failing test as a problem for ATDM.
@bartlettroscoe I think I have a fix for this. It is a test that generates a biorthogonal basis. The dimensions used add up to something close enough to the dimension of the operator that generating a degenerate orthogonal subspace from random vectors is much more likely than with the two other orthogonalization tests. I have experimented with this test on the CEE, where I have a binary that fails often enough for me to explore the issue. I have reduced the subspace sizes for the orthogonalization test and that failure seems to stop occurring on this particular platform. I would like to check in this modification.
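For illustration only, here is a minimal NumPy sketch of the effect described above. This is not the Anasazi test code, and the operator dimension and block sizes below are made-up assumptions; the point is just that when the existing bases plus the random block nearly fill the whole space, the block that remains after projection can be far worse conditioned, so a strict orthogonalization tolerance will occasionally be exceeded.

```python
# Minimal NumPy sketch (illustrative only; not the Anasazi/Epetra test code).
# Assumption: the dimensions below are invented to show the effect described
# above, not the actual block sizes used by the OrthoManagerGenTester test.
import numpy as np

def smallest_sv_after_projection(n, dim_X, dim_S, rng):
    """Project a random n x dim_S block against an orthonormal n x dim_X basis
    and return the smallest singular value of what is left."""
    X, _ = np.linalg.qr(rng.standard_normal((n, dim_X)))   # orthonormal basis
    S = rng.standard_normal((n, dim_S))                     # random block
    S_proj = S - X @ (X.T @ S)                              # one Gram-Schmidt sweep
    return np.linalg.svd(S_proj, compute_uv=False).min()

rng = np.random.default_rng(0)
n = 100
for dim_X, dim_S, label in [(90, 10, "tight fit (dim_X + dim_S == n)"),
                            (60, 10, "plenty of slack")]:
    worst = min(smallest_sv_after_projection(n, dim_X, dim_S, rng)
                for _ in range(200))
    print(f"{label}: worst smallest singular value over 200 trials = {worst:.2e}")

# The "tight fit" case occasionally produces a projected block that is orders
# of magnitude worse conditioned than the "slack" case; that is the kind of
# random near-degeneracy that reducing the subspace sizes is meant to avoid.
```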
I have reduced the subspace sizes for the orthogonalization test and that failure seems to stop occurring on this particular platform. I would like to check in this modification.
@hkthorn, excellent! Thanks for looking into this. Please create a PR for this and I will review and approve it (unless you want another Belos developer like @mhoemmen to review it).
@bartlettroscoe This is such a minor change, I don't think we need to bother @mhoemmen with it.
@bartlettroscoe @srajama1 Thanks for your patience on this one. I finally found a platform that displayed the issue with some regularity so that I could explore it with some confidence.
PR #4052 with likely fix merged to 'develop' on 12/18/2018. Next: Let it run for two months and then close if no failures by 2/18/2019 ...
@hkthorn Thanks for taking care of this. @bartlettroscoe If it is ok I recommend closing this and reopening if it still happens again.
@srajama1 said
@hkthorn Thanks for taking care of this. @bartlettroscoe If it is ok I recommend closing this and reopening if it still happens again.
Okay, I set up a calendar reminder to check on this. If we find there were more random failures by then (or find out sooner than that), we will re-open this issue.
@hkthorn, thanks for fixing this!
(We are still trying to figure out the best way to address these only very occasionally randomly failing tests. Hopefully this process of optimistically closing the issue but setting up a reminder to check back is a good one.)
Closing the issue ...
From: Bartlett, Roscoe A
Sent: Friday, December 14, 2018 3:16 PM
To: Frye, Joe
Subject: #3585: Verify that Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 has not failed since 12/14/2018
When: Monday, February 11, 2019 10:00 AM-10:30 AM (UTC-05:00) Eastern Time (US & Canada).
Where: Reminder
This calendar reminder is to check the query:
and verify that there were no more random failures like those seen in:
https://github.com/trilinos/Trilinos/issues/3585
NOTE: The issue is already closed so just add a comment in the closed issue that indeed there have been no more random failures. If you do see any new random failures like this, please reopen the issue and add a comment about the failures.
@hkthorn Y'all let me have a couple days off ;-)
This test failed again in the build Trilinos-atdm-sems-rhel6-intel-opt-openmp yesterday (1/6/2019), as shown here:
projectAndNormalizeGen() returned rank 3
|| <S,S> - I || after : 2.05042e-12
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv tolerance exceeded! test failed!
19|| S_in - X1*C1 - X2*C2 - S_out*B || : 1.01754e-08
19|| <Y[0],S> || after : 3.04764e-12
And looking at this query, there have been 5 failures since 12/14/2018, including one in a PR test build.
Reopening this issue :-(
This test failed 3 times in the last month in the build Trilinos-atdm-sems-rhel7-gnu-7.2.0-openmp-complex-shared-release-debug (see here).
Looking closer, this test has failed over a dozen times in 8 different Trilinos builds in the last month, as shown here.
Anasazi is not used by ATDM, so failures in Anasazi tests are not an ATDM issue. Therefore, I am removing the 'client: ATDM' label to get this off of our list of active issues for ATDM. (Really, we should not even be running Anasazi tests in ATDM Trilinos builds.)
FYI: These tests have been disabled in ATDM Trilinos testing as per:
Adding "Disabled Tests" label.
@trilinos/framework, this test took out a PR testing iteration as shown in https://github.com/trilinos/Trilinos/pull/6641#issuecomment-579044999. It showed:
projectAndNormalizeGen() returned rank 3
|| <S,S> - I || after : 1.05981e-12
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv tolerance exceeded! test failed!
This test should be removed from all PR iterations.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
@trilinos/anasazi
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.
This issue was closed due to inactivity for 395 days.
CC: @trilinos/framework, @trilinos/anasazi, @srajama1 (Trilinos Linear Solver Product Area Lead)
Next Action Status
PR #4052 merged to 'develop' on 12/18/2018 but still failing after that. Next: Try to fix again?
Description
It would seem that the test Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4 is very occasionally randomly failing in various builds. As shown in this query, this test failed 10 times since 7/1/2018 in the builds:
- Linux-GCC-4.8.4-MPI_RELEASE_DEBUG_SHARED_PT_OPENMP_CI (post-push CI build): 1 time (today)
- PR-XXXX-test-Trilinos_pullrequest_gcc_4.9.3-YYYY (standard PR build): 4 times
- PR-XXXX-test-Trilinos_pullrequest_gcc_4.8.4-YYYY (standard PR build): 1 time
- Trilinos-atdm-chama-intel-debug-openmp (standard ATDM build): 1 time
- Trilinos-atdm-rhel6-gnu-opt-openmp (standard ATDM build): 2 times
- Trilinos-atdm-waterman-cuda-9.2-debug (standard ATDM build): 1 time

In each of these 10 failures in the last 3 months, such as the CI failure today shown here, it shows failures like:
The location of these failures within the test seems to change, but all of the failures appear to be of the form "tolerance exceeded! test failed!"
Is there some type of non-deterministic behavior in this test or in the underlying Anasazi code that allows for these types of random failures?
Steps to Reproduce
Given that this test seems to be failing randomly only very occasionally, it might be hard to reproduce locally. But given that it has failed in the post-push GCC 4.8.4 CI build and the GCC 4.9.3 PR build, one might be able to use one of those configurations.
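One hedged option for attempting a local reproduction from an existing build tree of one of those configurations is simply to rerun the test many times and stop at the first failure. The helper below is hypothetical (not an official Trilinos script); it assumes it is run from the build directory and that the test name matches exactly.

```python
# Hypothetical helper, not an official Trilinos script: rerun the test many
# times from an existing build tree and stop at the first failure.
import subprocess

TEST = "Anasazi_Epetra_OrthoManagerGenTester_0_MPI_4"  # assumed exact ctest name

for i in range(1, 201):
    # -R selects tests by regex; --output-on-failure prints the failing log.
    result = subprocess.run(
        ["ctest", "-R", TEST, "--output-on-failure"],
        capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Failure reproduced on iteration {i}:")
        print(result.stdout)
        break
else:
    print("No failure reproduced in 200 iterations.")
```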