trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Random failures due to jumbled output in TpetraCore_MatrixMarket_Tpetra_CrsMatrix_Dist_Binary_simple_MPI_1 breaking PR builds starting 2022-07-08 #10898

Open bartlettroscoe opened 2 years ago

bartlettroscoe commented 2 years ago

CC: @trilinos/tpetra, @tasmith4

Description

As shown in this query (click "Show Matching Output" in the upper right), the test:

is randomly failing in the builds:

starting testing day 2022-07-08.

Just as with the Tpetra tests reported in issue #10885, these failures are caused by jumbled output breaking up the printing of End Result: TEST PASSED, as shown here:

End RKokkos::Cuda::Cuda instance constructor : ERROR device not initialized
Kokkos::Cuda::Cuda instance constructor : ERROR device not initialized
esult: TEST PASSED
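
To illustrate why the broken-up line causes a failure, here is a minimal C++ sketch. It assumes the pass criterion is a search for the literal string End Result: TEST PASSED in the captured test output; the names outputIndicatesPass and kJumbledOutput are hypothetical and are not part of Trilinos or CTest.

```cpp
#include <string>

// Hypothetical stand-in for the pass criterion (an assumption): the captured
// output must contain the exact, unbroken string "End Result: TEST PASSED".
bool outputIndicatesPass(const std::string& capturedOutput) {
  return capturedOutput.find("End Result: TEST PASSED") != std::string::npos;
}

// The jumbled output copied from the failing test shown above.
const std::string kJumbledOutput =
    "End RKokkos::Cuda::Cuda instance constructor : ERROR device not initialized\n"
    "Kokkos::Cuda::Cuda instance constructor : ERROR device not initialized\n"
    "esult: TEST PASSED\n";
```

With this check, outputIndicatesPass(kJumbledOutput) is false even though every character of the pass string is present, because the Kokkos messages landed in the middle of it.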

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

It is a randomly failing test so it will be hard to reproduce.

bartlettroscoe commented 2 years ago

FYI: This was the only test failure that took out the last iteration of my PR build https://github.com/trilinos/Trilinos/pull/10808#issuecomment-1219053621. I have been trying to get that PR build to pass PR testing for going on 3 weeks now, and random Tpetra test failures have taken out several of those iterations.

tasmith4 commented 2 years ago

@bartlettroscoe this is a little different from the last one, since it's not output deliberately printed by Tpetra. I've noticed it before on other projects as well, but I'm not exactly sure what the root cause is. I'm reaching out to the Kokkos team for more information on this.

tasmith4 commented 2 years ago

@bartlettroscoe from my conversation on the Kokkos Slack, it sounds like this is actually a Kokkos bug, which was resolved in https://github.com/kokkos/kokkos/pull/5151. This fix will be available in Kokkos 3.7 -- I'll leave it up to you whether it's better to wait for Kokkos 3.7 to make it into Trilinos or pull the fix over now.

bartlettroscoe commented 2 years ago

> I'll leave it up to you whether it's better to wait for Kokkos 3.7 to make it into Trilinos or pull the fix over now.

@tasmith4, I think it can wait for the Kokkos upgrade.

However, it would be good to know how many Tpetra tests are failing due to jumbled output. It occurred to me how to search for that, and I think this query does it, showing:

[image: CDash query results]

So between this issue and #10885, I think that catches them all.

csiefer2 commented 2 years ago

FYI - Trilinos PR for Kokkos/KokkosKernels update is supposed to get put in this week (as per Nathan).

bartlettroscoe commented 2 years ago

@tasmith4, @csiefer2, what might help is to carefully flush the streams before and after printing End Result: TEST PASSED. If you are only outputting from one MPI rank, then that may eliminate the jumbled output problem.

tasmith4 commented 2 years ago

@bartlettroscoe I think for most if not all tests we just write to the Teuchos unit test "out" stream, and a lot of that stuff gets handled however the Teuchos unit testing framework/command line options specify (I've never dug super deep into that).

bartlettroscoe commented 2 years ago

> @bartlettroscoe I think for most if not all tests we just write to the Teuchos unit test "out" stream, and a lot of that stuff gets handled however the Teuchos unit testing framework/command line options specify (I've never dug super deep into that).

Right, but that is just a stream. Perhaps we should create a function in TriBITS called Tribits:printEndResultTestPassed() that will do the proper flushing and only print on the root process?

tasmith4 commented 2 years ago

> Perhaps we should create a function in TriBITS called Tribits:printEndResultTestPassed() that will do the proper flushing and only print on the root process?

I could go for that. Could be a lot of work to retrofit existing tests though.

github-actions[bot] commented 1 year ago

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label. If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE. If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.