Open bartlettroscoe opened 4 years ago
CC: @jjellio
Currently, I can't filter out these failures because CDash does not provide a way to filter tests based on the return code. But one way to address this is to update the trilinos_jsrun script to print the return code from the jsrun command and also print whether any output from the jsrun command was written to the temp file. It could print something like this:
jsrun return: 255 (lines of STDOUT/STDERR = 0)
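A minimal sketch of that idea in Python, for illustration only. It is not the actual trilinos_jsrun script; the function name `run_and_summarize` is hypothetical, and the summary line simply follows the format proposed above:

```python
import subprocess
import sys
import tempfile


def run_and_summarize(cmd):
    """Run cmd with STDOUT/STDERR sent to a temp file.

    Returns (return_code, num_output_lines, output, summary_line).
    Hypothetical helper; the real wrapper script may work differently.
    """
    with tempfile.TemporaryFile(mode="w+") as tmp:
        # Send both streams to the same temp file so they can be counted together.
        rc = subprocess.call(cmd, stdout=tmp, stderr=subprocess.STDOUT)
        tmp.seek(0)
        output = tmp.read()
    num_lines = len(output.splitlines())
    summary = "jsrun return: {} (lines of STDOUT/STDERR = {})".format(rc, num_lines)
    return rc, num_lines, output, summary


if __name__ == "__main__":
    # Wrap whatever command is given on the command line and echo its output,
    # followed by the summary line that a CDash filter could match on.
    rc, _, output, summary = run_and_summarize(sys.argv[1:])
    sys.stdout.write(output)
    print(summary)
    sys.exit(rc)
```

Printing the line count next to the return code lets a text filter tell a silent launcher failure apart from a test that really ran and failed.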
Then we could add a CDash queryTests.php filter field to filter out tests that have the above output. That would reduce the amount of noise we are getting in the 'twif' table in the cdash_analyze_and_report.py emails.
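The filtering step can be sketched offline as follows. This is only an illustration of the matching logic, not the CDash queryTests.php filter itself; the function name, the dict layout with `name` and `output` keys, and the sample test names are all assumptions made for the example:

```python
import re

# Matches the summary line proposed above. The optional "value" is an
# assumption, so that a variant wording like "jsrun return value: 255"
# would also match.
SILENT_JSRUN_255 = re.compile(r"jsrun return(?: value)?: 255\b")


def split_silent_jsrun_failures(tests):
    """Split tests into (kept, filtered) based on their captured output.

    `tests` is a list of dicts with hypothetical keys "name" and "output".
    """
    kept, filtered = [], []
    for test in tests:
        if SILENT_JSRUN_255.search(test.get("output", "")):
            filtered.append(test)
        else:
            kept.append(test)
    return kept, filtered


if __name__ == "__main__":
    sample = [
        {"name": "test_a", "output": "jsrun return: 255 (lines of STDOUT/STDERR = 0)"},
        {"name": "test_b", "output": "some real failure\njsrun return: 1 (lines of STDOUT/STDERR = 1)"},
    ]
    kept, filtered = split_silent_jsrun_failures(sample)
    print("kept:", [t["name"] for t in kept])
    print("filtered:", [t["name"] for t in filtered])
```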
CC: @e10harvey
NOTE: This may or may not be related to the *.out fix in PR #7406, but I think it is more likely related to the mass "Error initializing RM connection. Exiting" failures described in #6861.
With the commit https://github.com/trilinos/Trilinos/pull/7427/commits/00373548ed1877ef55129a7a4c91159678c7e264, we can now filter out these failing tests. I have done so with the commit:
*** Base Git Repo: TrilinosATDMStatus
f952849 "Filter out 'jsrun return value: 255' (trilinos/Trilinos#7122)"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date: Thu May 28 11:39:13 2020 -0400 (7 minutes ago)
M trilinos_promoted_atdm_builds_status.sh
M trilinos_specialized_atdm_builds_status.sh
Now these should not be polluting the cdash_analyze_and_report.py generated emails anymore.
Adding label Stalled to get this off of the main list of issues.
After the sysadmins changed 'vortex' to use a private launch node by default for 'bsub', it seems that all random jsrun failures are gone. See details in https://github.com/trilinos/Trilinos/issues/6861#issuecomment-722621107.
CC: @trilinos/framework
This is still occurring and it is failing Trilinos PR builds now. See https://github.com/trilinos/Trilinos/pull/10648#issuecomment-1164470776.
As shown in this query, you can see this failure has occurred in at least 9 PR builds since 5/1/2022:
Framework is in the process of migrating off of ats2.
This issue is still bringing down PR builds. The latest example is https://github.com/trilinos/Trilinos/pull/10796#issuecomment-1193513266.
I am going to go ahead and pin this too, since this error is still taking out PR builds because they are still running on this machine.
These are happening in the ATDM Trilinos 'vortex' builds as well, as can be seen here, and it seems this happens in a given build on random days.
CC: @e10harvey
I have been noticing random test failures on the 'ats2' 'vortex' builds where the test shows no output and the return value is 255. For example, just today in the build:
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi
we saw the test failures:
These show output like:
and
In both of these tests, the return value was '255'.
I believe that these occur in builds where we see mass random failures like for this build today:
that also show the jsrun errors "Error initializing RM connection. Exiting" as described in #6861.