trilinos / Trilinos

Primary repository for the Trilinos Project
https://trilinos.org/

Random test failures on 'ats2' 'vortex' with no STDOUT/STDERR and return code 255 #7122

Open bartlettroscoe opened 4 years ago

bartlettroscoe commented 4 years ago

CC: @e10harvey

I have been noticing random test failures on the 'ats2' 'vortex' builds where the test shows no output and the return value is 255. For example, just today in the build:

we saw the test failures:

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi | Sacado_trad_sfc_example_MPI_1 | Failed | Completed (Failed) | 1 | 3 | 20 | |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi | Sacado_tradoptest_34_EQA_MPI_1 | Failed | Completed (Failed) | 1 | 3 | 20 | |

These show output like:

```
BEFORE: jsrun  '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/sacado/example/Sacado_trad_sfc_example.exe' '-v'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=1; jsrun  '-M -disable_gpu_hooks' '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/sacado/example/Sacado_trad_sfc_example.exe' '-v'
out_file=2c5e914b8db237218486ae3661bf193a.out
```

and

```
BEFORE: jsrun  '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/sacado/test/tradoptest/Sacado_tradoptest_34_EQA.exe'
AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=1; jsrun  '-M -disable_gpu_hooks' '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg/SRC_AND_BUILD/BUILD/packages/sacado/test/tradoptest/Sacado_tradoptest_34_EQA.exe'
out_file=4e266b5a408c7b3d0ab2a064d3ea0224.out
```

In both of these tests, the return value was '255'.
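For reference, this failure signature can be checked by hand from an interactive allocation on 'vortex' by re-running the AFTER command line shown above (the executable path is abbreviated here; this is just the obvious manual check, not a script from the repo):

```bash
# Hypothetical manual check: re-run the same jsrun line the test harness
# uses and inspect the exit status; failing runs exit 255 with no output.
export TPETRA_ASSUME_CUDA_AWARE_MPI=1
jsrun '-M -disable_gpu_hooks' -p 1 --rs_per_socket 4 \
  ./packages/sacado/example/Sacado_trad_sfc_example.exe -v
echo "jsrun returned: $?"
```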

I believe that these occur in builds where we see mass random failures, like this build today:

which also shows the jsrun error "Error initializing RM connection. Exiting" described in #6861.

bartlettroscoe commented 4 years ago

CC: @jjellio

Currently, I can't filter out these failures because CDash does not provide a way to filter tests based on the return code. But one way to address this is to update the trilinos_jsrun script to print the return code from the jsrun command, and also whether any output from the jsrun command was captured in the temp file. It could print something like this:

```
jsrun return: 255 (lines of STDOUT/STDERR = 0)
```
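A minimal sketch of what that change to trilinos_jsrun could look like follows; the temp-file handling and variable names here are assumptions for illustration, not the actual script:

```bash
#!/bin/bash
# Sketch: run jsrun, capture all output to a temp file, then report the
# return code and the number of lines of STDOUT/STDERR produced, so that
# CDash queries can key off of this one line.
out_file=$(mktemp ./jsrun_out.XXXXXX)
jsrun "$@" > "${out_file}" 2>&1
jsrun_rtn=$?
cat "${out_file}"
num_lines=$(wc -l < "${out_file}")
echo "jsrun return: ${jsrun_rtn} (lines of STDOUT/STDERR = ${num_lines})"
rm -f "${out_file}"
exit "${jsrun_rtn}"
```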

Then we could add a CDash queryTests.php filter field to filter out tests that have the above output. That would reduce the amount of noise we are getting in the 'twif' table in the cdash_analyze_and_report.py emails.
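As a sketch of that idea, the filter URL might look roughly like the one below; the field and compare codes are guesses at CDash's queryTests.php filter syntax, not values taken from an actual query (the real URL should be built from the CDash filter UI):

```bash
# Hypothetical CDash queryTests.php URL matching tests whose output
# contains the new marker line (field and compare codes are assumptions):
curl "https://<cdash-site>/cdash/queryTests.php?project=Trilinos&filtercount=1&showfilters=1&field1=testoutput&compare1=96&value1=jsrun+return"
```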

bartlettroscoe commented 4 years ago

CC: @e10harvey

NOTE: This may or may not be related to the *.out fix in PR #7406, but I think it is more likely related to the mass "Error initializing RM connection. Exiting" failures described in #6861.

bartlettroscoe commented 4 years ago

With the commit https://github.com/trilinos/Trilinos/pull/7427/commits/00373548ed1877ef55129a7a4c91159678c7e264, we can now filter out these failing tests. I have done so with the following commit:

```
*** Base Git Repo: TrilinosATDMStatus
f952849 "Filter out 'jsrun return value: 255' (trilinos/Trilinos#7122)"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Thu May 28 11:39:13 2020 -0400 (7 minutes ago)

M       trilinos_promoted_atdm_builds_status.sh
M       trilinos_specialized_atdm_builds_status.sh
```

Now these should no longer pollute the emails generated by cdash_analyze_and_report.py.

bartlettroscoe commented 4 years ago

Adding the label 'Stalled' to get this off of the main list of issues.

bartlettroscoe commented 3 years ago

After the sysadmins changed 'vortex' to use a private launch node by default for 'bsub', it seems that all of the random jsrun failures are gone. See details in https://github.com/trilinos/Trilinos/issues/6861#issuecomment-722621107.

bartlettroscoe commented 2 years ago

CC: @trilinos/framework

This is still occurring, and it is now failing Trilinos PR builds. See https://github.com/trilinos/Trilinos/pull/10648#issuecomment-1164470776.

bartlettroscoe commented 2 years ago

As shown in this query, this failure has occurred in at least 9 PR builds since 5/1/2022:

e10harvey commented 2 years ago

The framework team is in the process of migrating off of 'ats2'.

bartlettroscoe commented 2 years ago

This issue is still bringing down PR builds. The latest example is https://github.com/trilinos/Trilinos/pull/10796#issuecomment-1193513266.

bartlettroscoe commented 1 year ago

I am going to go ahead and pin this too, since this error is still taking out PR builds because they are still running on this machine.

bartlettroscoe commented 1 year ago

These are happening in the ATDM Trilinos 'vortex' builds as well, as can be seen here:

and it seems this happens in a given build on random days.