## Next Action Status
## Description
As shown in [this query over many days](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&begin=2020-01-01&end=2020-12-14&filtercount=10&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-ats2&field3=testname&compare3=65&value3=SEACAS&field4=status&compare4=62&value4=passed&field5=testoutput&compare5=94&value5=Error%20initializing%20RM%20connection.%20Exiting&field6=testoutput&compare6=96&value6=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field7=testoutput&compare7=96&value7=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp&field8=testoutput&compare8=96&value8=cudaMallocManaged.*cudaErrorUnknown.*unknown%20error.*Sacado_DynamicArrayTraits.hpp&field9=testoutput&compare9=96&value9=srun%3A%20error.*launch%20failed%3A%20Error%20configuring%20interconnect&field10=testoutput&compare10=97&value10=(0a1%2C2%7C0a1%2C3%7C2%2C3c2%2C3%7C1%2C2c1%2C2%7C0a1%2C4)) and [this query for 
2020-12-14](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2020-12-14&filtercount=10&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-ats2&field3=testname&compare3=65&value3=SEACAS&field4=status&compare4=62&value4=passed&field5=testoutput&compare5=94&value5=Error%20initializing%20RM%20connection.%20Exiting&field6=testoutput&compare6=96&value6=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field7=testoutput&compare7=96&value7=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp&field8=testoutput&compare8=96&value8=cudaMallocManaged.*cudaErrorUnknown.*unknown%20error.*Sacado_DynamicArrayTraits.hpp&field9=testoutput&compare9=96&value9=srun%3A%20error.*launch%20failed%3A%20Error%20configuring%20interconnect&field10=testoutput&compare10=97&value10=(0a1%2C2%7C0a1%2C3%7C2%2C3c2%2C3%7C1%2C2c1%2C2%7C0a1%2C4)) the tests:
* `SEACASAprepro_aprepro_array_test`
* `SEACASAprepro_aprepro_command_line_include_test`
* `SEACASAprepro_aprepro_command_line_vars_test`
* `SEACASAprepro_aprepro_test_dump_reread`
* `SEACASAprepro_aprepro_unit_test`
* `SEACASAprepro_lib_aprepro_lib_array_test`
* `SEACASAprepro_lib_aprepro_lib_unit_test`
* `SEACASIoss_structured_cgns_assembly_copy`
* `SEACASIoss_structured_cgns_assembly_copy_fpp`
in the builds:
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt`
* `Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi`
* `Trilinos-atdm-ats2-gnu-7.3.1-spmpi-rolling_serial_static_dbg`
* `Trilinos-atdm-ats2-gnu-7.3.1-spmpi-rolling_serial_static_opt`
started failing on testing day 2020-08-20.
These all appear to be diff failures caused by extra lines that the `trilinos_jsrun` script writes to the test output, as shown [here](https://testing.sandia.gov/cdash/test/46981887):
```
================================================================================
TEST_2
Running: "diff" "-w" "/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-rolling_serial_static_opt/SRC_AND_BUILD/Trilinos/packages/seacas/applications/aprepro/test-array.stderr.gold" "/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-rolling_serial_static_opt/SRC_AND_BUILD/BUILD/packages/seacas/applications/aprepro/test-array.stderr"
--------------------------------------------------------------------------------
0a1,3
> WARNING, you have not set TPETRA_ASSUME_CUDA_AWARE_MPI=0 or 1, defaulting to TPETRA_ASSUME_CUDA_AWARE_MPI=0
> BEFORE: jsrun '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-rolling_serial_static_opt/SRC_AND_BUILD/BUILD/packages/seacas/applications/aprepro/aprepro' '-q' '--include=/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-rolling_serial_static_opt/SRC_AND_BUILD/Trilinos/packages/seacas/applications/aprepro' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-rolling_serial_static_opt/SRC_AND_BUILD/Trilinos/packages/seacas/applications/aprepro/test-array.i' 'test-array.out'
> AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=0; jsrun '-M -disable_gpu_hooks' '-p' '1' '--rs_per_socket' '4' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-rolling_serial_static_opt/SRC_AND_BUILD/BUILD/packages/seacas/applications/aprepro/aprepro' '-q' '--include=/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-rolling_serial_static_opt/SRC_AND_BUILD/Trilinos/packages/seacas/applications/aprepro' '/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-gnu-7.3.1-spmpi-rolling_serial_static_opt/SRC_AND_BUILD/Trilinos/packages/seacas/applications/aprepro/test-array.i' 'test-array.out'
15a19
> jsrun return value: 0
--------------------------------------------------------------------------------
TEST_2: Return code = 1
TEST_2: Pass criteria = Zero return code [FAILED]
TEST_2: Result = FAILED
================================================================================
```
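To make the failure mode concrete, here is a minimal sketch of how the wrapper lines could be stripped before the comparison. It is only an illustration, not something proposed in this issue: the file contents are hypothetical stand-ins for `test-array.stderr.gold` and `test-array.stderr`, and the line prefixes in `wrapper_re` are copied from the diff output above.

```shell
# Sketch only: filter out lines added by the trilinos_jsrun wrapper, then diff.
workdir=$(mktemp -d)

# Hypothetical stand-ins for the gold file and the captured stderr.
printf '%s\n' 'expected line A' 'expected line B' > "$workdir/gold.stderr"
printf '%s\n' \
  'WARNING, you have not set TPETRA_ASSUME_CUDA_AWARE_MPI=0 or 1, defaulting to TPETRA_ASSUME_CUDA_AWARE_MPI=0' \
  "BEFORE: jsrun '-p' '1' '--rs_per_socket' '4' aprepro" \
  "AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=0; jsrun '-M -disable_gpu_hooks' '-p' '1' '--rs_per_socket' '4' aprepro" \
  'expected line A' \
  'expected line B' \
  'jsrun return value: 0' > "$workdir/actual.stderr"

# Line prefixes taken from the diff output shown in this issue.
wrapper_re='^(WARNING, you have not set TPETRA_ASSUME_CUDA_AWARE_MPI|BEFORE: jsrun |AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI|jsrun return value: )'

# Keep only the lines that do not come from the wrapper.
grep -v -E "$wrapper_re" "$workdir/actual.stderr" > "$workdir/filtered.stderr"

# Compare the gold file against the raw and the filtered output.
raw_status=0
diff -w "$workdir/gold.stderr" "$workdir/actual.stderr" > /dev/null || raw_status=$?
filtered_status=0
diff -w "$workdir/gold.stderr" "$workdir/filtered.stderr" > /dev/null || filtered_status=$?

echo "raw diff status: $raw_status, filtered diff status: $filtered_status"
rm -rf "$workdir"
```

With these stand-in files the raw comparison fails and the filtered one succeeds. A filter like this would need updating whenever the wrapper's messages change, so it is a stopgap at best.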
## Current Status on CDash
Run [this query](https://testing.sandia.gov/cdash/queryTests.php?project=Trilinos&filtercount=10&showfilters=1&filtercombine=and&field1=groupname&compare1=62&value1=Experimental&field2=buildname&compare2=65&value2=Trilinos-atdm-ats2&field3=testname&compare3=65&value3=SEACAS&field4=status&compare4=62&value4=passed&field5=testoutput&compare5=94&value5=Error%20initializing%20RM%20connection.%20Exiting&field6=testoutput&compare6=96&value6=srun%3A%20error%3A%20s_p_parse_file%3A%20unable%20to%20read%20.%2Fetc%2Fslurm%2Fslurm.conf.%3A%20Permission%20denied&field7=testoutput&compare7=96&value7=cudaGetDeviceCount.*cudaErrorUnknown.*unknown%20error.*Kokkos_Cuda_Instance.cpp&field8=testoutput&compare8=96&value8=cudaMallocManaged.*cudaErrorUnknown.*unknown%20error.*Sacado_DynamicArrayTraits.hpp&field9=testoutput&compare9=96&value9=srun%3A%20error.*launch%20failed%3A%20Error%20configuring%20interconnect&field10=testoutput&compare10=97&value10=(0a1%2C2%7C0a1%2C3%7C2%2C3c2%2C3%7C1%2C2c1%2C2%7C0a1%2C4)).
## Steps to Reproduce
One should be able to reproduce this failure as described in:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md
and the system-specific instructions at:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system
Just log into any of the associated machines, copy the full CDash build name `<build-name>` listed above, and run commands like:
```
$ cd <some_build_dir>/
$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>
$ cmake \
-GNinja \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<Package>=ON \
$TRILINOS_DIR
$ make NP=16
$ ctest -j4
```
where `<Package>` is any package that you want to enable to reproduce build and/or test results.
Again, for exact system-specific details on what commands to run to build and run tests, see:
* https://github.com/trilinos/Trilinos/blob/develop/cmake/std/atdm/README.md#specific-instructions-for-each-system
And if you can't figure out what commands to run to produce the issue given the above-referenced documentation, please post a comment here and we will give you the exact minimal commands to reproduce the failures.
Does it make sense for the test runner on this system to set the TPETRA_ASSUME_CUDA_AWARE_MPI=0 value that would avoid the extraneous message?
I realize that a test that depends on stdout/stderr output is not robust, and I am looking at modifying the code so that the desired information is written to a file instead of stderr/stdout, but I also think that defining that environment variable would help quiet the warning...
> Does it make sense for the test runner on this system to set the TPETRA_ASSUME_CUDA_AWARE_MPI=0 value that would avoid the extraneous message?
@gsjaardema: For CUDA builds, the test runner on this system runs the entire test suite with both TPETRA_ASSUME_CUDA_AWARE_MPI=0 (as shown here) and TPETRA_ASSUME_CUDA_AWARE_MPI=1 (as shown here). This is done to cover both code paths since spmpi is not CUDA-aware. Checking the most recent CDash results shows that these tests are failing both with TPETRA_ASSUME_CUDA_AWARE_MPI=0 and with TPETRA_ASSUME_CUDA_AWARE_MPI=1. Perhaps something is amiss with the test runner; can you check whether setting TPETRA_ASSUME_CUDA_AWARE_MPI=0 fixes the problem?
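The following sketch may help in reasoning about why the tests fail with either setting. `model_wrapper` is a hypothetical model inferred only from the log in the description; it is not the real `trilinos_jsrun` source, and the assumption that the warning is printed only when the variable is unset comes from the wording of the warning itself.

```shell
# Hypothetical model of the wrapper's stderr output, inferred from the log
# in the description. This is NOT the actual trilinos_jsrun implementation.
model_wrapper() {
  # Assumption: the warning appears only when the variable is unset.
  if [ -z "${TPETRA_ASSUME_CUDA_AWARE_MPI:-}" ]; then
    echo "WARNING, you have not set TPETRA_ASSUME_CUDA_AWARE_MPI=0 or 1, defaulting to TPETRA_ASSUME_CUDA_AWARE_MPI=0" >&2
    TPETRA_ASSUME_CUDA_AWARE_MPI=0
  fi
  echo "BEFORE: jsrun $*" >&2
  echo "AFTER: export TPETRA_ASSUME_CUDA_AWARE_MPI=${TPETRA_ASSUME_CUDA_AWARE_MPI}; jsrun -M -disable_gpu_hooks $*" >&2
  rc=0
  "$@" || rc=$?
  echo "jsrun return value: $rc" >&2
  return "$rc"
}

# Count the extra lines with the variable unset and with it set to 0.
# Each call runs in a subshell so the variable does not leak between runs.
unset TPETRA_ASSUME_CUDA_AWARE_MPI
lines_unset=$( (model_wrapper true) 2>&1 | wc -l | tr -d ' ')
lines_set=$( (TPETRA_ASSUME_CUDA_AWARE_MPI=0; export TPETRA_ASSUME_CUDA_AWARE_MPI; model_wrapper true) 2>&1 | wc -l | tr -d ' ')

echo "extra lines when unset: $lines_unset, when set: $lines_set"
```

Under this model, setting the variable removes only the `WARNING` line. The `BEFORE:`, `AFTER:`, and `jsrun return value:` lines remain, so a diff of the whole stream would still fail.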
> I realize that a test that depends on stdout/stderr output is not robust, and I am looking at modifying the code so that the desired information is written to a file instead of stderr/stdout, but I also think that defining that environment variable would help quiet the warning
Hello @gsjaardema, I have found that trying to diff the entire output of STDOUT or even a big file is a bit fragile. It is usually better, if you can, to grep for specific key things in some output or a file (unless that output is small and very controlled in which case you can diff the entire thing). And you also have to be careful about race conditions that happen a lot with STDOUT. (I have noticed that some of the SEACAS tests randomly fail with jumbled STDOUT that break the diffs.) It can take a good bit of work to create a strong, robust and portable test suite.
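As a rough illustration of the grep-based approach, the sketch below checks for a few key strings instead of diffing the whole output. The output lines and the key strings (`rows = 3`, `cols = 2`) are hypothetical placeholders, not text from any actual SEACAS test.

```shell
# Sketch only: pass if every key string appears somewhere in the output,
# regardless of extra or reordered lines around it.
output_file=$(mktemp)

# Hypothetical captured output: wrapper chatter mixed with the lines of interest.
printf '%s\n' \
  "BEFORE: jsrun '-p' '1' some_test" \
  'cols = 2' \
  'unrelated informational message' \
  'rows = 3' \
  'jsrun return value: 0' > "$output_file"

missing=0
for key in 'rows = 3' 'cols = 2'; do
  # -F matches the key as a fixed string; -q suppresses output.
  if ! grep -q -F -- "$key" "$output_file"; then
    echo "missing expected text: $key"
    missing=1
  fi
done

echo "missing=$missing"
rm -f "$output_file"
```

This check tolerates extra lines and reordering, at the cost of not noticing unexpected output that a full diff would catch.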
@gsjaardema: Sorry for the delay; I've been out for a couple of weeks. I see grover is reporting twif=12, but the detailed test results appear to be truncated. I will file a separate issue to look into this. In the meantime, I see the following:
To close this out would you please address the remaining failures related to this issue? The test failures unrelated to this issue will be ticketed via the triaging process once 8480 is closed.
CC: @trilinos/seacas @kddevin (Trilinos Data Services Product Lead), @gsjaardema, @e10harvey, @trilinos/framework