Closed bartlettroscoe closed 3 years ago
@bartlettroscoe that test cuda.debug_pin_um_to_host compares timing results to determine a "pass" criterion, but it is fragile and I think it needs to be revisited. Can the tests cuda.debug_pin_um_to_host and cuda.debug_serial_execution be disabled in the ATDM builds until a better criterion is put in place for the test?
Cross-referencing kokkos/kokkos#2506
The test:
KokkosCore_UnitTest_Cuda_MPI_1
failed in the build:
Trilinos-atdm-waterman-cuda-9.2-release-debug
yesterday as shown here showing:
[ RUN ] cuda.atomics
Loop<N10TestAtomic11SuperScalarILi4EEE>( test = 3 FAILED : { 4950, 9900, 14850, 19800} != { 4854, 9708, 14562, 19416}
/home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/unit_test/TestAtomic.hpp:560: Failure
Value of: (TestAtomic::Loop<TestAtomic::SuperScalar<4>, TEST_EXECSPACE>(100, 3))
Actual: false
Expected: true
[ FAILED ] cuda.atomics (754 ms)
@ndellingwood, is this another timing problem? Should we expect to be seeing more random failures like this?
@bartlettroscoe I don't think that test (cuda.atomics) relies on timing data (I'll take a look to confirm). Let me test your build as well as a kokkos-only version and check if there can be anything random about this failure; afterwards I'll file a bug report as necessary.
I did not reproduce failure of that test within a Kokkos VOTD develop branch nor Trilinos VOTD develop branch, and I see nothing in the test depending on unreliable pass/fail criteria that would cause randomness in results.
@crtrott can running multiple tests on a GPU using ctest -j N result in resource contention that might cause the atomics test to fail as posted above https://github.com/trilinos/Trilinos/issues/6799#issuecomment-587464443 ?
I tried something simple to see if I could reproduce, launched a job on Waterman where I ran the test 10000 times in a Kokkos build and in a Trilinos build (using the ATDM environment configuration provided earlier) but saw no occurrences of the failure.
@ndellingwood, it may only occur when running it with all of the other tests. I have updated the instructions as such.
@ndellingwood, could this happen when multiple kernels are running on the same GPU at the same time?
> could this happen when multiple kernels are running on the same GPU at the same time

@bartlettroscoe I'm not certain; it's not clear to me whether this could impact the atomic operations or disrupt something in the test that's pounding the atomics. @crtrott any thoughts?
FYI: As shown in this query, there are more builds that show the error which include:
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt (vortex)
Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi (vortex)
Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug (ascigpu14)
Trilinos-atdm-waterman-cuda-9.2-opt (waterman)
Trilinos-atdm-waterman-cuda-9.2-release-debug (waterman)
Trilinos-atdm-waterman_cuda-9.2_fpic_static_opt (waterman)
Trilinos-atdm-waterman_cuda-9.2_shared_opt (waterman)
This is not a fluke.
Again, the error is in the unit test cuda.debug_pin_um_to_host and looks like:
[ RUN ] cuda.debug_pin_um_to_host
Time CudaSpace: 0.052340 CudaUVMSpace_1: 0.052490 CudaUVMSpace_2: 0.049860 CudaPinnedHostSpace: 0.096996 CudaUVMSpace_Pinned: 0.052756
/vscratch1/jenkins/vortex-slave/workspace/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt-exp/SRC_AND_BUILD/Trilinos/packages/kokkos/core/unit_test/cuda/TestCuda_DebugPinUVMSpace.cpp:127: Failure
Value of: passed
Actual: false
Expected: true
[ FAILED ] cuda.debug_pin_um_to_host (378 ms)
FYI: As shown in this query, we are also seeing failures in the unit test cuda.debug_serial_execution showing:
[ RUN ] cuda.debug_serial_execution
Time For1: 0.001218 For2: 0.001222 ForSerial: 0.009890
/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/unit_test/cuda/TestCuda_DebugSerialExecution.cpp:140: Failure
Value of: passed_par_for
Actual: false
Expected: true
[ FAILED ] cuda.debug_serial_execution (48 ms)
with history:
Site | Build Name | Test Name | Status | Time | Proc Time | Details | Build Time | Processors |
---|---|---|---|---|---|---|---|---|
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Failed | 1m 28s 520ms | 1m 28s 520ms | Completed (Failed) | 2020-02-23T03:20:45 MST | 1 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Failed | 4m 35s 770ms | 4m 35s 770ms | Completed (Failed) | 2020-02-21T03:06:35 MST | 1 |
FYI: As shown in this query and this query, this test:
KokkosCore_UnitTest_Cuda_MPI_1
is also now failing every testing day starting 2020-03-22 in the build:
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
showing the errors:
[ RUN ] cuda.debug_serial_execution
Time Scan1: 0.023175 Scan2: 0.004769 ScanSerial: 0.023451
/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release/SRC_AND_BUILD/Trilinos/packages/kokkos/core/unit_test/cuda/TestCuda_DebugSerialExecution.cpp:192: Failure
Value of: passed_par_scan
Actual: false
Expected: true
[ FAILED ] cuda.debug_serial_execution (235 ms)
and
[ RUN ] cuda.debug_pin_um_to_host
Time CudaSpace: 0.063499 CudaUVMSpace_1: 0.279313 CudaUVMSpace_2: 0.077930 CudaPinnedHostSpace: 1.058533 CudaUVMSpace_Pinned: 0.077453
/home/jenkins/ride/workspace/Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release/SRC_AND_BUILD/Trilinos/packages/kokkos/core/unit_test/cuda/TestCuda_DebugPinUVMSpace.cpp:127: Failure
Value of: passed
Actual: false
Expected: true
[ FAILED ] cuda.debug_pin_um_to_host (1637 ms)
> @bartlettroscoe that test cuda.debug_pin_um_to_host makes a comparison of time results to determine a "pass" criteria but is fragile and I think needs to be revisited, can the tests cuda.debug_pin_um_to_host and cuda.debug_serial_execution be disabled in the ATDM builds until a better criterion is put in place for the test?
@ndellingwood, sorry I missed this comment of yours from before.
Yes, we can disable just those unit tests for just the ATDM Trilinos builds (or just the CUDA builds) as described in:
I think we likely just want to disable these for all ATDM Trilinos CUDA builds? If that is the case, the instructions for doing that are in:
and use the CMake cache var <full_test_name>_EXTRA_ARGS. See examples of this in:
$ cd Trilinos/
$ find cmake/std/atdm/ -name "*.cmake" -exec grep -nH "_EXTRA_ARGS" {} \;
cmake/std/atdm/ride/tweaks/CUDA-10.0_GNU-7.4.0_DEBUG_CUDA_POWER8_KEPLER37.cmake:5:ATDM_SET_CACHE(KokkosKernels_sparse_serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/ride/tweaks/CUDA-9.2_GNU-7.2.0_DEBUG_CUDA_POWER8_KEPLER37.cmake:5:ATDM_SET_CACHE(KokkosContainers_UnitTest_Serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/ride/tweaks/CUDA-9.2_GNU-7.2.0_DEBUG_CUDA_POWER8_KEPLER37.cmake:8:ATDM_SET_CACHE(KokkosKernels_graph_serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/ride/tweaks/CUDA-9.2_GNU-7.2.0_DEBUG_CUDA_POWER8_KEPLER37.cmake:11:ATDM_SET_CACHE(KokkosKernels_sparse_serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/ride/tweaks/GNU-7.2.0_DEBUG_OPENMP_POWER8.cmake:8:ATDM_SET_CACHE(KokkosContainers_UnitTest_Serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/ride/tweaks/GNU-7.2.0_DEBUG_OPENMP_POWER8.cmake:11:ATDM_SET_CACHE(KokkosContainers_UnitTest_OpenMP_MPI_1_EXTRA_ARGS
cmake/std/atdm/ride/tweaks/GNU-7.2.0_DEBUG_OPENMP_POWER8.cmake:14:ATDM_SET_CACHE(KokkosKernels_graph_serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/ride/tweaks/GNU-7.2.0_DEBUG_OPENMP_POWER8.cmake:17:ATDM_SET_CACHE(KokkosKernels_sparse_openmp_MPI_1_EXTRA_ARGS
cmake/std/atdm/ride/tweaks/GNU-7.2.0_DEBUG_OPENMP_POWER8.cmake:20:ATDM_SET_CACHE(KokkosKernels_sparse_serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/ride/tweaks/GNU-7.4.0_DEBUG_OPENMP_POWER8.cmake:5:ATDM_SET_CACHE(KokkosKernels_sparse_serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/shiller/tweaks/GNU_DEBUG_SERIAL_HSW.cmake:5:ATDM_SET_CACHE(KokkosKernels_sparse_serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/shiller/tweaks/CUDA-9.0_DEBUG_CUDA_KEPLER37.cmake:2:ATDM_SET_CACHE(KokkosKernels_sparse_serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/shiller/tweaks/INTEL_DEBUG_OPENMP_HSW.cmake:2:ATDM_SET_CACHE(KokkosKernels_sparse_serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/shiller/tweaks/INTEL_DEBUG_SERIAL_HSW.cmake:4:ATDM_SET_CACHE(KokkosKernels_sparse_serial_MPI_1_EXTRA_ARGS
cmake/std/atdm/waterman/tweaks/CUDA-9.2_DEBUG_CUDA_POWER9_VOLTA70.cmake:8:ATDM_SET_CACHE(KokkosKernels_sparse_serial_MPI_1_EXTRA_ARGS
But I think you want to put this in the file:
Trilinos/cmake/std/atdm/ATDMDisables.cmake
in the if block for CUDA builds and just disable these unit tests in all CUDA builds
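As a rough illustration of the suggestion above, a disable of this kind might look like the fragment below. This is a sketch under stated assumptions, not the contents of any actual commit: the `ATDM_SET_CACHE(<var> <value> CACHE <type>)` form is inferred from the tweak files listed above, and the guard condition on `ATDM_NODE_TYPE` is an assumption about how the CUDA block in ATDMDisables.cmake is written. The `--gtest_filter` value uses GoogleTest's negative-pattern syntax, where everything after the leading `-` is excluded and `:` separates patterns.

```cmake
# Sketch only. The guard below is an assumption about how the CUDA-specific
# block in ATDMDisables.cmake is written; adapt it to the actual file.
IF (ATDM_NODE_TYPE STREQUAL "CUDA")

  # Pass a negative GoogleTest filter to the test executable so that the two
  # timing-based unit tests are skipped while all other unit tests still run.
  ATDM_SET_CACHE(KokkosCore_UnitTest_Cuda_MPI_1_EXTRA_ARGS
    "--gtest_filter=-cuda.debug_pin_um_to_host:cuda.debug_serial_execution"
    CACHE STRING)

ENDIF()
```

Filtering at the GoogleTest level keeps the rest of KokkosCore_UnitTest_Cuda_MPI_1 reporting to CDash, rather than disabling the whole executable.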
FYI: As shown here, we also saw this test:
KokkosCore_UnitTest_Cuda_MPI_1
failing today 2020-03-25 in the build:
Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug
showing:
[ RUN ] cuda.debug_pin_um_to_host
Time CudaSpace: 0.035394 CudaUVMSpace_1: 0.080651 CudaUVMSpace_2: 0.100436 CudaPinnedHostSpace: 0.555502 CudaUVMSpace_Pinned: 0.094931
/scratch/jenkins/ascicgpu14/workspace/Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug/SRC_AND_BUILD/Trilinos/packages/kokkos/core/unit_test/cuda/TestCuda_DebugPinUVMSpace.cpp:127: Failure
Value of: passed
Actual: false
Expected: true
[ FAILED ] cuda.debug_pin_um_to_host (999 ms)
That suggests that these unit tests should be disabled in all of the ATDM Trilinos CUDA builds for the time being.
@ndellingwood, do you just want me to do this and create the PR and have you review it?
> do you just want me to do this and create the PR and have you review it
@bartlettroscoe that would be great thanks, I pinged the corresponding kokkos issue as well with your comment.
@ndellingwood, okay, I assigned this issue to myself for now and will work to get these unit tests disabled.
I had forgotten that I had already created this issue. These failing tests have also been taking down PR testing iterations as shown in https://github.com/trilinos/Trilinos/issues/3276#issuecomment-631749385.
I will disable these individual unit tests in all ATDM Trilinos CUDA builds and in the Trilinos PR CUDA build.
These two unit tests are disabled in all ATDM Trilinos CUDA PR builds in PR #7407, which has been merged to 'atdm-nightly' in commit 4804b08.
Putting in review.
Tests with issue trackers Passed: twip=3
Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Passed | Completed | 20 | 0 | 20 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Passed | Completed | 20 | 0 | 20 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_Cuda_MPI_1 | Passed | Completed | 20 | 0 | 20 | #6799 |
This is an automated comment generated by Grover. Each week, Grover collates and reports data from CDash in an automated way to make it easier for developers to stay on top of their issues. Grover saw that there are tests being tracked on CDash that are associated with this open issue. If you have a question, please reach out to Ross. I'm just a cat.
Tests with issue trackers Passed: twip=5
Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_Cuda_MPI_1 | Passed | Completed | 4 | 1 | 17 | #6799 |
vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_Cuda_MPI_1 | Passed | Completed | 5 | 1 | 18 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Passed | Completed | 26 | 0 | 26 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Passed | Completed | 26 | 0 | 26 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_Cuda_MPI_1 | Passed | Completed | 26 | 0 | 26 | #6799 |
Tests with issue trackers Missing: twim=4
Site | Build Name | Test Name | Status | Details | Consecutive Missing Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 7 | 1 | 14 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 5 | 0 | 24 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 5 | 0 | 24 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 5 | 0 | 24 | #6799 |
Tests with issue trackers Missing: twim=5
Site | Build Name | Test Name | Status | Details | Consecutive Missing Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 12 | 1 | 7 | #6799 |
vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 14 | 1 | 7 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 12 | 0 | 17 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 12 | 0 | 17 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 12 | 0 | 17 | #6799 |
Tests with issue trackers Missing: twim=3
Site | Build Name | Test Name | Status | Details | Consecutive Missing Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 19 | 0 | 10 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 19 | 0 | 10 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 19 | 0 | 10 | #6799 |
Tests with issue trackers Missing: twim=5
Site | Build Name | Test Name | Status | Details | Consecutive Missing Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 26 | 0 | 3 | #6799 |
vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 28 | 0 | 1 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 26 | 0 | 4 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 26 | 0 | 4 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_Cuda_MPI_1 | Missing | Missing | 26 | 0 | 4 | #6799 |
Just as a comment: these tests are doing critical testing of whether the fencing behavior of Kokkos is correct, something Tpetra has been complaining a lot about. The only way to check whether fencing behavior is correct without some external tools running is timing.
@crtrott, if you compare the Grover comments from 4 weeks ago to 3 weeks ago, you can see these tests went from passing to missing. I did not do anything so it may be worth someone looking into why that happened. (If Kokkos developers can't figure that out, I can help with that. It should be possible to determine this from just looking at results on CDash.) Also note that we are getting a lot of missing test results from the ATS-2 machine 'vortex' lately (see ATDV-396).
Does missing just mean fail? 4 weeks ago Kokkos 3.2 came in and we fixed more fencing issues (i.e. over fencing) largely reported by Tpetra. These tests are now testing that we don't screw up the Tpetra use cases again.
> Does missing just mean fail?
@crtrott, no, it means that build X did not report any test results for that testing day. For example, what the Grover report today says is that those 5 tests were not included in the test results posted to CDash for those builds. Make sense?
Did those tests change their names in any way with the Kokkos 3.2 upgrade? If so, then that would result in them being reported as missing.
Actually they might have. We split executables into multiple ones recently, I bet it's now CUDA_1_MPI_1 or so (or maybe CUDA_2), not sure where the specific tests we were looking for went. But basically we split that up.

> Actually they might have. We split executables into multiple ones recently, I bet it's now CUDA_1_MPI_1 or so (or maybe CUDA_2), not sure where the specific tests we were looking for went. But basically we split that up.
@trilinos/kokkos, @crtrott,
As shown in this query, it looks like the unit test KokkosCore_UnitTest_Cuda_MPI_1 was split into 3 unit tests:
KokkosCore_UnitTest_Cuda1_MPI_1
KokkosCore_UnitTest_Cuda2_MPI_1
KokkosCore_UnitTest_Cuda3_MPI_1
starting on testing day 2020-08-25.
As shown in this query and this query (click "Show Matching Output"), it looks like the unit tests cuda.debug_pin_um_to_host and cuda.debug_serial_execution in the test KokkosCore_UnitTest_Cuda3_MPI_1 are still randomly failing in the builds:
with recent history:
Site | Build Name | Test Name | Status | Time | Proc Time | Details | Build Time | Processors |
---|---|---|---|---|---|---|---|---|
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 5s 330ms | 5s 330ms | Completed (Failed) | 2020-09-17T23:57:12 MDT | 1 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 4s 640ms | 4s 640ms | Completed (Failed) | 2020-09-13T23:53:51 MDT | 1 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 4s 900ms | 4s 900ms | Completed (Failed) | 2020-09-10T23:53:47 MDT | 1 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 4s 960ms | 4s 960ms | Completed (Failed) | 2020-09-05T23:53:57 MDT | 1 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 3s 980ms | 3s 980ms | Completed (Failed) | 2020-09-20T22:45:10 MDT | 1 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 3s 710ms | 3s 710ms | Completed (Failed) | 2020-09-17T22:45:15 MDT | 1 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 4s 570ms | 4s 570ms | Completed (Failed) | 2020-09-12T22:45:11 MDT | 1 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 3s 910ms | 3s 910ms | Completed (Failed) | 2020-09-03T22:45:13 MDT | 1 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 5s 330ms | 5s 330ms | Completed (Failed) | 2020-09-18T01:34:42 MDT | 1 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 5s 510ms | 5s 510ms | Completed (Failed) | 2020-09-09T06:39:15 MDT | 1 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 5s 270ms | 5s 270ms | Completed (Failed) | 2020-08-30T01:20:20 MDT | 1 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_opt | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 6s 930ms | 6s 930ms | Completed (Failed) | 2020-09-07T01:18:16 MDT | 1 |
sems-rhel7 | Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 1s 410ms | 1s 410ms | Completed (Failed) | 2020-09-12T04:33:13 MDT | 1 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 3s 770ms | 3s 770ms | Completed (Failed) | 2020-09-16T05:42:20 MDT | 1 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | 5s 230ms | 5s 230ms | Completed (Failed) | 2020-09-14T04:16:45 MDT | 1 |
showing errors like:
[ RUN ] cuda.debug_pin_um_to_host
Time CudaSpace: 0.048639 CudaUVMSpace_1: 0.243198 CudaUVMSpace_2: 0.056822 CudaPinnedHostSpace: 0.874996 CudaUVMSpace_Pinned: 0.248510
/scratch/atdm-devops-admin/atdm-trilinos-nightly-builds/Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg/SRC_AND_BUILD/Trilinos/packages/kokkos/core/unit_test/cuda/TestCuda_DebugPinUVMSpace.cpp:128: Failure
Value of: passed
Actual: false
Expected: true
[ FAILED ] cuda.debug_pin_um_to_host (1615 ms)
and
[ RUN ] cuda.debug_serial_execution
Time For1: 0.000495 For2: 0.000213 ForSerial: 0.000473
/scratch/atdm-devops-admin/atdm-trilinos-nightly-builds/Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg/SRC_AND_BUILD/Trilinos/packages/kokkos/core/unit_test/cuda/TestCuda_DebugSerialExecution.cpp:141: Failure
Value of: passed_par_for
Actual: false
Expected: true
[ FAILED ] cuda.debug_serial_execution (6 ms)
I will update the CSV file entries in the ATDM Trilinos Status repo used by Grover to associate these tests with this issue.
FYI: The update of the tracking of these randomly failing unit tests is shown in the commit:
@trilinos/kokkos, @crtrott, @ndellingwood,
Is pass/fail for these unit tests still based on timings? It looks like it might be (but hard to tell from the unit test output).
Yes it should be. It's testing for asynchronicity (i.e. that there aren't fences where there shouldn't be) for certain API stuff.

> Yes it should be. It's testing for asynchronicity (i.e. that there aren't fences where there shouldn't be) for certain API stuff.
@crtrott, is there any way to make these tests more robust? Not good to have randomly failing tests.
We can take another look, but this is as robust as I could make them. The problem is that if something else is hogging the GPU and the test takes a bit to launch, all the timing is gonna be off by an arbitrarily large amount, i.e. there is no timing-based criterion that ever could pass reliably. These tests already are just comparing two timings which are collected during the test, stuff like with fence vs without fence, with expected differences usually being large (>4x) and the criteria being somewhere around 2x. Similar for the tests deducing the correct memory spaces. I mean the differences there should also be >5x and our criteria for passing are much smaller than that. I think these tests just have to run on their own and if you can't do that, then they need to be disabled and filtered out. But in general they test for semantics, not for a specific performance. They are definitely not designed to catch small performance issues, they are only designed to test whether we do something fundamentally unexpected.
> We can take another look, but this is as robust as I could make them. ... They are definitely not designed to catch small performance issues, they are only designed to test whether we do something fundamentally unexpected.
@crtrott, but any times on a loaded GPU can vary widely depending on what else is running at the same time. Can we aggregate the unit tests based on timings into their own separate unit test executable and then mark those tests with RUN_SERIAL so that they always run by themselves with ctest? Tests that don't depend on timing can still be run at the same time so the total wall-clock time goes down.
Actually, as shown in this query, this test KokkosCore_UnitTest_Cuda3_MPI_1 finishes in less than 6 seconds. Therefore, can we just pass in RUN_SERIAL to the TRIBITS_ADD_TEST() command?
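For reference, a minimal sketch of what passing RUN_SERIAL could look like. The executable root name, test name, and process count here are illustrative placeholders, not copied from the Kokkos CMakeLists.txt; only the `RUN_SERIAL` option itself is the point. TriBITS forwards it to the CTest `RUN_SERIAL` test property, which tells ctest not to run any other test concurrently with this one even under `ctest -j N`.

```cmake
# Sketch only: names and values below are placeholders for illustration.
TRIBITS_ADD_TEST(
  UnitTest_Cuda3          # hypothetical executable root name
  NAME UnitTest_Cuda3     # hypothetical test name
  NUM_MPI_PROCS 1
  RUN_SERIAL              # ctest runs this test by itself, never alongside others
  )
```

Note that RUN_SERIAL only isolates the test from other tests in the same ctest invocation; it cannot prevent contention from unrelated jobs sharing the same GPU.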
Sure we can do that. @ndellingwood Nathan can you move all the relevant tests which use this timing stuff to Cuda3?
> move all the relevant tests which use this timing stuff to Cuda3?
@crtrott and @ndellingwood, just a suggestion, but you might want to call that test something like KokkosCore_UnitTest_CudaTimingBased to make it clear these are based on timing things?
> Nathan can you move all the relevant tests which use this timing stuff to Cuda3?
@crtrott yes, I can group the tests together, and if you like I'll place them in a new test with a descriptive name like @bartlettroscoe mentioned
I opened kokkos/kokkos#3405 and self-assigned with the request to group these time-based tests into a common executable
Tests with issue trackers Passed: twip=7
Tests with issue trackers Passed: twip=6
Tests with issue trackers Passed: twip=6
Tests with issue trackers Failed: twif=1
Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 4 | 3 | 27 | #6799 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 9 | 5 | 25 | #6799 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_opt | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 8 | 3 | 27 | #6799 |
sems-rhel7 | Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 5 | 2 | 23 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 26 | 1 | 28 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 25 | 1 | 29 | #6799 |
Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
cee-rhel6 | Trilinos-atdm-cee-rhel6_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | Completed (Failed) | 1 | 5 | 25 | #6799 |
Tests with issue trackers Passed: twip=7
Tests with issue trackers Passed: twip=6
Tests with issue trackers Failed: twif=1
Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 9 | 2 | 28 | #6799 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 14 | 4 | 26 | #6799 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_opt | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 22 | 2 | 28 | #6799 |
sems-rhel7 | Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 6 | 2 | 26 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 28 | 0 | 28 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 28 | 0 | 28 | #6799 |
Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Failed | Completed (Failed) | 1 | 2 | 28 | #6799 |
Tests with issue trackers Passed: twip=6
Tests with issue trackers Missing: twim=1
Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 7 | 2 | 28 | #6799 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_dbg | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 4 | 3 | 27 | #6799 |
cee-rhel6 | Trilinos-atdm-cee-rhel6_mini_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_static_opt | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 3 | 2 | 28 | #6799 |
sems-rhel7 | Trilinos-atdm-sems-rhel7-cuda-9.2-Volta70-complex-shared-release-debug | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 13 | 2 | 26 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 28 | 0 | 28 | #6799 |
ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_Cuda3_MPI_1 | Passed | Completed | 28 | 0 | 28 | #6799 |
Site | Build Name | Test Name | Status | Details | Consecutive Missing Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
cee-rhel6 | Trilinos-atdm-cee-rhel6_cuda-10.1.243_gcc-7.2.0_openmpi-4.0.3_shared_opt | KokkosCore_UnitTest_Cuda3_MPI_1 | Missing | Missing | -1 | 1 | 30 | #6799 |
Tests with issue trackers Passed: twip=7
Tests with issue trackers Passed: twip=7
Tests with issue trackers Passed: twip=7
Tests with issue trackers Passed: twip=7
Tests with issue trackers Passed: twip=7
CC: @trilinos/kokkos, @kddevin (Trilinos Data Services Product Lead)