@fryeguy52 I will look at this problem today.
@fryeguy52 I followed the instructions to reproduce the error, but I could not reproduce it on 'white'. Could you double check whether you can reproduce it? I used:
```
[kyukim @white11] master > git remote -v
origin https://github.com/trilinos/Trilinos.git (fetch)
origin https://github.com/trilinos/Trilinos.git (push)
[kyukim @white11] master > git branch
* develop
master
[kyukim @white11] master > git log
commit 01fb63caf88db53491f12afe0497c9d8f2cde09f
Merge: 4c4ecbf 3f3b1cc
Author: Mark Hoemmen <mhoemmen@users.noreply.github.com>
Date: Mon Jan 21 13:12:15 2019 -0700
Merge pull request #4224 from trilinos/Fix-4220
MiniTensor: Attempt to fix #4220
```
This is the test output from 'white':
```
[kyukim @white11] atdm > bsub -x -Is -q rhel7F -n 16 ctest -j16
***Forced exclusive execution
Job <43094> is submitted to queue <rhel7F>.
<<Waiting for dispatch ...>>
<<Starting on white22>>
Test project /ascldap/users/kyukim/Work/lib/trilinos/build/white/atdm
Start 1: TpetraCore_Behavior_Default_MPI_4
Start 2: TpetraCore_Behavior_Named_MPI_4
Start 3: TpetraCore_Behavior_Off_MPI_4
Start 4: TpetraCore_Behavior_On_MPI_4
1/194 Test #1: TpetraCore_Behavior_Default_MPI_4 ........................................................... Passed 0.91 sec
Start 5: TpetraCore_gemv_MPI_1
Start 6: TpetraCore_gemm_m_eq_1_MPI_1
Start 7: TpetraCore_gemm_m_eq_2_MPI_1
Start 8: TpetraCore_gemm_m_eq_5_MPI_1
2/194 Test #2: TpetraCore_Behavior_Named_MPI_4 ............................................................. Passed 0.92 sec
Start 9: TpetraCore_gemm_m_eq_13_MPI_1
Start 11: TpetraCore_BlockMultiVector2_MPI_1
Start 14: TpetraCore_BlockView_MPI_1
Start 15: TpetraCore_BlockOps_MPI_1
3/194 Test #3: TpetraCore_Behavior_Off_MPI_4 ............................................................... Passed 0.93 sec
Start 10: TpetraCore_BlockMultiVector_MPI_4
4/194 Test #4: TpetraCore_Behavior_On_MPI_4 ................................................................ Passed 0.96 sec
Start 12: TpetraCore_BlockCrsMatrix_MPI_4
5/194 Test #15: TpetraCore_BlockOps_MPI_1 ................................................................... Passed 1.39 sec
Start 16: TpetraCore_BlockExpNamespace_MPI_1
6/194 Test #14: TpetraCore_BlockView_MPI_1 .................................................................. Passed 2.53 sec
Start 31: TpetraCore_Core_initialize_where_tpetra_initializes_kokkos_MPI_1
7/194 Test #11: TpetraCore_BlockMultiVector2_MPI_1 .......................................................... Passed 2.83 sec
Start 32: TpetraCore_Core_ScopeGuard_where_tpetra_initializes_kokkos_MPI_1
8/194 Test #10: TpetraCore_BlockMultiVector_MPI_4 ........................................................... Passed 2.82 sec
Start 13: TpetraCore_BlockMap_MPI_4
9/194 Test #16: TpetraCore_BlockExpNamespace_MPI_1 .......................................................... Passed 1.63 sec
Start 33: TpetraCore_Core_initialize_where_user_initializes_kokkos_MPI_1
10/194 Test #31: TpetraCore_Core_initialize_where_tpetra_initializes_kokkos_MPI_1 ............................ Passed 0.81 sec
Start 34: TpetraCore_Core_ScopeGuard_where_user_initializes_kokkos_MPI_1
11/194 Test #33: TpetraCore_Core_initialize_where_user_initializes_kokkos_MPI_1 .............................. Passed 1.17 sec
Start 39: TpetraCore_issue_434_already_initialized_MPI_1
12/194 Test #12: TpetraCore_BlockCrsMatrix_MPI_4 ............................................................. Passed 4.19 sec
...
100% tests passed, 0 tests failed out of 194
Subproject Time Summary:
Tpetra = 1492.44 sec*proc (194 tests)
Total Test time (real) = 94.37 sec
```
@kyungjoo-kim Thanks for looking into this. I will try to reproduce it and will watch what it does in tonight's testing.
@kyungjoo-kim and @fryeguy52,
I just logged onto 'white' really quickly and pulled Trilinos 'develop' as of commit 3ef91e9:
```
3ef91e9 "Merge Pull Request #4253 from trilinos/Trilinos/Fix-4234"
Author: trilinos-autotester <trilinos-autotester@trilinos.org>
Date: Thu Jan 24 08:15:36 2019 -0700 (7 hours ago)
```
and following the instructions here I ran:
```
$ bsub -x -I -q rhel7F -n 16 \
./checkin-test-atdm.sh cuda-9.2-gnu-7.2.0-release-debug --enable-packages=TpetraCore --local-do-all
```
and it returned:
```
FAILED (NOT READY TO PUSH): Trilinos: white26
Thu Jan 24 15:06:51 MST 2019
Enabled Packages: TpetraCore
Build test results:
-------------------
0) MPI_RELEASE_DEBUG_SHARED_PT_OPENMP => Test case MPI_RELEASE_DEBUG_SHARED_PT_OPENMP was not run! => Does not affect push readiness! (-1.00 min)
1) cuda-9.2-gnu-7.2.0-release-debug => FAILED: passed=193,notpassed=1 => Not ready to push! (16.65 min)
REQUESTED ACTIONS: FAILED
```
The detailed test results showed:
```
$ grep -A 100 "failed out of" cuda-9.2-gnu-7.2.0-release-debug/ctest.out
99% tests passed, 1 tests failed out of 194
Subproject Time Summary:
Tpetra = 1094.70 sec*proc (194 tests)
Total Test time (real) = 139.04 sec
The following tests FAILED:
12 - TpetraCore_BlockCrsMatrix_MPI_4 (Failed)
Errors while running CTest
```
I pulled again and tested, but I cannot reproduce the error with commit 3ea64d1.
@kyungjoo-kim said:
I pulled again and tested, but I cannot reproduce the error with commit 3ea64d1.
Let's wait and see if @fryeguy52 can reproduce this on 'white' in his own account and go from there.
Note that #4293 disabled this test (via TpetraCore_BlockCrsMatrix_MPI_4_DISABLE); we'll need to re-enable it when this work is complete.
@kddevin said:
Note that #4293 disabled this test (via TpetraCore_BlockCrsMatrix_MPI_4_DISABLE); we'll need to re-enable it when this work is complete.
PR #4293 only disables that test for the CUDA PR build, not the ATDM Trilinos builds. (There is no relationship between these two sets of builds and that is on purpose.)
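For reference, the disable mentioned above uses the TriBITS per-test disable variable named earlier in this thread (TpetraCore_BlockCrsMatrix_MPI_4_DISABLE). Below is a minimal sketch of toggling such a variable at configure time in an already configured build directory; the use of a plain cmake reconfigure here is an assumption for illustration, not the exact mechanism PR #4293 uses:
```
# Hedged sketch: TpetraCore_BlockCrsMatrix_MPI_4_DISABLE follows the TriBITS
# <fullTestName>_DISABLE convention; setting it ON omits the test at configure
# time, and setting it back OFF re-enables it.

# Disable the test in an existing build directory:
cmake -DTpetraCore_BlockCrsMatrix_MPI_4_DISABLE=ON .

# Re-enable it again:
cmake -DTpetraCore_BlockCrsMatrix_MPI_4_DISABLE=OFF .
```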
The question is whether this failing test is something that should be fixed before the ATDM APPs get an updated version of Trilinos. Right now it is listed as ATDM Sev: Blocker. (But my guess is that EMPIRE is not being impacted by this, because we would have heard about it.) Is this a real defect in Tpetra or just a problem with the test?
From the failed test message:
```
unpackRow: numBytesOut = 4 != numBytes = 156.
[FAILED] (0.106 sec) BlockCrsMatrix_double_int_int_Kokkos_Compat_KokkosCudaWrapperNode_write_UnitTest
Location: /home/jenkins/waterman/workspace/Trilinos-atdm-waterman-cuda-9.2-debug/SRC_AND_BUILD/Trilinos/packages/tpetra/core/test/Block/BlockCrsMatrix.cpp:859
[waterman3:115046] *** An error occurred in MPI_Allreduce
[waterman3:115046] *** reported by process [3797417985,0]
[waterman3:115046] *** on communicator MPI_COMM_WORLD
[waterman3:115046] *** MPI_ERR_OTHER: known error not in list
[waterman3:115046] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[waterman3:115046] *** and potentially your MPI job)
```
I first examined the packing and unpacking routines for a mistake in the DualView sync, since the input array to the unpacking, imports, is all zeros. Then I see the Allreduce error. I am not sure which error triggers which. It is possible that MPI_Allreduce hits an error and that corrupts the importer, but the other direction is also possible: something is not synced from the device, and that causes the MPI error. In another BlockCrs test, creating a map (which is not related to the BlockCrs code itself; it just happens inside the BlockCrs test) also fails in MPI_Allreduce.
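Since it is unclear which of the two errors triggers the other, one way to narrow this down is to re-run only this test with its full output captured. A minimal sketch, reusing the bsub and ctest options shown earlier in this thread and run from the ATDM build directory (this exact combination is an assumption, not a command from the original report):
```
# Re-run only the failing BlockCrs test, keeping its full output so the order
# of the unpackRow message and the MPI_Allreduce abort can be inspected.
# Queue name and processor count mirror the interactive run shown above.
bsub -x -Is -q rhel7F -n 16 \
  ctest -R TpetraCore_BlockCrsMatrix_MPI_4 --output-on-failure
```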
@bartlettroscoe PR #4307 will fix the BlockCrs unit test failures. After the PR is merged, would you please re-enable the test?
@kyungjoo-kim said:
@bartlettroscoe PR #4307 will fix the BlockCrs unit test failures. After the PR is merged, would you please re-enable the test?
Thanks for the fix!
Someone will need to revert PR #4293 after we confirm that the test is fixed in the ATDM builds (where it was never disabled; PR #4293 only disabled it in the Trilinos PR build controlled by the @trilinos/framework team).
@kyungjoo-kim Can we reenable that test now?
With the merge of PR #4307 onto 'develop' on 2/4/2019, the test TpetraCore_BlockCrsMatrix_MPI_4 seems to be passing in all of the ATDM Trilinos builds as of 2/5/2019. See the table below.
I will leave this open until the CUDA PR testing gets this test enabled again by reverting PR #4293.
Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
---|---|---|---|---|---|---|---|---|
waterman | Trilinos-atdm-waterman-cuda-9.2-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 17 | #4257 |
waterman | Trilinos-atdm-waterman-cuda-9.2-opt | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 17 | #4257 |
waterman | Trilinos-atdm-waterman-cuda-9.2-release-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 15 | #4257 |
white | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 12 | 20 | #4257 |
white | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 10 | 17 | #4257 |
white | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | TpetraCore_BlockCrsMatrix_MPI_4 | Passed | Completed | 1 | 11 | 18 | #4257 |
FYI: I created the revert PR #4326 to re-enable this test in the Trilinos CUDA PR build. Just need someone to approve this PR and get it merged. Then we can close this issue.
Thanks Ross! I just approved the PR.
CC: @trilinos/tpetra, @kddevin (Trilinos Data Services Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
With the merge of PR #4307 onto 'develop' on 2/4/2019, the test TpetraCore_BlockCrsMatrix_MPI_4 seems to be passing in all of the ATDM Trilinos builds as of 2/5/2019. Next: get PR #4326 merged, which re-enables this test in the Trilinos CUDA PR build.
Description
As shown in this query, the test TpetraCore_BlockCrsMatrix_MPI_4 is failing in the ATDM CUDA builds on 'waterman' and 'white'/'ride' listed in the table above.
It is failing with the output shown earlier in this thread (the unpackRow error followed by the MPI_Allreduce abort).
@kyungjoo-kim can you see if one of these commits may have caused this?
Current Status on CDash
The current status of these tests/builds for the current testing day can be found here.
Steps to Reproduce
One should be able to reproduce this failure on ride or white as described in:
More specifically, the commands given for ride or white are provided at:
The exact commands to reproduce this issue should be:
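As a rough sketch, based only on the commands quoted earlier in this thread and assuming Trilinos 'develop' is checked out with an ATDM build directory and checkin-test-atdm.sh set up on 'white':
```
# Hedged sketch: the bsub, checkin-test-atdm.sh, and ctest invocations below
# are taken from earlier comments in this thread; the working-directory setup
# is an assumption.

# Configure, build, and test the CUDA release-debug build with only TpetraCore
# enabled, on a compute node:
bsub -x -I -q rhel7F -n 16 \
  ./checkin-test-atdm.sh cuda-9.2-gnu-7.2.0-release-debug \
    --enable-packages=TpetraCore --local-do-all

# Or, from an already configured build tree, run the test suite directly:
bsub -x -Is -q rhel7F -n 16 ctest -j16
```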